From unicode at unicode.org Mon Jan 7 02:13:37 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 7 Jan 2019 09:13:37 +0100 Subject: A last missing link for interoperable representation Message-ID: Previous discussions have already brought up how Unicode is supporting those languages that despite being old in Unicode still require special attention for their peculiar way of spacing punctuation or indicating abbreviations. Now I wonder whether s?t?r?e?s?s? can likewise be noted in plain text without non-traditional markup such as *?* or ?'? when a language does not accept extra acute accents for that purpose. One character we can think of is the combining underline. Like everything else?new letters, narrow no-break space, superscripts? the quality of the rendering depends on the fonts used on the computer. Strings containing U+0332 COMBINING LOW LINE to denote stress, as a replacement of italic, may be postprocessed to apply formatting, or used as-is if interoperability matters along with semantic accuracy. Best wishes, Marcel From unicode at unicode.org Mon Jan 7 21:46:57 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 8 Jan 2019 03:46:57 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: Message-ID: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> Living languages and writing systems evolve. Using the combining low line to show stress seems reasonable to me, perhaps because it was a typewriting convention I'm old enough to remember.? People unfamiliar with that convention should be able to figure out what's up from the c?o?n?t?e?x?t?.? Drawing a line under a word or a phrase certainly draws attention to it! (Apparently there's a recently evolved practice to use periods between words. To. Add. Emphasis.? Almost as if one is speaking v-e-r-y s-l-o-w-l-y in order to make a point.) End users probably consider the entire Unicode set to be their tool kit.? I've seen plain text screen names in both cursive and fraktur, thanks to the math alphanumerics.? The carefree user community seems unconcerned with the technical insistence that *those* characters should only be used in formulae. If, for example, ?????????????????? ?????????????????? can input her screen name in cursive, there's nothing stopping me from using ??????????????, if I'm so inclined. Making recommendations for the post processing of strings containing the combining low line strikes me as being outside the scope of Unicode, though.? Some users might prefer that such strings be rendered in *bold* and other users might prefer /italics/.? This user would prefer that combining low line always be rendered as combining low line. From unicode at unicode.org Mon Jan 7 23:32:41 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 7 Jan 2019 21:32:41 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> Message-ID: <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Jan 8 00:40:51 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 8 Jan 2019 07:40:51 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> Message-ID: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> On 08/01/2019 06:32, Asmus Freytag via Unicode wrote: > On 1/7/2019 7:46 PM, James Kass via Unicode wrote: >> Making recommendations for the post processing of strings containing the combining low line strikes me as being outside the scope of Unicode, though. > > Agreed. > > Those kinds of things are effectively "mark down" languages, a name chosen to define them as lighter weight alternatives to formal, especially SGML derived mark-up languages. > > Neither mark-up nor mark down languages are in scope. > My hinting about post processing was only a door open to those tagging my suggestion as a dirty hack. I was so anxious about angry feedback that I inverted the order of the two possible usages despite my preference for keeping the combining underline while using proper fonts, fully agreeing with James Kass. I was pointing that unlike rich text, enhanced capabilities of plain text do not hold the user captive. With rich text we need to stay in rich text, whereas the goal of this thread is to point ways of ensuring interoperability. The pitch is that if some languages are still considered ?needing? rich text where others are correctly represented in plain text (stress, abbreviations), the Standard needs to be updated in a way that it fully supports actually all languages. Having said that, still unsupported minority languages are top priority. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 8 01:18:10 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 7 Jan 2019 23:18:10 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 8 04:00:38 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 8 Jan 2019 10:00:38 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: Marcel Schneider wrote, > With rich text we need to stay in rich text, whereas the goal of > this thread is to point ways of ensuring interoperability. Both interoperability and legibility are factors.? The question might be:? How legible should Unicode be for Latin?barely legible, moderately legible, or extremely legible? The boundaries of plain text have advanced since the concept originated and will probably continue to do so.? Stress can currently be represented in plain text with conventions used in lieu of existing typographic practice.? 
Unicode can preserve texts created using the plain text kludges/conventions for marking stress, but cannot preserve printed texts which use standard publishing conventions for marking stress, such as italics. If Latin were a dead script being proposed for encoding now, it?s possible that certain script features currently considered to be merely stylistic variants best reserved for mark-up would be encoded atomically. Scripts added more recently to Unicode appear to have been encoded with the idea of preserving the standard writing and publishing conventions of the users.? It's only natural if some Latin script users want to push back the boundaries of Latin computer plain text accordingly. From unicode at unicode.org Tue Jan 8 15:11:07 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 8 Jan 2019 21:11:07 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: Asmus Freytag wrote, > ... > (for an extreme example there's an orthography > out there that uses @ as a letter -- we know that > won't work well with email addresses and duplicate > encoding of the @ shape is a complete non-starter). Everything's a non-starter.? Until it begins. Is this a casing orthography?? (Please see attached image.) We've seen where typewriter kludges enabled users to represent the glottal stop with a question mark (or a digit seven).? Unicode makes those kludges unnecessary. But we're still using typewriter kludges to represent stress in Latin script because there is no Unicode plain text solution. -------------- next part -------------- A non-text attachment was scrubbed... Name: NaturalCase.png Type: image/png Size: 3325 bytes Desc: not available URL: From unicode at unicode.org Tue Jan 8 15:28:46 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 8 Jan 2019 13:28:46 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: <51e22fd8-8478-d901-39e3-36f43d757eeb@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 8 15:43:08 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 8 Jan 2019 13:43:08 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> James, On 1/8/2019 1:11 PM, James Kass via Unicode wrote: > But we're still using typewriter kludges to represent stress in Latin > script because there is no Unicode plain text solution. O.k., that one needs a response. We are still using kludges to represent stress in the Latin script because *orthographies* for most languages customarily written with the Latin script don't have clear conventions for indicating stress as a part of the orthography. When an orthography has a well-developed convention for indicating stress, then we can look at how that convention is represented in the plain text representation of that orthography. 
An obvious case is notational systems for the representation of pronunciation of English words in dictionaries. Those conventions *do* then have plain text representations in Unicode, because, well, they just have various additional characters and/or combining marks to clearly indicate lexical stress. But standard written English orthography does *not*. (BTW, that is in part because marking stress in written English would usually *decrease* legibility and the usefulness of the writing, rather than improving it.) Furthermore, there is nothing inherent about *stress* per se in the Latin script (or any other script, for that matter). Lexical stress is a phonological system, not shared or structured the same way in all languages. And there are *thousands* of languages written with the Latin script -- with all kinds of phonological systems associated with them. Some have lexical tones, some do not. Some have other kinds of phonological accentuation systems that don't count as lexical stress, per se. And there are differences between lexical stress (and its indication), and other kinds of "stress". Contrastive stress, which is way more interesting to consider as a part of writing, IMO, than lexical stress, is a *prosodic* phenomenon, not a lexical one. (And I have been using the email convention of asterisks here to indicate contrastive stress in multiple instances.) And contrastive stress is far from the only kind of communicatively significant pitch phenomenon in speech that typically isn't formally represented in standard orthographies. There are numerous complex scoring systems for linguistic prosody that have been developed by linguists interested in those phenomenon -- which include issues of pace and rhythm, and not merely pitch contours and loudness. It isn't the job of the Unicode Consortium or the Unicode Standard to sort that stuff out or to standardize characters to represent it. When somebody brings to the UTC written examples of established orthographies using character conventions that cannot be clearly conveyed in plain text with the Unicode characters we already have, *then* perhaps we will have something to talk about. --Ken From unicode at unicode.org Tue Jan 8 23:33:21 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 8 Jan 2019 21:33:21 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: On Tue, Jan 8, 2019 at 2:03 AM James Kass via Unicode wrote: > The boundaries of plain text have advanced since the concept originated > and will probably continue to do so. Stress can currently be > represented in plain text with conventions used in lieu of existing > typographic practice. Unicode can preserve texts created using the > plain text kludges/conventions for marking stress, but cannot preserve > printed texts which use standard publishing conventions for marking > stress, such as italics. > Is there any way to preserve The Art of Computer Programming except as a PDF or its TeX sources? Grabbing a different book near me, I don't see any way to preserve them except as full-color paged reproductions. Looking at one data format, it uses bold, italics, and inversion (white on black), in sans-serif, serif and script fonts; certainly in lines like "Treasure standard (+1 starknife)", offering "Treasure standard (+1 starknife)" is completely insufficient. 
Can some books be mostly handled with Unicode plain text and italics? Sure. HTML can handle them quite nicely. I'd say even them will have headers that are typographically distinguished and should optimally be marked in a transcription. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 00:58:51 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 9 Jan 2019 06:58:51 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> Message-ID: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> Ken Whistler wrote, > It isn't the job of the Unicode Consortium or the Unicode Standard > to sort that stuff out or to standardize characters to represent it. Agreed, it isn?t. > When somebody brings to the UTC written examples of established > orthographies using character conventions that cannot be clearly > conveyed in plain text with the Unicode characters we already have, > *then* perhaps we will have something to talk about. If a text is published in all italics, that?s style/font choice.? If a text is published using italics and roman contrastively and consistently, and everybody else is doing it pretty much the same way, that?s a convention. Typewriting is mechanical writing.? Computer keyboards, input methods, and Unicode are technological advances in mechanical writing.? Typesetting for publishing is mechanical writing for the purpose of mass production and distribution of texts. From a printed Webster?s, lexicon (lek? si k?n) [ < Gr. ??????????, word. ]? 1.? a dictionary? 2.? a special vocabulary There?s a convention in English writing to express foreign words using italics.? Not just in published dictionaries, but also in running text where foreign words and phrases are deployed. Other italics conventions include ship names such as the SS ????????? ????????, or titles such as ???? ?????????? ???????? ??????, which is properly spelled with a ??? in ??a?.? (Math kludge fail.)? Of course, since that song title is in a foreign language, it should be italicized anyway. Quoting from, http://navalmarinearchive.com/research/ship_names.html ?Names of specific ships and other vessels are both capitalized and italicized (or capitalized entirely - "all caps" - in text documents denying italics such as email, use of a mechanical typewriter.)? There were technological constraints denying italics in mechanical typewriters.? There?s a technical consortium denying italics in Latin computer plain text, for better or worse.? (Trying to state the obvious here without being judgmental.) The use of italics in English writing to mark stress is another existing convention.? Italics don?t interfere with legibility in English fiction when used to indicate stress in dialogue between the characters.? Rather, the italics add information enabling the reader to approximate how the author intended the dialogue to be *spoken*. And ??????? information cannot be preserved in Unicode plain text without the math kludge or using asterisks and slashes as ???? ?????????? mark-up. ????????????? is important? vs. ?Stress ???? important?. I look forward to the continuing evolution of plain text and would welcome the ability to use italics in plain text without kludges. 
But I?m not holding my breath. Anybody making a formal proposal for italics encoding can be assured that the proposal would be received with something less than enthusiasm.? But stranger things have happened. Many of us here are old enough to remember when something like was a non-starter because in-line pictures were out of scope for a computer plain text standard.? But now I could plop a picture of a cow (or worse) right into this plain text e-mail, if I were so inclined.? That?s progress for you. It?s too bad they called it ????? ????????????? ???????????? ???? ?????????? instead of ?The Chicago Manual of Correct American English Orthographic Conventions for Text Publishing?, eh?? Maybe ?Style? sounded more classy.? But it *does* tend to make it simpler for people to dismiss such distinctions as being merely stylistic. But if the distinction is merely stylistic, we wouldn?t have needed to develop typewriter or computer plain text kludges for them in order to express ourselves properly. (Apologies for length and Happy New Year!) From unicode at unicode.org Wed Jan 9 01:30:26 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 8 Jan 2019 23:30:26 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> Message-ID: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 01:56:23 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 9 Jan 2019 07:56:23 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: <43ac21d8-2e2e-2223-2698-09c7663481e8@gmail.com> David Starner wrote, > Can some books be mostly handled with Unicode plain text > and italics? Sure. HTML can handle them quite nicely. ... Yes, many books can be handled very well with HTML using simple mark-up.? If I were producing a computer file to reproduce an old fiction novel, that's how I'd do it.? Not because it's better or simpler than plain text, but because it can't really be done in plain text at this time.? But if a section of the text is copy/pasted from the screen into an editor, some of the original information may be lost. As you point out, there's a lot of published material best viewed digitally as full color page scans.? As it should be.? That seems unlikely to change. From unicode at unicode.org Wed Jan 9 03:06:26 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 9 Jan 2019 09:06:26 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: Asmus Freytag wrote, > Still, not supported in plain text (unless you abuse the > math alphabets for things they were not intended for). 
The unintended usage of math alphanumerics in the real world is fairly widespread, at least in screen names. (I still get a kick out of this:) http://www.ewellic.org/mathtext.html I wonder how many times Doug's program has been downloaded. Whether it's "abuse" or not might depend on whether one considers the user community of the machines which process the texts to be more important than the user community of human beings who author, exchange, and read the texts. Real humans are the user community of the UCS.? It's up to the user community to determine how its letters and symbols get used.? That's the general rule-of-thumb Unicode applies to the subset user communities, and it should apply to the complete superset as well. From unicode at unicode.org Wed Jan 9 03:25:54 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 9 Jan 2019 01:25:54 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <43ac21d8-2e2e-2223-2698-09c7663481e8@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <43ac21d8-2e2e-2223-2698-09c7663481e8@gmail.com> Message-ID: On Tue, Jan 8, 2019 at 11:58 PM James Kass via Unicode wrote: > > David Starner wrote, > > > Can some books be mostly handled with Unicode plain text > > and italics? Sure. HTML can handle them quite nicely. ... > > Yes, many books can be handled very well with HTML using simple > mark-up. If I were producing a computer file to reproduce an old > fiction novel, that's how I'd do it. Not because it's better or simpler > than plain text, but because it can't really be done in plain text at > this time. But if a section of the text is copy/pasted from the screen > into an editor, some of the original information may be lost. > Looking at the Encyclopedia Brown book at hand, you'd lose any marking that "The Case of the Headless Ghost" is the chapter header. While the picture of the treasure chest may be gratuitous, but "he hung his sign outside the garage:" is followed by an image of said sign that says "BROWN DETECTIVE AGENCY...". If you copy/paste that without carrying the original image along, some of the original information will be lost. In the Gmail editor, I see buttons to make the text bold, italic, or underlined, and to change the color, text size and font. English users tend to see italics as part and parcel of the text formatting. One can argue that's part of history, that italics is somehow different from bold and underline and font and text size changes, but when the standard perception conveniently matches how Unicode encodes the script, there doesn't seem much point in changing things, especially with terabytes of text that encodes italics separately from the plain text matter. Frequently, copy/pasting material does preserve non-plain text features; if I paste a title from Wikipedia into here, it will show up much larger then the rest of the text. It's a pain, because I want the underlying text, not how it was displayed in the context. Honestly, I could argue that case should not be encoded. It would simplify so much processing of Latin script text, and most of the time case-sensitive operations are just wrong. Case is clearly a headache that has to be dealt with in plain text, but it certainly doesn't encourage me to add another set of characters that are basically the same but not. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Jan 9 03:37:53 2019 From: unicode at unicode.org (Tex via Unicode) Date: Wed, 9 Jan 2019 01:37:53 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: <000901d4a7fe$ff8f9a40$feaecec0$@xencraft.com> James Kass wrote: If a text is published in all italics, that?s style/font choice. If a text is published using italics and roman contrastively and consistently, and everybody else is doing it pretty much the same way, that?s a convention. Asmus Freytag responded: But not all conventions are deemed worth of plaintext encoding. What are the criteria for ?worth?? Way back when, when plain text was very very plain, arguments about not including text styling seemed reasonable. But with the inclusion of numerous emoji as James mentioned, it seems odd to be protesting a few characters that would enhance ?plain text? considerably. Plain text editors today support bold, italic, and other styles as a fundamental requirement for usability. More text editors support styling than support bidi or interlinear annotation. If there was support for the handful of text features used by most plain text editors (bold, italic, strikethrough, underline, superscript, subscript, et al) (perhaps using more generalized names such as emphasis, stress, deleted?) then many of the redundant (bold, italic, ?) characters in Unicode would not have been needed. HTML seemed to do very well with a very few styling elements. HTML is of course rich text, but I am just demonstrating that a very small number of control characters would bring plain text into the modern state of text editing. Editors that don?t have the capability for bolding, underlining, etc. could ignore these controls or convert them to another convention. As James requested, it would also provide interoperability. Arguments about all of the conventions that Unicode does not support doesn?t seem compelling to me, as it seems increasingly random as what is accepted and what isn?t, or at least the rationales seem inconsistent. A case in point is the addition of the ?SS? character which made implementation complex with little benefit. Interlinear annotation is perhaps another example. I don?t want to enter into a debate about why these deserved inclusion. I am only saying they seem less useful than some other cases which seem deserving. **And right now, Dr. Strangelove style, my right hand is restraining my other hand from typing on the keyboard, to avoid saying anything about emoji.** Ken distinguished numerous variations of stress, which of course have their place, representations and uses. But perhaps for plain text we only need a way to indicate ?stress here?, leave it to the text editor to have some form of rendering. For more distinctions the user needs to use rich text. Surely there is an 80/20 rule that motivates a solution rather than letting the one percent prevent a capability that 99% would enjoy. (Yes I mixed metaphors. I feel an Occupy Unicode movement coming on. J ) I don?t see how adding a few text style controls would be a burden to most implementers. 
Given ideographic variation sequences, skin tones, hair styles, and the other requirements for proper Unicode support, arguing against a few text styling capabilities seems very last century. (Or at least 1990s?) And it might save having to add a few more bold, italic, superscript, et al compatibility characters? tex -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 05:03:52 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 9 Jan 2019 03:03:52 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: <14543537-d957-64fc-a221-9172c5e22035@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 05:04:18 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 9 Jan 2019 03:04:18 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <000901d4a7fe$ff8f9a40$feaecec0$@xencraft.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <000901d4a7fe$ff8f9a40$feaecec0$@xencraft.com> Message-ID: <5d7a2a1b-e508-dc06-f9a0-5b5996f5610e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 04:29:36 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Wed, 9 Jan 2019 10:29:36 +0000 (GMT) Subject: A last missing link for interoperable representation In-Reply-To: <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> I suggest that a solution to the problem would be to encode a COMBINING ITALICIZER character, such that it only applies to the character that it immediately follows. So, for example, to make the word apricot become displayed in italics one would use seven COMBINING ITALICIZER characters, one after each letter of the word apricot. The display could be sorted out using an OpenType font by treating each pair of a letter and a COMBINING ITALICIZER as a ligature. If, say, the glyph name of COMBINING ITALICIZER were italic then the glyph for c italic could be c_italic and so plain text might well be copyable from a PDF (Portable Document Format) document and pasted to WordPad as plain text retaining the COMBINING ITALICIZER character, depending upon which application program is used to produce the PDF document and which PDF reader is in use. This would seem a workable solution. 
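For what it is worth, the mechanism is the same letter-by-letter interleaving already available today with U+0332 COMBINING LOW LINE, the character Marcel raised at the start of this thread; only the mark differs. A minimal sketch in Python, using an arbitrary Private Use Area code point as a stand-in for the proposed (not encoded) italicizer:

COMBINING_LOW_LINE = "\u0332"        # encoded; the stress-marking convention discussed earlier
HYPOTHETICAL_ITALICIZER = "\uE013"   # arbitrary PUA value, purely for this sketch

def mark_word(word, mark):
    """Insert `mark` after every letter of `word`, as in 'seven italicizers for apricot'."""
    return "".join(ch + mark if ch.isalpha() else ch for ch in word)

def unmark(text, mark):
    """Strip the mark again, recovering the unadorned text."""
    return text.replace(mark, "")

stressed = mark_word("apricot", COMBINING_LOW_LINE)       # renders with low lines, font permitting
italic = mark_word("apricot", HYPOTHETICAL_ITALICIZER)    # would render via letter-plus-mark ligatures
assert unmark(italic, HYPOTHETICAL_ITALICIZER) == "apricot"
print(len("apricot"), len(italic))                        # 7 base letters, 14 code points

The doubling of the code point count is the cost of the scheme; a font that ligates each letter-plus-mark pair, as described above, is what makes it display.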
Many years ago I suggested having characters that would have been comparable in use in plain text as to how italics is switched on and off in HTML (Hypertext Markup Language) yet was advised that such an encoding would make plain text stateful and thus would not be agreed for encoding. That objection might well still be the case today. So using a COMBINING ITALICIZER character would avoid that objection and would also provide a solution that could be straightforwardly implemented using existing OpenType technology. William Overington Wednesday 9 January 2019 From unicode at unicode.org Wed Jan 9 13:58:35 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 9 Jan 2019 19:58:35 +0000 Subject: Where is my character @? In-Reply-To: <51e22fd8-8478-d901-39e3-36f43d757eeb@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <51e22fd8-8478-d901-39e3-36f43d757eeb@ix.netcom.com> Message-ID: There was a post in an unrelated thread remarking that an unnamed writing system used the "at" sign (@) as a letter, and that optimal encoding for that orthography was a non-starter. A question as to whether that writing system was casing went unanswered, but a kind list member offered some pointers privately. The language in question is Koalib, which is spoken in the Sudan. It is a casing script and the upper case form uses an upper case "A" with a wrap around as in the lower case "@". The current "solution" is for the users to use the P.U.A. for both upper and lower case letters, and fonts such as Doulos SIL support that P.U.A. encoding. A Google search for "Koalib Unicode" finds the following: Wikipedia: https://en.wikipedia.org/wiki/Koalib_language 2004-08-25 Lorna A. Priest, Public Review Issue # 40 Revised Proposal to Encode... http://www.unicode.org/review/pr-40-atsigns.pdf 2004-10-20 Doug Ewell, L2/04-365 The case against encoding the Koalib @-letters http://unicode.org/L2/L2004/04365-pr40-ewell.pdf 2012-04-17 Karl Pentzlin, L2/12-116 "Capitalized Commercial At" proposal http://unicode.org/L2/L2012/12116-capital-at.pdf 2018-12-26 Eduardo Mar?n Silva, L2/19-006 Proposal to encode... http://www.unicode.org/L2/L2019/19006-capital-at.pdf It's probably old-fashioned to say that technology should be forced to accomodate people rather than the other way around.? But it's good to note that efforts are still being made on behalf of the users to make progress towards U.C.S. inclusion. From unicode at unicode.org Wed Jan 9 15:33:02 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 09 Jan 2019 14:33:02 -0700 Subject: A last missing link for interoperable representation Message-ID: <20190109143302.665a7a7059d7ee80bb4d670165c8327d.791f9e387d.wbe@email03.godaddy.com> James Kass wrote: > (I still get a kick out of this:) > http://www.ewellic.org/mathtext.html > > I wonder how many times Doug's program has been downloaded. I?ll never know, since I never attached a web counter of any sort to it. Andrew West?s online ?Unicode Text Styler? 
includes non-math characters (like circled and fullwidth) as well, and is probably better, although it doesn't include the ransom-note option: http://www.babelstone.co.uk/Unicode/text.html -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jan 9 16:03:25 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 09 Jan 2019 15:03:25 -0700 Subject: [OT] Digest supports only ASCII (was: Re: A last missing link...) Message-ID: <20190109150325.665a7a7059d7ee80bb4d670165c8327d.0b0aa3fdf6.wbe@email03.godaddy.com> As reported in Unicode Digest, Vol 61, Issue 3, James Kass wrote: > And ??????? > information cannot be preserved in Unicode plain text without the math > kludge or using asterisks and slashes as ???? ?????????? mark-up. > > ????????????? is important? vs. ?Stress ???? important?. I know this is an old argument and this will probably never be fixed, but I wish the Unicode email digest could be updated to support, you know, Unicode. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jan 9 16:15:08 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 09 Jan 2019 15:15:08 -0700 Subject: Where is my character =?UTF-8?Q?=40=3F?= Message-ID: <20190109151508.665a7a7059d7ee80bb4d670165c8327d.452f32d7a9.wbe@email03.godaddy.com> James Kass wrote: > It's probably old-fashioned to say that technology should be forced to > accomodate people rather than the other way around. But it's good to > note that efforts are still being made on behalf of the users to make > progress towards U.C.S. inclusion. I'm as opposed to this proposal as I was in 2004, if not more so, and I'm working on a brief response document for next week's UTC. Among other things, it's not at all clear that the orthography using @, cited in three works from a single publisher in 1998, has been adopted or become particularly widespread within the Koalib community. (And no, this does not constitute "disdain for the small community.") -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jan 9 18:41:05 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 9 Jan 2019 19:41:05 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: On 1/9/19 2:30 AM, Asmus Freytag via Unicode wrote: > > English use of italics on isolated words to disambiguate the reading > of some sentences is a convention. Everybody who does it, does it the > same way. Not supported in plain text. > > German books from the Fraktur age used Antiqua for Latin and other > foreign terms. Definitely a convention that was rather universally > applied (in books at least). Not supported in plain text. > Aren't there printing conventions that indicate this type of "contrastive stress" using letterspacing instead of font style?? I'm s?u?r?e I've seen it in German and other Latin-written languages, and also even occasionally in Hebrew, whose experiments with italics tend not to be encouraging. ~mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Jan 9 18:45:31 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 9 Jan 2019 19:45:31 -0500 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: On 1/9/19 12:33 AM, David Starner via Unicode wrote: > > > Is there any way to preserve The Art of Computer Programming except as > a PDF or its TeX sources? Grabbing a different book near me, I don't > see any way to preserve them except as full-color paged reproductions. > Looking at one data format, it uses bold, italics, and inversion > (white on black), in sans-serif, serif and script fonts; certainly in > lines like "Treasure standard (+1 starknife)", offering > "Treasure standard (+1 starknife)" is completely insufficient. > > Can some books be mostly handled with Unicode plain text and italics? > Sure. HTML can handle them quite nicely. I'd say even them will have > headers that are typographically distinguished and should optimally be > marked in a transcription. The line I used to say about this is ?there?s no such thing as plain text on paper.?? The concept of ?plain text? vs markup or styling is purely in the digital domain.? On physical artifacts, it?s just ink on wood-pulp, and the only ?real? description of the page is a graphic image. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 18:49:38 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 9 Jan 2019 19:49:38 -0500 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <43ac21d8-2e2e-2223-2698-09c7663481e8@gmail.com> Message-ID: On 1/9/19 4:25 AM, David Starner via Unicode wrote: > > > Honestly, I could argue that case should not be encoded. It would > simplify so much processing of Latin script text, and most of the time > case-sensitive operations are just wrong. Case is clearly a headache > that has to be dealt with in plain text, but it certainly doesn't > encourage me to add another set of characters that are basically the > same but not. I completely agree.? Casing of letters (in general, I mean) was a horrible mistake and is way more trouble than it?s worth.? Too late to fix it, and given how entrenched it is it did kind of have to be encoded, but it?s such a bad idea.? And then other alphabets see it and think ?hey, we need capitals too!? and you get capitals for all the IPA extensions and Cherokee and so on... Ugh. ~mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Jan 9 20:31:10 2019 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Thu, 10 Jan 2019 03:31:10 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <20190109143302.665a7a7059d7ee80bb4d670165c8327d.791f9e387d.wbe@email03.godaddy.com> References: <20190109143302.665a7a7059d7ee80bb4d670165c8327d.791f9e387d.wbe@email03.godaddy.com> Message-ID: <20190110023110.jgglofux535kvqzn@angband.pl> On Wed, Jan 09, 2019 at 02:33:02PM -0700, Doug Ewell via Unicode wrote: > James Kass wrote: > > (I still get a kick out of this:) > > http://www.ewellic.org/mathtext.html > Andrew West?s online ?Unicode Text Styler? includes non-math > characters (like circled and fullwidth) as well, and is probably better, > although it doesn't include the ransom-note option: > > http://www.babelstone.co.uk/Unicode/text.html And for the command line, there's my https://github.com/kilobyte/tran No ransom-note as I pretend the tool's primary use is tran{scrib,literat}ing between actual human scripts -- but it's remarkably easier to automate a command line tool... Meow! -- ??????? Hans 1 was born and raised in Johannesburg, then moved to Boston, ??????? and has just became a naturalized citizen. Hans 2's grandparents ??????? came from Melanesia to D?sseldorf, and he hasn't ever been outside ??????? Germany until yesterday. Which one is an African-American? From unicode at unicode.org Wed Jan 9 21:00:43 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 9 Jan 2019 19:00:43 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: <36610912-dcdd-917b-7ddc-ced595be76b8@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 10 09:16:42 2019 From: unicode at unicode.org (Arthur Reutenauer via Unicode) Date: Thu, 10 Jan 2019 16:16:42 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: <20190110151642.dz7r6pvhhqh2nay6@phare.normalesup.org> On Wed, Jan 09, 2019 at 09:06:26AM +0000, James Kass via Unicode wrote: > The unintended usage of math alphanumerics in the real world is fairly > widespread, at least in screen names. On this topic, I was just pointed to https://twitter.com/kentcdodds/status/1083073242330361856 ?You ?????????? it's ??????? to ?????????? your tweets and usernames ???????? ??????. But have you ???????????????? to what it ???????????? ???????? with assistive technologies like ???????????????????? 
Best, Arthur From unicode at unicode.org Thu Jan 10 10:24:59 2019 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Thu, 10 Jan 2019 21:54:59 +0530 Subject: Excessive emoji usage and TTS (was Re: A last missing link) Message-ID: On Thu 10 Jan, 2019, 20:49 Arthur Reutenauer via Unicode < unicode at unicode.org wrote: > > On this topic, I was just pointed to > > https://twitter.com/kentcdodds/status/1083073242330361856 > > ?You ?????????? it's ??????? to ?????????? your tweets and usernames > ???????? ??????. But > have you ???????????????? to what it ???????????? ???????? with assistive > technologies > like ???????????????????? Something similar: https://twitter.com/aaronreynolds/status/1083098920132071424?s=20 "This is what it?s like to get texts from my fourteen year old while driving." https://t.co/s8949bmgZI -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 10 10:41:29 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Thu, 10 Jan 2019 18:41:29 +0200 Subject: Excessive emoji usage and TTS (was Re: A last missing link) In-Reply-To: References: Message-ID: <20190110164129.GC28761@macbook.localdomain> On Thu, Jan 10, 2019 at 09:54:59PM +0530, Shriramana Sharma via Unicode wrote: > On Thu 10 Jan, 2019, 20:49 Arthur Reutenauer via Unicode < > unicode at unicode.org wrote: > > > > > On this topic, I was just pointed to > > > > https://twitter.com/kentcdodds/status/1083073242330361856 > > > > ?You ?????????? it's ??????? to ?????????? your tweets and usernames > > ???????? ??????. But > > have you ???????????????? to what it ???????????? ???????? with assistive > > technologies > > like ???????????????????? > > > Something similar: > > https://twitter.com/aaronreynolds/status/1083098920132071424?s=20 > > "This is what it?s like to get texts from my fourteen year old while > driving." > > https://t.co/s8949bmgZI That is pretty good actually and even a positive point for emoji (if these were mere images you would get nothing out of it without extra tagging, and it would still lack the standardization). Nothing like what one gets from the math symbols abuse. Regards, Khaled From unicode at unicode.org Thu Jan 10 13:35:40 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 10 Jan 2019 19:35:40 +0000 Subject: Excessive emoji usage and TTS (was Re: A last missing link) In-Reply-To: <20190110164129.GC28761@macbook.localdomain> References: <20190110164129.GC28761@macbook.localdomain> Message-ID: On 2019-01-10 4:41 PM, Khaled Hosny wrote: > That is pretty good actually and even a positive > point for emoji (if these were mere images you > would get nothing out of it without extra tagging, > and it would still lack the standardization). > Nothing like what one gets from the math symbols > abuse. Yes, it's quite a difference.? I can read the text with math character use and can skip the texts with emoji. Mathematicians borrowed these letters from writers.? Now writers are borrowing them back.? Seems fair. 
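For anyone who has not seen the trick spelled out, here is a minimal sketch of the kind of mapping such tools perform. Doug Ewell's mathtext and Andrew West's Unicode Text Styler do this far more completely; the code below only illustrates the principle and is not theirs.

ITALIC_CAPITAL_A = 0x1D434   # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E     # MATHEMATICAL ITALIC SMALL A

def math_italic(text):
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif ch == "h":
            out.append("\u210E")   # U+1D455 is unassigned; italic h is U+210E PLANCK CONSTANT
        elif "a" <= ch <= "z":
            out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)         # digits, accented letters, punctuation pass through unstyled
    return "".join(out)

print(math_italic("Stress is important"))

The pass-through branch is also why the trick stops at basic Latin: there are no mathematical-italic counterparts for accented letters.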
From unicode at unicode.org Thu Jan 10 17:43:46 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 10 Jan 2019 23:43:46 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> Message-ID: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> On 2019-01-10 11:27 PM, wjgo_10009 at btinternet.com wrote: > Yesterday I wrote as follows. > >> I suggest that a solution to the problem would be to encode a >> COMBINING ITALICIZER character, such that it only applies to the >> character that it immediately follows. So, for example, to make the >> word apricot become displayed in italics one would use seven >> COMBINING ITALICIZER characters, one after each letter of the word >> apricot. > > I have now made a test font. I used a Private Use Area code point and > a visible glyph for this test. It works well. > > https://forum.high-logic.com/viewtopic.php?f=10&t=7831 > > Would it be a good idea to encode such a character into Unicode? The > first step would be to persuade the "powers that be" that italics are > needed.? That seems presently unlikely.? There's an entrenched mindset > which seems to derive from the fact that pre-existing character sets > were based on mechanical typewriting technology and were limited by > the maximum number of glyphs in primitive computer fonts. The first step would be to persuade the "powers that be" that italics are needed.? That seems presently unlikely.? There's an entrenched mindset which seems to derive from the fact that pre-existing character sets were based on mechanical typewriting technology and were further limited by the maximum number of glyphs in primitive computer fonts. The second step would be to persuade Unicode to encode a new character rather than simply using an existing variation selector character to do the job. From unicode at unicode.org Thu Jan 10 17:46:42 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 10 Jan 2019 23:46:42 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> Message-ID: Oops.? Sorry for the inadvertent copy/paste duplication. 
From unicode at unicode.org Thu Jan 10 17:27:10 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Thu, 10 Jan 2019 23:27:10 +0000 (GMT) Subject: A last missing link for interoperable representation In-Reply-To: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> Message-ID: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> Yesterday I wrote as follows. > I suggest that a solution to the problem would be to encode a > COMBINING ITALICIZER character, such that it only applies to the > character that it immediately follows. So, for example, to make the > word apricot become displayed in italics one would use seven COMBINING > ITALICIZER characters, one after each letter of the word apricot. I have now made a test font. I used a Private Use Area code point and a visible glyph for this test. It works well. https://forum.high-logic.com/viewtopic.php?f=10&t=7831 Would it be a good idea to encode such a character into Unicode? William Overington Thursday 10 January 2019 -------------- next part -------------- A non-text attachment was scrubbed... Name: italicizer_maquette_example.png Type: image/png Size: 19268 bytes Desc: not available URL: From unicode at unicode.org Thu Jan 10 18:28:08 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Thu, 10 Jan 2019 19:28:08 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> Message-ID: <965db628-a917-9083-7bb0-910c571d4441@kli.org> On 1/10/19 6:43 PM, James Kass via Unicode wrote: > > The first step would be to persuade the "powers that be" that italics > are needed.? That seems presently unlikely.? There's an entrenched > mindset which seems to derive from the fact that pre-existing > character sets were based on mechanical typewriting technology and > were further limited by the maximum number of glyphs in primitive > computer fonts. > > The second step would be to persuade Unicode to encode a new character > rather than simply using an existing variation selector character to > do the job. A perhaps more affirmative step, not necessarily first but maybe, would be to write up a proposal and submit it through channels so the "powers that be" can respond officially. 
~mark From unicode at unicode.org Thu Jan 10 18:37:11 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 00:37:11 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <965db628-a917-9083-7bb0-910c571d4441@kli.org> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <965db628-a917-9083-7bb0-910c571d4441@kli.org> Message-ID: <081f819e-ca23-4008-55c3-10bc33dfefac@gmail.com> Mark E. Shoulson wrote, > A perhaps more affirmative step, not necessarily first > but maybe, would be to write up a proposal and submit > it through channels so the "powers that be" can > respond officially. Indeed.? And a preliminary step might be to float the concept on the public list and see how well it is received.? Such discussion can often lead to more robust proposals, or an alternative use for one's time.? (smiles) From unicode at unicode.org Thu Jan 10 19:14:45 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 11 Jan 2019 01:14:45 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> Message-ID: <20190111011445.1773182d@JRWUBU2> On Thu, 10 Jan 2019 23:43:46 +0000 James Kass via Unicode wrote: > The second step would be to persuade Unicode to encode a new > character rather than simply using an existing variation selector > character to do the job. Actually, this might be a superior option. Richard. 
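To make the variation-selector idea concrete: mechanically it is the same letter-plus-mark interleaving, only with a selector instead of a combining character. No italic variation sequences are defined in Unicode, so VS14 in this sketch is only a stand-in for whatever selector such a proposal might register.

VS14 = "\uFE0D"   # VARIATION SELECTOR-14; here a placeholder for a hypothetical "italic" selector

def with_selector(word, selector=VS14):
    return "".join(ch + selector for ch in word)

sequence = with_selector("apricot")
print(" ".join(f"U+{ord(ch):04X}" for ch in sequence))
# U+0061 U+FE0D U+0070 U+FE0D U+0072 U+FE0D U+0069 U+FE0D U+0063 U+FE0D U+006F U+FE0D U+0074 U+FE0D

Since variation selectors are default-ignorable, a renderer that does not support such a sequence would at least fall back to displaying the plain letters.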
From unicode at unicode.org Thu Jan 10 19:48:23 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 01:48:23 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <20190111011445.1773182d@JRWUBU2> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> Message-ID: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Richard Wordingham responded, >> ... simply using an existing variation >> selector character to do the job. > > Actually, this might be a superior option. For the V.S. option there should be a provision for consistency and open-endedness to keep it simple.? Start with VS14 and work backwards for italic, fraktur, antiqua...? (whatever the preferred order works out to be).? Or (better yet) start at VS17 and move forward (and change the rule that seventeen and up is only for CJK). Is it true that many of the CJK variants now covered were previously considered by the Consortium to be merely stylistic variants? From unicode at unicode.org Fri Jan 11 01:13:18 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 07:13:18 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Message-ID: I've been advised off-list that my attempt to make an analogy with CJK doesn't sit well. It's fair to say that ideographic variation sequences are for plain-text representation of material which isn't suitable for atomic encoding.? An analogy can be drawn from that situation to the situation of other scripts, such as Latin (or Khmer). The ideographic variation sequences also represent an anomaly:? if it's not suitable for plain-text encoding, it doesn't *need* plain-text representation.? Except that it does. It's the demands of the CJK user community which drive the plain-text representation, which is proper.? This method should apply to non-CJK scripts as well. Styled Latin text is being simulated with math alphanumerics now, which means that data is being interchanged and archived.? That's the user demand illustrated. Whether the users are doing it Chicago style or just plain willy-nilly doesn't matter; it's being done.? 
User communities drive their own script development and advancement using the tools available. From unicode at unicode.org Fri Jan 11 01:29:42 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Fri, 11 Jan 2019 07:29:42 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Message-ID: <61e13a9b-1ade-c89b-8cae-07c965194c01@it.aoyama.ac.jp> On 2019/01/11 10:48, James Kass via Unicode wrote: > Is it true that many of the CJK variants now covered were previously > considered by the Consortium to be merely stylistic variants? What is a stylistic variant or not is quite a bit more complicated for CJK than for scripts such as Latin. In some contexts, something may be just a stylistic variant, whereas in other contexts (e.g. person registries,...), it may be more than a stylistic distinction. Also, in contrast to the issue discussed in the current thread, there's no consistent or widely deployed solution for such CJK variants in rich text scenarios such as HTML. Regards, Martin. From unicode at unicode.org Fri Jan 11 02:13:44 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Fri, 11 Jan 2019 08:13:44 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Message-ID: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> On 2019/01/11 16:13, James Kass via Unicode wrote: > Styled Latin text is being simulated with math alphanumerics now, which > means that data is being interchanged and archived.? That's the user > demand illustrated. Almost by definition, styled text isn't plain text, even if it's simulated by something else. And the simulation is highly limited, as the voicing examples and the fact that the math alphanumerics only cover basic Latin have shown. Regards, Martin. 
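Martin's coverage point is easy to check. A minimal sketch, assuming Python 3 and nothing beyond the standard library (the word chosen is only an example): the math alphanumerics supply italic forms for the basic Latin letters alone, so anything else, such as an accented letter, has to be faked with a combining mark or left plain.

    import unicodedata

    # "café" simulated with Mathematical Italic letters: there is no italic é,
    # so the acute accent must be supplied as the combining mark U+0301.
    styled_cafe = "\U0001D450\U0001D44E\U0001D453\U0001D452\u0301"

    # NFKC folds the math letters back to ASCII and recomposes e + U+0301 to é.
    print(unicodedata.normalize("NFKC", styled_cafe) == "caf\u00e9")   # True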
From unicode at unicode.org Fri Jan 11 04:43:55 2019 From: unicode at unicode.org (Tex via Unicode) Date: Fri, 11 Jan 2019 02:43:55 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> Message-ID: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> Martin, James is making the case there is demand or a user need and that the proof is that users are using inconsistent tactics to simulate a solution to their problem. The response that: "Almost by definition, styled text isn't plain text, even if it's simulated by something else." is a bit like Humpty Dumpty saying words mean what I want them to mean. Most of the emoji aren't plain text and Unicode has them in abundance. Ruby text is also not plain text. Their inclusion was the user need for consistency and interoperability. The original emoji had inconsistent encodings and were a problem for interchange as well as search and rendering. Their existence and popularity became their own problem requiring further styling (e.g. coloring) and greatly expanded enumeration (foods, animals, et al.) Let's be honest and admit the actual demand for some of these latter objects in plain text is marginal and certainly is less than the prevalence of italics. The response that: "the simulation is highly limited, as the voicing examples and the fact that the math alphanumerics only cover basic Latin have shown." unless I misunderstand your meaning, is the argument that we encoded only these therefore the use case is limited to these. In a different message you say: "Also, in contrast to the issue discussed in the current thread, there's no consistent or widely deployed solution for such CJK variants in rich text scenarios such as HTML." I don't see how a rich text solution has any bearing on plain text. We could take the point that if there was no need in HTML to solve the problem than there wasn't demand justifying the need in Unicode. :-) I understand your actual intent to say there was a need for CJK variants and there was no other solution. However, the fact that there is a rich text solution for italics isn't helpful to plain text users. HTML had bidirectional isolates and after the fact Unicode encoded them as well. The fact that there isn't a consistent way to represent stress or the other uses for italics (or obliques, and bold, etc.) does make certain searches across large numbers of plain texts problematic. In the same way it is sometimes important to distinguish capitalized text when searching (polish vs Polish) it would be helpful to do the same for italicized text. For example, if I am searching for the movie title "Contact" vs. 
all the places where texts reference a personal "Contact", distinguishing italicized titles would help. And to the extent that users are inserting non-standardized punctuation or other characters for "styling" it makes reliable searching difficult. As James mentioned it helps with interoperability as well. In the '90s it made sense to resist styling plain text. In the 2020's, with more than 100k characters, numerous pictures and character adornments, it seems anachronistic to be arguing against a handful of control characters that would standardize a common text requirement. Most rendering systems will handle it easily and any plain text editor or other software that supports a combining strikethrough character would easily adapt a combining italicize or a combining bold character. tex From unicode at unicode.org Fri Jan 11 16:28:40 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Fri, 11 Jan 2019 14:28:40 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> Message-ID: Emoji were being encoded as characters, as codepoints in private use areas. That inherently called for a Unicode response. Bidirectional support is a headache; the amount of confusion and outright exploits from them is way higher then we like.The HTML support probably doesn't help that. However, properly mixing Hebrew and English (e.g.) is pretty clearly a plain text problem. There are terabytes of Latin text out there, most of it encoded in formats that already support italics. Whereas emoji, encoded as characters in a then limited number of systems, could be subsumed into Unicode easily, much of that text will never be edited and those formats will never exclude the existing means of marking italics out of bounds, offering multiple ways to do italics in perpetuity. -- Kie ekzistas vivo, ekzistas espero. 
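Tex's "Contact" example can be made concrete. A minimal sketch, assuming Python 3 and only the standard library (the styled string below is illustrative, not taken from any real text): a plain substring search misses a title written with the math alphanumerics, while a compatibility folding such as NFKC finds it again, which is the kind of preprocessing a search tool would have to bolt on.

    import unicodedata

    # "Contact" written with Mathematical Italic letters (U+1D400 block),
    # the way some users simulate an italicized title in plain text.
    styled = "\U0001D436\U0001D45C\U0001D45B\U0001D461\U0001D44E\U0001D450\U0001D461"

    print("Contact" in styled)                                  # False: no code-point match
    print("Contact" in unicodedata.normalize("NFKC", styled))   # True: the math letters fold to ASCII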
From unicode at unicode.org Fri Jan 11 16:10:25 2019 From: unicode at unicode.org (via Unicode) Date: Fri, 11 Jan 2019 23:10:25 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> Message-ID: On 11.01.2019 11:43, Tex via Unicode wrote: > Martin, > > James is making the case there is demand or a user need and that the > proof is that users are using inconsistent tactics to simulate a > solution to their problem. The use of math characters is mostly to get around limitations of Twitter (and some other platforms). There are plenty of rich text formats like Markdown and Html existing already. I am rather doubtful that it should be Unicode's responsibility to get around lack of rich text support via special characters and fonts, especially since many platforms do not allow users to freely change the fonts (and if these platforms installed such fonts, they could just as easily support markup/rich text instead). Even if they do, the programs/platforms involved would not necessarily enable these fonts by default: if the wanted rich text, they would be supporting it already. Also, any Unicode-based rich text standard would not really be standard compared to the vast amount of HTML out there already. David Faulks From unicode at unicode.org Fri Jan 11 17:17:24 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 23:17:24 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> References: <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> Message-ID: Martin J. D?rst wrote, > Almost by definition, styled text isn't plain text, even if it's > simulated by something else. By an earlier definition, in-line pictures weren't plain text, until people started exchanging them as though they were.? In this case, people are exchanging plain text as plain text. > And the simulation is highly limited, as > the voicing examples and the fact that the math alphanumerics > only cover basic Latin have shown. 
The voicing examples are software shortcomings which could be overcome.? The software people might seize the opportunity to accommodate their users and vocalize bold *loudly*, italics with /stress/, and fraktur with a Boris Karloff (or Bela Lugosi) voice. That would be up to them.? But the voicing examples aren't really about reading and writing and how they relate to the character encoding.? (Not saying that the voicing examples aren't interesting and relevant to the overall topic.) The fact that the math alphanumerics are incomplete may have been part of what prompted Marcel Schneider to start this thread. If stringing encoded italic Latin letters into words is an abuse of Unicode, then stringing punctuation characters to simulate a "smiley" (?) is an abuse of ASCII - because that's not what those punctuation characters are *for*.? If my brain parses such italic strings into recognizable words, then I guess my brain is non-compliant. From unicode at unicode.org Fri Jan 11 17:54:17 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 23:54:17 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> References: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> Message-ID: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Tex Texin wrote, > ... However, the fact that there is a rich text solution for italics > isn't helpful to plain text users. Truer words were never spoken. > In the '90s it made sense to resist styling plain text. In the 2020's, > with more than 100k characters, numerous pictures and character > adornments, it seems anachronistic to be arguing against a handful > of control characters that would standardize a common text > requirement. Most rendering systems will handle it easily and any > plain text editor or other software that supports a combining > strikethrough character would easily adapt a combining italicize or > a combining bold character. Exactly.? William Overington has already posted a proof-of-concept here: https://forum.high-logic.com/viewtopic.php?f=10&t=7831 ... using a P.U.A. character /in lieu/ of a combining formatting or VS character.? The concept is straightforward and works properly with existing technology. 
From unicode at unicode.org Sat Jan 12 04:57:26 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sat, 12 Jan 2019 10:57:26 +0000 (GMT) Subject: A last missing link for interoperable representation References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Message-ID: On 2019-01-11, James Kass via Unicode wrote: > Exactly.? William Overington has already posted a proof-of-concept here: > https://forum.high-logic.com/viewtopic.php?f=10&t=7831 > ... using a P.U.A. character /in lieu/ of a combining formatting or VS > character.? The concept is straightforward and works properly with > existing technology. It does not work with much existing technology. Interspersing extra codepoints into what is otherwise plain text breaks all the existing software that has not been, and never will be updated to deal with arbitrarily complex algorithms required to do Unicode searching. Somebody who need to search exotic East Asian text will know that they need software that understands VS, but a plain ordinary language user is unlikely to have any idea that VS exist, or that their searches will mysteriously fail if they use this snazzy new pseudo-plain-text italicization technique It's also fundamentally misguided. When I _italicize_ a word, I am writing a word composed of (plain old) letters, and then styling the word; I am not composing a new and different word ("_italicize_") that is distinct from the old word ("italicize") by virtue of being made up of different letters. I think the VS or combining format character approach *would* have been a better way to deal with the mess of mathematical alphabets, because for mathematicians, *b* is a distinct symbol from b, and while there may be correlated use of alphabets, there need be no connection whatever between something notated b and something notated *b*. But for plain text, it's crazy. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
From unicode at unicode.org Sat Jan 12 06:29:39 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 12 Jan 2019 12:29:39 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Message-ID: <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Julian Bradford wrote, "It does not work with much existing technology. Interspersing extra codepoints into what is otherwise plain text breaks all the existing software that has not been, and never will be updated to deal with arbitrarily complex algorithms required to do Unicode searching. Somebody who need to search exotic East Asian text will know that they need software that understands VS, but a plain ordinary language user is unlikely to have any idea that VS exist, or that their searches will mysteriously fail if they use this snazzy new pseudo-plain-text italicization technique" Sounds like you didn't try it.? VS characters are default ignorable. First one is straight, the second one has VS2 characters interspersed and after the "t": apricot a?p?r?i?c?o?t? Notepad finds them both if you type the word "apricot" into the search box. "..." Regardless of how you input italics in rich-text, you are putting italic forms into the display. "I think the VS or combining format character approach *would* have been a better way to deal with the mess of mathematical alphabets, ..." I think so, too, but since I'm not a member of *that* user community, my opinion hasn't much value.? Plus VS characters were encoded after the math stuff. "But for plain text, it's crazy." Are you a member of the plain-text user community? From unicode at unicode.org Sat Jan 12 07:21:29 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 12 Jan 2019 13:21:29 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: <91e983a0-a257-6bdb-66af-0622e9a85233@gmail.com> > Julian Bradford wrote, * Bradfield, sorry. 
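James's "apricot" demonstration can be reproduced outside Notepad with a small sketch, again assuming Python 3; the helper functions are hypothetical, written only for this illustration. The point is that a comparison which drops the default-ignorable variation selectors (U+FE00..U+FE0F and U+E0100..U+E01EF) matches both spellings, while a raw code-point comparison does not.

    def strip_variation_selectors(s: str) -> str:
        # Remove VS1-VS16 and the ideographic variation selectors before matching.
        return "".join(ch for ch in s
                       if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF))

    def contains_ignoring_vs(haystack: str, needle: str) -> bool:
        return strip_variation_selectors(needle) in strip_variation_selectors(haystack)

    plain = "apricot"
    tagged = "".join(ch + "\uFE01" for ch in plain)   # VS2 after every letter

    print(tagged == plain)                      # False: code point for code point they differ
    print(contains_ignoring_vs(tagged, plain))  # True: a VS-aware search still finds the word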
From unicode at unicode.org Sat Jan 12 07:22:21 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Jan 2019 13:22:21 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Message-ID: <20190112132221.7497fdea@JRWUBU2> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) Julian Bradfield via Unicode wrote: > It's also fundamentally misguided. When I _italicize_ a word, I am > writing a word composed of (plain old) letters, and then styling the > word; I am not composing a new and different word ("_italicize_") that > is distinct from the old word ("italicize") by virtue of being made up > of different letters. And what happens when you capitalise a word for emphasis or to begin a sentence? Is it no longer the same word? > I think the VS or combining format character approach *would* have > been a better way to deal with the mess of mathematical alphabets, > because for mathematicians, *b* is a distinct symbol from b, and while > there may be correlated use of alphabets, there need be no connection > whatever between something notated b and something notated *b*. Perhaps the influence of school has lingered too well, but I would be very uncomfortable with such a lack of connection. The idea that *b* is a vector and _b_ is its magnitude has stuck well. Italicisation on the other hand, is a confirmation that something is a symbol, and naturally disappears in handwriting. Richard. From unicode at unicode.org Sat Jan 12 08:21:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 12 Jan 2019 14:21:19 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <20190112132221.7497fdea@JRWUBU2> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> Message-ID: <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Reading & writing & 'rithmatick... This is a math formula: a + b = b + a ... where the estimable "mathematician" used Latin letters from ASCII as though they were math alphanumerics variables. This is an italicized word: ???????????????????????? ... 
where the "geek" hacker used Latin italics letters from the math alphanumeric range as though they were Latin italics letters. Where's the harm? FWIW, the math formula: a + b # ?? + ?? ... becomes invalid if normalized NFKD/NFKC.? (Or if copy/pasted from an HTML page using marked-up ASCII into a plain-text editor.) Yet the italicized word "kakistocracy" is still legible if normalized.? If copy/pasted from an HTML page using the math alphanumeric characters, it survives intact.? If copy/pasted from markupped ASCII, it's still legible. From unicode at unicode.org Sat Jan 12 10:21:43 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Jan 2019 16:21:43 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: <20190112162143.06d36c69@JRWUBU2> On Sat, 12 Jan 2019 14:21:19 +0000 James Kass via Unicode wrote: > FWIW, the math formula: > a + b # ?? + ?? > ... becomes invalid if normalized NFKD/NFKC.? (Or if copy/pasted from > an HTML page using marked-up ASCII into a plain-text editor.) (a) Italic versus plain is not significant in the mathematics I've encountered. It's worse than distinguishing capital em and capital mu, which is allowed if you're the head of department. (b) a + b # b + a is a general, but not universally true, statement for ordinal numbers, the simplest example being ? = 1 + ? ? ? + 1 (c) You're talking about a folding, not a normalisation. The example you want would use emboldening, e.g. "In general, ?? + ?? ??? ?? + ??" which is true for vectors ???? and ?? if one is treating the quaternions as a direct sum of reals and real 3-vectors. Richard. From unicode at unicode.org Sat Jan 12 10:26:59 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Sat, 12 Jan 2019 16:26:59 +0000 (GMT) Subject: A last missing link for interoperable representation In-Reply-To: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Message-ID: <7f64c5a1.9721.16842e34f54.Webtop.70@btinternet.com> James Kass wrote: > For the V.S. 
option there should be a provision for consistency and > open-endedness to keep it simple. Start with VS14 and work backwards > for italic, ? I have now made, tested and published a font, VS14 Maquette, that uses VS14 to indicate italic. https://forum.high-logic.com/viewtopic.php?f=10&t=7831&p=37561#p37561 William Overington Saturday 12 January 2019 ------ Original Message ------ From: "James Kass via Unicode" To: unicode at unicode.org Sent: Friday, 2019 Jan 11 At 01:48 Subject: Re: A last missing link for interoperable representation Richard Wordingham responded, >> ... simply using an existing variation >> selector character to do the job. > > Actually, this might be a superior option. For the V.S. option there should be a provision for consistency and open-endedness to keep it simple.? Start with VS14 and work backwards for italic, fraktur, antiqua...? (whatever the preferred order works out to be).? Or (better yet) start at VS17 and move forward (and change the rule that seventeen and up is only for CJK). Is it true that many of the CJK variants now covered were previously considered by the Consortium to be merely stylistic variants? From unicode at unicode.org Sat Jan 12 12:50:00 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 12 Jan 2019 10:50:00 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <20190112132221.7497fdea@JRWUBU2> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> Message-ID: <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 12 13:16:17 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 12 Jan 2019 20:16:17 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> Message-ID: <7670985e-2e49-5b5e-848c-24e00cff7ebd@orange.fr> On 12/01/2019 00:17, James Kass via Unicode wrote: [?] > The fact that the math alphanumerics are incomplete may have been > part of what prompted Marcel Schneider to start this thread. No, really not at all. I didn?t even dream of having italics in Unicode working out of the box. 
That would exactly be the sort of demand that would have completely discredited me advocating the use of preformatted superscripts for the Unicode conformant and interoperable representation of a handful of languages spoken by one third of mankind and using the Latin script, while no other scripts are concerned with that orthographic feature. (No clear borderline between orthography and typography here, but with ordinal indicators in particular and abbreviation indicators in general we?re clearly on the orthographic side. (SC2/WG3 would agree, since they deemed "?" and "?" worth encoding in 8-bit charsets.) It started when I found in the XKB keysymdef.h four dead keysyms added for Karl Pentzlin?s German T3, among which dead_lowline, and remembered that at some point in history, users were deprived of the means of typing a combining underscore. I didn?t think at the extra letterspacing (called ?gesperrt? spaced out in German) that Mark E. Shoulson mentioned upthread, (a) because it isn?t used for that purpose in the locale I?m working for, and (b) because emulating it with interspersed NARROW NO-BREAK SPACEs would make that text unsearchable. > > If stringing encoded italic Latin letters into words is an abuse of > Unicode, then stringing punctuation characters to simulate a "smiley" > (?) is an abuse of ASCII - because that's not what those punctuation > characters are *for*. If my brain parses such italic strings into > recognizable words, then I guess my brain is non-compliant. I think that like Google Search having extensive equivalence classes treating mathematical letters like plain ASCII, text-to-speech software could use a little bit of AI to recognize strings of those letters as ordinary words with emphasis, like James Kass suggested ? the more as we?re actually able to add combining diacritics for correct spelling in some diacriticized alphabets (including a few with non-decomposable diacritics), though with somewhat less-than-optimal diacritic placement in many cases in the actual state of the art ? and also parse ASCII art correspondingly, unlike what happened in another example shared on Twitter downthread of the math letters tweet: https://twitter.com/ourelectra/status/1083367552430989315 Thanks, Marcel From unicode at unicode.org Sat Jan 12 19:22:08 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 Jan 2019 01:22:08 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> Message-ID: Asmus Freytag wrote, > ...What this teaches you is that italicizing (or boldfacing) > text is fundamentally related to picking out parts of your > text in a different font. Typically from the same typeface, though. 
> So those screen readers got it right, except that they could > have used one of the more typical notational conventions that > the mathalphabetics are used to express (e.g. "vector" etc.), > rather than rattling off the Unicode name. WRT text-to-voice applications, such as "VoiceOver", I wonder how well they would do when encountering /any/ exotic text runs or characters.? Like Yi, or Vai, or even an isolated CJK ideograph in otherwise Latin text.? For example:? "The Han radical # 72, which looks like '?', means 'sun'."? Would the application "say" the character as a Japanese reader would expect to hear it?? Or in one of the Chinese dialects?? Or would the application just give the hex code point? In an era where most of the states in my country no longer teach cursive writing in public schools, it seems unlikely that Twitter users (and so forth) will be clamoring for the ability to implement Chicago Style text properly on their cell phone screens.? (Many users would probably prefer to use the cell phone to order a Chicago style pizza.)? But, stranger things have happened. From unicode at unicode.org Sat Jan 12 20:15:35 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 12 Jan 2019 21:15:35 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> References: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Message-ID: <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> Just to add some more fuel for this fire, I note also the highly popular (in some places) technique of using Unicode letters that may have nothing whatsoever to do with the symbol or letter you mean to represent, apart from coincidental resemblance and looking "cool" enough.? This happens a lot on Second Life, where you can set your "display name" distinct from your "user name", but the display name appears to be limited to Unicode *letters* and some punctuation, mostly, and certainly can't be outside the BMP.? So for a sampling from stuff I've heard of... ?bi??? S???lS?ul ?P??D ???????? ?? ?? ?ud? ??itm?? ????? B??D???? ?L????? ????? Fashionablez ????? ?ha?g ???? M??????? ??????? ? . ?u ?u ????? ???? ????? ??n ???o ?'M ??????? ??????? ?????MM??? ?????? ????? ??????u?? ??????? ???? ?r?? ?????? ? ?? ? ? ? ? :. ??Z?R? ????? ?J?????? ?cH ???????? ????? ? Amy ? ????? G?????L? ?????t Wu?????? ?h??h? ??c??????? ???? Jarah Sparks???? ?? fleur ?? ????? ?????? ???- Pandora Barbaros???- ? ??????? ?-x- ????? ??u??? ?l?? ???l?? ???? ????? ?? Gatatem ????? ??? I could do more searching... Some of these things are even more common than shown here.? Using ? for a heart ? is extremely widespread, and decorations like ? and ? abound.? Note some decorations involving ? with some Arabic(!) combining characters. 
Note the use of Hebrew and Arabic and CJK and other characters to represent Latin letters to which they bear only a passing resemblance.? There are also a lot of names in all small-caps or all full-width (I didn't include any examples of just that because they seemed so ordinary), or "inverted"? ?uo???s?? ??nsn ??? u?? I don't know what, precisely, this argues for or against.? Would people deny that this is an "abuse" of the character-set, even though people are doing it and it works for them?? The medium is pretty indisputably plain-text.? Should all this kind of thing be somehow made to "work" for these creative, if mystifying, people? These are clearly pretty far-out examples (though not extreme, compared to what's out there, nor uncommon, from what I have been told.) This discussion has been very interesting, really.? I've heard what I thought were very good points and relevant arguments from both/all sides, and I confess to not being sure which I actually prefer.? Just giving you more to think about... ~mark From unicode at unicode.org Sat Jan 12 20:17:34 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 Jan 2019 02:17:34 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <7f64c5a1.9721.16842e34f54.Webtop.70@btinternet.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <7f64c5a1.9721.16842e34f54.Webtop.70@btinternet.com> Message-ID: On 2019-01-12 4:26 PM, wjgo_10009 at btinternet.com wrote: > I have now made, tested and published a font, VS14 Maquette, that uses > VS14 to indicate italic. > > https://forum.high-logic.com/viewtopic.php?f=10&t=7831&p=37561#p37561 > The italics don't happen in Notepad, but VS14 Maquette works spendidly in LibreOffice!? (Windows 7)? (In a *.TXT file) Since the VS characters are supposed to be used with officially registered/recognized sequences, it's possible that Notepad isn't trying to implement the feature. The official reception of the notion of using variant letter forms, such as italics, in plain-text is typically frosty.? So advancement of plain-text might be left up to third-party developers, enthusiasts, and the actual text users.? And there's nothing wrong with that.? (It's non-conformant, though, unless the VS material is officially recognized/registered.) Non-Latin scripts, such as Khmer, may have their own traditions and conventions WRT special letter forms.? Which is why starting at VS14 and working backwards might be inadequate in the long run. Khmer has letter forms called muul/moul/muol (not sure how to spell that one, but neither is anybody else).? It superficially resembles fraktur for Khmer.? Other non-Latin scripts may have a plethora of such forms/fonts/styles. 
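The convention William's font implements is simple to generate. A minimal sketch, assuming Python 3; the choice of VS14 just follows the posts above, and nothing here is a registered variation sequence: each letter is followed by VS14 so that a font which maps <letter, VS14> pairs to italic glyphs can show the run as italic, while the underlying letters stay unchanged for searching and collation.

    VS14 = "\uFE0D"   # VARIATION SELECTOR-14

    def tag_with_vs14(text: str) -> str:
        # Follow every base character with VS14; VS-ignorant software still sees
        # the same letters, because variation selectors are default-ignorable.
        return "".join(ch + VS14 for ch in text)

    sample = tag_with_vs14("italicize")
    print(sample.replace(VS14, "") == "italicize")   # True: the word itself is intact
    print(len(sample))                               # 18: nine letters plus nine selectors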
From unicode at unicode.org Sat Jan 12 22:24:29 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 Jan 2019 04:24:29 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> Message-ID: <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> Mark E. Shoulson wrote, > This discussion has been very interesting, really.? I've heard what I > thought were very good points and relevant arguments from both/all > sides, and I confess to not being sure which I actually prefer. It's subjective, really.? It depends on how one views plain-text and one's expectations for its future.? Should plain-text be progressive, regressive, or stagnant?? Because those are really the only choices.? And opinions differ. Most of us involved with Unicode probably expect plain-text to be around for quite a while.? The figure bandied about in the past on this list is "a thousand years".? Only a society of mindless drones would cling to the past for a millennium.? So, many of us probably figure that strictures laid down now will be overridden as a matter of course, over time. Unicode will probably be around for awhile, but the barrier between plain- and rich-text has already morphed significantly in the relatively short period of time it's been around. I became attracted to Unicode about twenty years ago.? Because Unicode opened up entire /realms/ of new vistas relating to what could be done with computer plain text.? I hope this trend continues. From unicode at unicode.org Sun Jan 13 02:20:36 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Sun, 13 Jan 2019 08:20:36 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> Message-ID: <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> On 2019/01/13 13:24, James Kass via Unicode wrote: > > Mark E. 
Shoulson wrote, > > This discussion has been very interesting, really. I've heard what I > thought were very good points and relevant arguments from both/all > sides, and I confess to not being sure which I actually prefer. > > It's subjective, really. It depends on how one views plain-text and > one's expectations for its future. Should plain-text be progressive, > regressive, or stagnant? Because those are really the only choices. And > opinions differ. I'd say it should be conservative. As the meaning of that word (similar to others such as progressive and regressive) may be interpreted in various ways, here's what I mean by that. It should not take up and extend every little fad in the blink of an eye. It should wait to see what the real needs are, and what may be just a temporary fad. As the Mathematical style variants show, once characters are encoded, it's difficult to get people off using them, even in ways not intended. Emoji have often been cited in this thread. But there are some important observations: 1) Emoji were added to Unicode only after it turned out that they were widely used in Japanese character encodings, and dripping into Unicode-based systems in large numbers but without any clearly assigned code points. The Unicode Consortium didn't start encoding them because they thought emoji were cute or progressive or anything like that. 2) The Unicode Consortium is continuing to hold down the number of newly encoded emoji by using an approximate limit for each year and a strict process. 3) The Unicode Consortium is somewhat motivated to encode new emoji because of the publicity surrounding them. That publicity might subside sooner or later. It's difficult to imagine the same kind of publicity for italics and friends. > Most of us involved with Unicode probably expect plain-text to be around > for quite a while. The figure bandied about in the past on this list is > "a thousand years". Only a society of mindless drones would cling to > the past for a millennium. So, many of us probably figure that > strictures laid down now will be overridden as a matter of course, over > time. > > Unicode will probably be around for awhile, but the barrier between > plain- and rich-text has already morphed significantly in the relatively > short period of time it's been around. Because whatever is encoded can't be "unencoded", it's clear that we can only move in one direction, and not back. But because we want Unicode to work for a long, long time, it's very important to be conservative. > I became attracted to Unicode about twenty years ago. Because Unicode > opened up entire /realms/ of new vistas relating to what could be done > with computer plain text. I hope this trend continues. I hope this trend only continues very slowly, if at all. Regards, Martin.
From unicode at unicode.org Sun Jan 13 02:22:37 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Sun, 13 Jan 2019 08:22:37 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> Message-ID: <85c6455c-aae0-510e-63ed-69358ce96a2f@it.aoyama.ac.jp> On 2019/01/13 03:50, Asmus Freytag via Unicode wrote: > To reiterate, if you effectively require a span (even if you could simulate that > differently) you are in the realm or rich text. The one big exception to that is > bidi, because it is utterly impossible to do bidi text without text ranges. > Therefore, Unicode plain text explicitly violates that principle in favor of > achieving a fundamental goal of universality, that is being able to include the > bidi languages. Yes, and in HTML, where higher-level (span-based) mechanisms are available, it is preferred to use these rather than the bidi control characters. Regards, Martin. From unicode at unicode.org Sun Jan 13 10:44:58 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sun, 13 Jan 2019 16:44:58 +0000 (GMT) Subject: A last missing link for interoperable representation References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: On 2019-01-12, James Kass via Unicode wrote: > Sounds like you didn't try it.? VS characters are default ignorable. By software that has a full understanding of Unicode. There is a very large world out there of software that was written before Unicode was dreamed of, let alone popular. > apricot > a?p?r?i?c?o?t? > Notepad finds them both if you type the word "apricot" into the search box. What has Notepad to do with me? > "But for plain text, it's crazy." > > Are you a member of the plain-text user community? Certainly:) -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
From unicode at unicode.org Sun Jan 13 10:46:42 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sun, 13 Jan 2019 16:46:42 +0000 (GMT) Subject: A last missing link for interoperable representation References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> Message-ID: On 2019-01-12, Richard Wordingham via Unicode wrote: > On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) > Julian Bradfield via Unicode wrote: > >> It's also fundamentally misguided. When I _italicize_ a word, I am >> writing a word composed of (plain old) letters, and then styling the >> word; I am not composing a new and different word ("_italicize_") that >> is distinct from the old word ("italicize") by virtue of being made up >> of different letters. > > And what happens when you capitalise a word for emphasis or to begin a > sentence? Is it no longer the same word? Indeed. As has been observed up-thread, the casing idea is a dumb one! We are, however, stuck with it because of legacy encoding transported into Unicode. We aren't stuck with encoding fonts into Unicode. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Sun Jan 13 10:52:25 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sun, 13 Jan 2019 16:52:25 +0000 (GMT) Subject: A last missing link for interoperable representation References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: On 2019-01-12, James Kass via Unicode wrote: > This is a math formula: > a + b = b + a > ... where the estimable "mathematician" used Latin letters from ASCII as > though they were math alphanumerics variables. Yup, and it's immediately understandable by anyone reading on any computer that understands ASCII. That's why mathematicians write like that in plain text. > This is an italicized word: > ???????????????????????? > ... where the "geek" hacker used Latin italics letters from the math > alphanumeric range as though they were Latin italics letters. 
It's a sequence of question marks unless you have an up to date Unicode font set up (which, as it happens, I don't for the terminal in which I read this mailing list). Since actual mathematicians don't use the Unicode math alphabets, there's no strong incentive to get updated fonts. > Where's the harm? You lose your audience for no reasons other than technogeekery. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Sun Jan 13 14:38:45 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 13 Jan 2019 21:38:45 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: On 13/01/2019 17:52, Julian Bradfield via Unicode wrote: > On 2019-01-12, James Kass via Unicode wrote: >> This is a math formula: >> a + b = b + a >> ... where the estimable "mathematician" used Latin letters from ASCII as >> though they were math alphanumerics variables. > > Yup, and it's immediately understandable by anyone reading on any > computer that understands ASCII. That's why mathematicians write like > that in plain text. As far as the information shared on this list so far goes, mathematicians are both using TeX and liking the Unicode math alphabets. > >> This is an italicized word: >> ???????????????????????? >> ... where the "geek" hacker used Latin italics letters from the math >> alphanumeric range as though they were Latin italics letters. > > It's a sequence of question marks unless you have an up to date > Unicode font set up (which, as it happens, I don't for the terminal in > which I read this mailing list). Since actual mathematicians don't use > the Unicode math alphabets, there's no strong incentive to get updated > fonts. These statements make me fear that the font you are using might not support the NARROW NO-BREAK SPACE U+202F > <. If you see a question mark between these pointy brackets, please let us know. Because then, you're unable to read interoperably usable French text, too, as you'll see double punctuation (e.g. "?!") where a single mark is intended, like here !  There is a crazy typeface out there, misleadingly called 'Courier New', as if the foundry didn't anticipate that at some point it would be better called "Courier Obsolete". Or they did, but… (Referring to CLDR ticket #11423.) BTW if anybody knows a version of Courier New updated to a decent level of Unicode support, please be so kind and share the link so I can spread the word. > >> Where's the harm? > > You lose your audience for no reasons other than technogeekery.
Aiming at extending the subset of environments supporting correct typesetting is no geekery but awareness of our cultural heritage that we?re committed to maintain and to develop, taking it over into the digital world while adapting technology to culture, not conversely. Best regards, Marcel From unicode at unicode.org Sun Jan 13 15:43:11 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Sun, 13 Jan 2019 23:43:11 +0200 Subject: A last missing link for interoperable representation In-Reply-To: References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: <20190113214311.GA1281@macbook.localdomain> On Sun, Jan 13, 2019 at 04:52:25PM +0000, Julian Bradfield via Unicode wrote: > On 2019-01-12, James Kass via Unicode wrote: > > This is an italicized word: > > ???????????????????????? > > ... where the "geek" hacker used Latin italics letters from the math > > alphanumeric range as though they were Latin italics letters. > > It's a sequence of question marks unless you have an up to date > Unicode font set up (which, as it happens, I don't for the terminal in > which I read this mailing list). Since actual mathematicians don't use > the Unicode math alphabets, there's no strong incentive to get updated > fonts. They do, but not necessarily by directly inputting them. LaTeX with the ?unicode-math? package will translate ASCII + font switches to the respective Unicode math alphanumeric characters. Word will do the same. Even browsers rendering MathML will do the same (though most likely the MathML source will have the math alphanumeric characters already). Regards, Khaled From unicode at unicode.org Sun Jan 13 17:36:24 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 Jan 2019 23:36:24 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: Julian Bradfield replied, >> Sounds like you didn't try it.? VS characters are default ignorable. > > By software that has a full understanding of Unicode. There is a very > large world out there of software that was written before Unicode was > dreamed of, let alone popular. ??? ?? ???? ????? ??? ?? ??? ?? ??? ???, ?? ????? ????? (*) ?????? What happens with Devanagari text?? Should the user community refrain from interchanging data because 1980s era software isn't Unicode aware? 
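Khaled Hosny's note above, that tools such as the LaTeX "unicode-math" package and Word map ASCII letters plus font switches onto the Unicode math alphanumeric code points, can be made concrete with a minimal Python sketch. This is an illustration only, not how unicode-math itself is implemented, and the helper name to_math_italic is invented for the example. The one irregularity worth showing is that lowercase italic h is not at U+1D455 (which is unassigned) but at U+210E PLANCK CONSTANT, and that NFKC normalization folds all of these letters back to plain ASCII.

    import unicodedata

    ITALIC_CAPITAL_A = 0x1D434   # U+1D434 MATHEMATICAL ITALIC CAPITAL A
    ITALIC_SMALL_A   = 0x1D44E   # U+1D44E MATHEMATICAL ITALIC SMALL A
    PLANCK_CONSTANT  = "\u210E"  # italic small h lives here; U+1D455 is unassigned

    def to_math_italic(text: str) -> str:
        """Shift ASCII letters into the Mathematical Italic range; leave everything else alone."""
        out = []
        for ch in text:
            if ch == "h":
                out.append(PLANCK_CONSTANT)
            elif "A" <= ch <= "Z":
                out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
            elif "a" <= ch <= "z":
                out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
            else:
                out.append(ch)
        return "".join(out)

    styled = to_math_italic("italicized")
    print(styled)                                 # the word rendered in math italic letters
    print(unicodedata.normalize("NFKC", styled))  # "italicized": the styling does not survive NFKC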
From unicode at unicode.org Sun Jan 13 21:00:36 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Mon, 14 Jan 2019 03:00:36 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> Message-ID: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> On 2019/01/14 01:46, Julian Bradfield via Unicode wrote: > On 2019-01-12, Richard Wordingham via Unicode wrote: >> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) >> And what happens when you capitalise a word for emphasis or to begin a >> sentence? Is it no longer the same word? > > Indeed. As has been observed up-thread, the casing idea is a dumb one! > We are, however, stuck with it because of legacy encoding transported > into Unicode. We aren't stuck with encoding fonts into Unicode. No, the casing idea isn't actually a dumb one. As Asmus has shown, one of the best ways to understand what Unicode does with respect to text variants is that style works on spans of characters (words,...), and is rich text, but thinks that work on single characters are handled in plain text. Upper-case is definitely for most part a single-character phenomenon (the recent Georgian MTAVRULI additions being the exception). UPPER CASE can be used on whole spans of text, but that's not the main use case. And if UPPER CASE is used for emphasis, one way to do it (and the best way if this is actually a styling issue) is to use rich text and mark it up according to semantics, and then use some styling directive (e.g. CSS text-transform: uppercase) to get the desired look. Another criterion is orthography. Schoolchildren learn when to capitalize a word and when not. Teachers check and correct it all the time. Grammar books and books for second language learners discuss capitalization, because it's part of orthography, the rules differ by language, and not getting it right will make the writer look bad. But even most adults won't know the rules for what to italicize that have been brought up in this thread. Even if they have read books that use italic and bold in ways that have been brought up in this thread, most readers won't be able to tell you what the rules are. That's left to copy editors and similar specialist jobs. There was a time when computers (and printers in particular) were single-case. There was some discussion about having to abolish case distinctions to adapt to computers, but fortunately, that wasn't necessary. Regards, Martin. 
From unicode at unicode.org Sun Jan 13 22:31:35 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 13 Jan 2019 20:31:35 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> Message-ID: On Sat, Jan 12, 2019 at 8:26 PM James Kass via Unicode wrote: > It's subjective, really. It depends on how one views plain-text and > one's expectations for its future. Should plain-text be progressive, > regressive, or stagnant? Because those are really the only choices. > And opinions differ. > > Most of us involved with Unicode probably expect plain-text to be around > for quite a while. The figure bandied about in the past on this list is > "a thousand years". Only a society of mindless drones would cling to > the past for a millennium. So, many of us probably figure that > strictures laid down now will be overridden as a matter of course, over > time. And yet you write this in the Latin script that's been around for a couple millennia. Arabic, Han ideographs, Cyrillic and Devanagari have all been around a millennia. Looking back at the history of computing, a large chunk of the underlying technology has hit stability. ARM chips, x86 chips, Unix, and Windows have all been around since 1985 or before, roughly 35 years ago and 35 years since the first programmed computer. They aren't wildly changing. Unicode is moving towards that position; it does a job and doesn't need disrupt changes to continue to be relevant. > Unicode will probably be around for awhile, but the barrier between > plain- and rich-text has already morphed significantly in the relatively > short period of time it's been around. Fixed pictures have been parts of character sets for decades and were part of Unicode 1.1. U+2704, WHITE SCISSORS, for example. And emoji aren't disruptive in the way that moving something that's been a part of the rich-text layer forever into the plain-text layer. > I became attracted to Unicode about twenty years ago. Because Unicode > opened up entire /realms/ of new vistas relating to what could be done > with computer plain text. I hope this trend continues. The right tool for the job. If you need rich text, you should use rich text. Emoji had to make the case that they were being used as characters and there were no competing tools to handle them. -- Kie ekzistas vivo, ekzistas espero. 
From unicode at unicode.org Sun Jan 13 23:06:04 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 13 Jan 2019 21:06:04 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> Message-ID: On Sun, Jan 13, 2019 at 7:03 PM Martin J. D?rst via Unicode wrote: > No, the casing idea isn't actually a dumb one. As Asmus has shown, one > of the best ways to understand what Unicode does with respect to text > variants is that style works on spans of characters (words,...), and is > rich text, but thinks that work on single characters are handled in > plain text. Upper-case is definitely for most part a single-character > phenomenon (the recent Georgian MTAVRULI additions being the exception). I would disagree; upper case is normally used in all caps or title-case, and the latter is used on a word, not a character. I don't argue that Unicode is wrong for handling casing the way it does, but it does massively complicate the processing of any Latin text; virtually all searches should be case-insensitive, for example. At least in English, computerized casing will always be problematic. > UPPER CASE can be used on whole spans of text, but that's not the main > use case. And if UPPER CASE is used for emphasis, one way to do it (and > the best way if this is actually a styling issue) is to use rich text > and mark it up according to semantics, and then use some styling > directive (e.g. CSS text-transform: uppercase) to get the desired look. That's an example of how having multiple systems makes things more complex and less consistent. If something can be written as all upper case with the caps lock key, it will be. If a generated HTML file can have uppercase added with a Python or SQL function, it probably will be. Using CSS text-transform may be best practice, but simpler plain text solutions will be used in a lot of cases and nothing can be extrapolated clearly from its use or lack of use. -- Kie ekzistas vivo, ekzistas espero. 
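David Starner's point that virtually all searches over cased Latin text have to be case-insensitive is easy to demonstrate. The snippet below is a minimal sketch using only Python's standard string methods, with arbitrary example words; it shows why "case-insensitive" has to mean full case folding rather than a simple lower() comparison once characters like the German sharp s are involved.

    queries = ["STRASSE", "Strasse"]
    stored = "Straße"   # sharp s: uppercasing it yields the two-letter "SS"

    for q in queries:
        naive  = q.lower() == stored.lower()        # "strasse" != "straße" -> False
        folded = q.casefold() == stored.casefold()  # both sides fold to "strasse" -> True
        print(q, naive, folded)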
From unicode at unicode.org Sun Jan 13 23:08:37 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 05:08:37 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> Marcel Schneider wrote, > There is a crazy typeface out there, misleadingly called 'Courier New', > as if the foundry didn?t anticipate that at some point it would be better > called "Courier Obsolete". ... ?????? ?????????????? seems a bit ????????? nowadays, as well. (Had to use mark-up for that ?span? of a single letter in order to indicate the proper letter form.? But the plain-text display looks crazy with that HTML jive in it.) From unicode at unicode.org Mon Jan 14 00:02:21 2019 From: unicode at unicode.org (Tex via Unicode) Date: Sun, 13 Jan 2019 22:02:21 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c 266@it.aoyama.ac.jp> Message-ID: <000901d4abce$b781b250$268516f0$@xencraft.com> > But even most adults won't know the rules for what to italicize that > have been brought up in this thread. Even if they have read books that > use italic and bold in ways that have been brought up in this thread, > most readers won't be able to tell you what the rules are. That's left > to copy editors and similar specialist jobs. Most adults don't know the right places to soft-hyphenate a word, and yet we support that in plain-text. They also don't know the differences between the various dashes and spaces and when to use each. Literacy isn't an appropriate criteria. Even the apostrophe fails that test since so many people fail to distinguish its from it's and there from they're. :-) > There was a time when computers (and printers in particular) were > single-case. There was some discussion about having to abolish case > distinctions to adapt to computers, but fortunately, that wasn't necessary. 
Ironic to mention the example of the failure of technology to support linguistic requirements driving a proposal to limit the attributes of language. As you say it was fortunate it wasn't necessary then... It makes the case for the importance of improving technology to support fundamental language attributes. tex From unicode at unicode.org Mon Jan 14 00:19:29 2019 From: unicode at unicode.org (Tex via Unicode) Date: Sun, 13 Jan 2019 22:19:29 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> Message-ID: <001001d4abd1$1becb800$53c62800$@xencraft.com> "Looking back at the history of computing, a large chunk of the underlying technology has hit stability. ARM chips, x86 chips, Unix, and Windows have all been around since 1985 or before, roughly 35 years ago and 35 years since the first programmed computer. They aren't wildly changing." I would encourage you to return to a system of 35 years ago, if you believe they are the same. Performance, pipeline, memory access, device support, graphical capabilities, underlying instructions, security features... One could argue the wheel is medieval and still works today, but the wheels I drive on are designed for a variety of weather conditions, traction, minimal noise generation, light weight with durability and high performance, and are particular to the front or back axle. And I know from experience the wrong wheels can spin me around and ram me into a median... tex From unicode at unicode.org Mon Jan 14 00:24:46 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 06:24:46 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> Message-ID: <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> Martin J. D?rst wrote, > I'd say it should be conservative. 
As the meaning of that word > (similar to others such as progressive and regressive) may be > interpreted in various way, here's what I mean by that. > > It should not take up and extend every little fad at the blink of an > eye. It should wait to see what the real needs are, and what may be > just a temporary fad. As the Mathematical style variants show, once > characters are encoded, it's difficult to get people off using them, > even in ways not intended. A conservative approach to progress is a sensible position for computer character encoders.? Taking a conservative approach doesn't necessarily mean being anti-progress. Trying to "get people off" using already encoded characters, whether or not the encoded characters are used as intended, might give an impression of being anti-progress. Unicode doesn't enforce any spelling or punctuation rules.? Unicode doesn't tell human beings how to pronounce strings of text or how to interpret them.? Unicode doesn't push any rules about splitting infinitives or conjugating verbs. Unicode should not tell people how any written symbol must be interpreted.? Unicode should not tell people how or where to deploy their own written symbols. Perhaps fraktur is frivolous in English text.? Perhaps its use would result in a new convention for written English which would enhance the literary experience.? Italics conventions which have only been around a hundred years or so may well turn out to be just a passing fad, so we should probably give it a bit more time. Telling people they mustn't use Latin italics letter forms in computer text while we wait to see if the practice catches on seems flawed in concept. From unicode at unicode.org Mon Jan 14 01:26:36 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 14 Jan 2019 07:26:36 +0000 (GMT) Subject: A last missing link for interoperable representation References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: On 2019-01-13, Marcel Schneider via Unicode wrote: > As far as the information goes that was running until now on this List, > Mathematicians are both using TeX and liking the Unicode math alphabets. As Khaled has said, if they use them, it's because some software designer has decided to use them to implement markup. I have never seen a Unicode math alphabet character in email outside this list. > These statements make me fear that the font you are using might unsupport > the NARROW NO-BREAK SPACE U+202F >?<. If you see a question mark between It displays as a space. As one would expect - I use fixed width fonts for plain text. > these pointy brackets, please let us know. Because then, You?re unable to > read interoperably usable French text, too, as you?ll see double punctuation > (eg "?!") where a single mark is intended, like here?! I see "like here !". French text does not need narrow spacing any more than science does. 
When doing typography, fifty centimetres is $50\thinspace\mathrm{cm}$; in plain text, 50cm does just fine. Likewise, normal French people writing email write "Quel idiot!", or sometimes "Quel idiot !". If you google that phrase on a few French websites, you'll see that some (such as Larousse, whom one might expect to care about such things) use no space before punctuation, while others (such as some random T-shirt company) use an ASCII space. The Acad?mie Fran?aise, which by definition knows more about French orthography than you do, uses full ASCII spaces before ? and ! on its front page. Also after opening guillemets, which looks even more stupid from an Anglophone perspective. > Aiming at extending the subset of environments supporting correct typesetting There are many fine programs, including TeX, for doing good typesetting. Unicode is not about typesetting, it's about information exchange and preservation. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon Jan 14 01:28:18 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 14 Jan 2019 07:28:18 +0000 (GMT) Subject: A last missing link for interoperable representation References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> Message-ID: On 2019-01-14, James Kass via Unicode wrote: > ?????? ?????????????? seems a bit ????????? nowadays, as well. > > (Had to use mark-up for that ?span? of a single letter in order to > indicate the proper letter form.? But the plain-text display looks crazy > with that HTML jive in it.) Indeed. But _Art nouveau_ seems a bit _pass?_ nowadays looks fine and is understood even by those who have never annotated a manuscript with proof corrections. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon Jan 14 01:47:45 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 14 Jan 2019 07:47:45 +0000 (GMT) Subject: A last missing link for interoperable representation References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: On 2019-01-13, James Kass via Unicode wrote: > ??? ?? ???? ????? ??? ?? ??? ?? ??? ???, ?? ????? ????? (*) ?????? > What happens with Devanagari text?? 
Should the user community refrain > from interchanging data because 1980s era software isn't Unicode aware? Devanagari is an established writing system (which also doesn't need separate letters for different typefaces). Those who wish to exchange information in devanagari will use either an ISCII or Unicode system with suitable font support. Just as those who wish to exchange English text with typographic detail will use a suitable typographic mark-up system with font support, which will typically not interfere with plain text searching. Even in a PDF document, "art nouveau" will appear as "art nouveau" whatever font it's in. Incidentally, a large chunk of my facebook feed is Indian politics, and of that portion of it that is in Hindi or other Indian languages, most is still written in ASCII transcription, even though every web browser and social media application in common use surely has full Unicode support these days. Sometimes using your own writing system is just too much effort! -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon Jan 14 01:56:43 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 07:56:43 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> Julian Bradfield wrote, > I have never seen a Unicode math alphabet character in email > outside this list. It's being done though.? Check this message from 2013 which includes the following, copy/pasted from the web page into Notepad: ???????? ???? ????????.??????????????????? ? ???????? ???????? ????????? ????????????.??????/???????????????????? https://apple.stackexchange.com/questions/104159/what-are-these-characters-and-how-can-i-use-them From unicode at unicode.org Mon Jan 14 02:40:58 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 14 Jan 2019 08:40:58 +0000 (GMT) Subject: A last missing link for interoperable representation References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> Message-ID: On 2019-01-14, James Kass via Unicode wrote: > Julian Bradfield wrote, > > I have never seen a Unicode math alphabet character in email > > outside this list. 
> > It's being done though.? Check this message from 2013 which includes the > following, copy/pasted from the web page into Notepad: > > ???????? ???? ????????.??????????????????? ? ???????? ???????? ????????? > ????????????.??????/???????????????????? > > https://apple.stackexchange.com/questions/104159/what-are-these-characters-and-how-can-i-use-them Which makes the point very nicely. They're not being *used* to do maths, they're being played with for purely decorative purposes, and moreover in a way which breaks the actual intended use as a URL. If you introduce random stuff into Unicode, people will play with it (or use it for phishing). The whole thread is, as it says, "what is this weird stuff"? -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon Jan 14 02:48:05 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 Jan 2019 08:48:05 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: <20190114084805.596db197@JRWUBU2> On Mon, 14 Jan 2019 07:47:45 +0000 (GMT) Julian Bradfield via Unicode wrote: > On 2019-01-13, James Kass via Unicode wrote: > > ??? ?? ???? ????? ??? ?? ??? ?? ??? ???, ?? ????? ????? (*) ?????? > > > What happens with Devanagari text?? Should the user community > > refrain from interchanging data because 1980s era software isn't > > Unicode aware? > > Devanagari is an established writing system (which also doesn't need > separate letters for different typefaces). Those who wish to exchange > information in devanagari will use either an ISCII or Unicode system > with suitable font support. Has ISCII kept abreast of additions to the encoded Devanagari script? Hindi may be an established writing system, but Vedic Sanskrit with a full details is another matter. Even with full Unicode support, having a 'suitable font' is an issue with 'plain text', even deprecated plain text. The problems are that writers of Hindi don't want to have to manually suppress ligature formation, and it doesn't help that tables of Hidi conjuncts don't express the difference between real and fake viramas. (The difference surfaces with preposed vowels.) > Just as those who wish to exchange English text with typographic > detail will use a suitable typographic mark-up system with font > support, which will typically not interfere with plain text searching. > Even in a PDF document, "art nouveau" will appear as "art nouveau" > whatever font it's in. But "art nouveau" is ASCII. Copying truly complex Indic from a PDF is still something of an adventure. > Incidentally, a large chunk of my facebook feed is Indian politics, > and of that portion of it that is in Hindi or other Indian > languages, most is still written in ASCII transcription, even though > every web browser and social media application in common use surely > has full Unicode support these days. 
I don't believe the USE has been added to IE 11, and certainly not on Windows 7. And I fear that of OpenType fonts, only mine widely support Tai Tham as documented on the Unicode site. (And 'widely' excludes IE 11, but not MS Edge.) A fair few Tai Tham fonts rely on being permitted to bypass the script-specific support, which the Windows stack only permits to privileged scripts. Richard. From unicode at unicode.org Mon Jan 14 03:06:47 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 09:06:47 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> Message-ID: <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> Not a twitter user, don't know how popular the practice is, but here's a couple of links concerned with how to use bold or italics in Twitter plain text messages. https://www.simplehelp.net/2018/03/13/how-to-use-bold-and-italicized-text-on-twitter/ https://mothereff.in/twitalics Both pages include a form of caveat.? But the caveat isn't about the intended use of the math alphanumerics. The first page includes the following text as part of a "tweet": Just because you ?????? doesn?t mean you ???????????? :) And, as before, I have no idea how /popular/ the practice is.? But here's some more links: (web page from 2013) How To Write In Italics, Tweet Backwards And Use Lots Of Different ... https://www.adweek.com/digital/twitter-font-italics-backwards/ (This is copy/pasted *as-is* from the web page to plain-text) Bold and Italic Unicode Text Tool - ???????? ?????? ?????????????? - YayText https://yaytext.com/bold-italic/ Super cool unicode text magic. Write ???????? and/or ???????????? updates on Facebook, Twitter, and elsewhere. Bold (serif) preview copy tweet. Michael Maurino [emoji redacted-JK] on Twitter: "Can I make italics on twitter? 'cause ... https://twitter.com/iron_stylus/status/281991180064022528?lang=en Charlie Brooker on Twitter: "How do you do italics on this thing again?" https://twitter.com/charltonbrooker/status/484623185862983680?lang=en How to make your Facebook and Twitter text bold or italic, and other ... https://boingboing.net/2016/04/10/yaytext-unicode-text-styling.html Apr 10, 2016 - For years I've been using the Panix Unicode Text Converter to create ironic, weird or simply annoying text effects for use on Twitter, Facebook ... How to change your Twitter font | Digital Trends https://www.digitaltrends.com/.../now-you-can-use-bold-italics-and-other-fancy-fonts-... Aug 14, 2013 - now you can use bold italics and other fancy fonts on twitter isaac ... or phrase into your Twitter text box, and there you have it: fancy tweets. Twitter Fonts Generator (???????? ?????? ??????????) ? LingoJam https://lingojam.com/TwitterFonts You might have noticed that some users on Twitter are able to change the font ... them to seemingly make their tweet font bold, italic, or just completely different. 
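Since the generators listed above all work by substituting math alphanumeric letters for ASCII ones, a consumer that needs the underlying text back (a search index, a screen reader) can fold it with NFKC compatibility normalization, which maps those letters to ordinary ASCII. The snippet below is a minimal sketch of that step, with an arbitrarily chosen example word; it is not how any of the linked tools operate.

    import unicodedata

    SANS_BOLD_SMALL_A = 0x1D5EE   # U+1D5EE MATHEMATICAL SANS-SERIF BOLD SMALL A
    fancy = "stay " + "".join(chr(SANS_BOLD_SMALL_A + ord(c) - ord("a")) for c in "bold")

    print(unicodedata.name(fancy[5]))                       # MATHEMATICAL SANS-SERIF BOLD SMALL B
    print(unicodedata.normalize("NFKC", fancy))             # "stay bold", plain ASCII again
    print("bold" in unicodedata.normalize("NFKC", fancy))   # True, so a folded search still finds it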
From unicode at unicode.org Mon Jan 14 03:45:42 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Mon, 14 Jan 2019 09:45:42 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> Message-ID: <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> Hello James, others, From the examples below, it looks like a feature request for Twitter (and/or Facebook). Blaming the problem on Unicode doesn't seem to be appropriate. Regards, Martin. On 2019/01/14 18:06, James Kass via Unicode wrote: > > Not a twitter user, don't know how popular the practice is, but here's a > couple of links concerned with how to use bold or italics in Twitter > plain text messages. > > https://www.simplehelp.net/2018/03/13/how-to-use-bold-and-italicized-text-on-twitter/ > > https://mothereff.in/twitalics > > Both pages include a form of caveat.? But the caveat isn't about the > intended use of the math alphanumerics. > > The first page includes the following text as part of a "tweet": > Just because you ?????? doesn?t mean you ???????????? :) > > And, as before, I have no idea how /popular/ the practice is.? But > here's some more links: > > (web page from 2013) > How To Write In Italics, Tweet Backwards And Use Lots Of Different ... > https://www.adweek.com/digital/twitter-font-italics-backwards/ > > (This is copy/pasted *as-is* from the web page to plain-text) > Bold and Italic Unicode Text Tool - ???????? ?????? ?????????????? - > YayText > https://yaytext.com/bold-italic/ > Super cool unicode text magic. Write ???????? and/or ???????????? > updates on Facebook, Twitter, and elsewhere. Bold (serif) preview copy > tweet. > > Michael Maurino [emoji redacted-JK] on Twitter: "Can I make italics on > twitter? 'cause ... > https://twitter.com/iron_stylus/status/281991180064022528?lang=en > > Charlie Brooker on Twitter: "How do you do italics on this thing again?" > https://twitter.com/charltonbrooker/status/484623185862983680?lang=en > > How to make your Facebook and Twitter text bold or italic, and other ... > https://boingboing.net/2016/04/10/yaytext-unicode-text-styling.html > Apr 10, 2016 - For years I've been using the Panix Unicode Text > Converter to create ironic, weird or simply annoying text effects for > use on Twitter, Facebook ... > > How to change your Twitter font | Digital Trends > https://www.digitaltrends.com/.../now-you-can-use-bold-italics-and-other-fancy-fonts-... > > Aug 14, 2013 - now you can use bold italics and other fancy fonts on > twitter isaac ... or phrase into your Twitter text box, and there you > have it: fancy tweets. > > Twitter Fonts Generator (???????? ?????? ??????????) ? LingoJam > https://lingojam.com/TwitterFonts > You might have noticed that some users on Twitter are able to change the > font ... them to seemingly make their tweet font bold, italic, or just > completely different. 
> From unicode at unicode.org Mon Jan 14 03:57:18 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Mon, 14 Jan 2019 09:57:18 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> Message-ID: Hello James, others, On 2019/01/14 15:24, James Kass via Unicode wrote: > > Martin J. D?rst wrote, > > > I'd say it should be conservative. As the meaning of that word > > (similar to others such as progressive and regressive) may be > > interpreted in various way, here's what I mean by that. > > > > It should not take up and extend every little fad at the blink of an > > eye. It should wait to see what the real needs are, and what may be > > just a temporary fad. As the Mathematical style variants show, once > > characters are encoded, it's difficult to get people off using them, > > even in ways not intended. > > A conservative approach to progress is a sensible position for computer > character encoders.? Taking a conservative approach doesn't necessarily > mean being anti-progress. > > Trying to "get people off" using already encoded characters, whether or > not the encoded characters are used as intended, might give an > impression of being anti-progress. Using the expression "get people off" was indeed somewhat ambiguous. Of course we cannot forbid people to use Mathematical alphanumerics. There's no standards police, neither for Unicode nor most other standards. > Unicode doesn't enforce any spelling or punctuation rules.? Unicode > doesn't tell human beings how to pronounce strings of text or how to > interpret them.? Unicode doesn't push any rules about splitting > infinitives or conjugating verbs. > > Unicode should not tell people how any written symbol must be > interpreted.? Unicode should not tell people how or where to deploy > their own written symbols. Yes. But Unicode can very well say: These characters are for Math, and if you use them for anything else, that's your problem, and because they are used for Math, they support what's used in Math, and we won't add copies of accented characters or variant characters for style or [your proposal goes here] because that's not what Unicode is about. If you want real styling, then use applications that can do that, or try to convince your application provider to provide that. (Well, Unicode is more or less saying just exactly that currently.) And that's what I meant with "getting people off". If that then leads to less people (mis)using these characters, all the better. > Perhaps fraktur is frivolous in English text.? Perhaps its use would > result in a new convention for written English which would enhance the > literary experience.? 
Italics conventions which have only been around a > hundred years or so may well turn out to be just a passing fad, so we > should probably give it a bit more time. There's no need to give italic conventions more time. Of course they may die out, but they are very active now. And they are very actively supported in rich text, where they belong. > Telling people they mustn't use Latin italics letter forms in computer > text while we wait to see if the practice catches on seems flawed in > concept. The practice is already there. Lots of people use italics in rich text. That's just fine because that's the right thing to do. We don't need to muddy the waters. Regards, Martin. From unicode at unicode.org Mon Jan 14 04:08:04 2019 From: unicode at unicode.org (Tex via Unicode) Date: Mon, 14 Jan 2019 02:08:04 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> References: <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> Message-ID: <000901d4abf1$0ae95980$20bc0c80$@xencraft.com>

This thread has gone on for a bit and I question if there is any more light that can be shed. BTW, I admit to liking Asmus' definition, functions that span text, as a criterion for rich text. I also liked James' examples of the Twitter use case.

The arguments against italics seem to be:
- Unicode is plain text; italics is rich text.
- We haven't had it until now, so we don't need it.
- There are many rich-text solutions, such as HTML.
- There are ways to indicate or simulate italics in plain text, including using underscores or other characters, using characters that look italic (e.g. math), etc.
- Adding italicization might break existing software.
- The examples of existing Unicode characters that seem to represent rich text (emoji, interlinear annotation, et al.) have justifications.

The case for it is:
- Plain text still has tremendous utility, and rich text is not always an option.
- Simulations of italics are non-standard and therefore hurt interoperability: math characters are not supported universally, and underscores and other indicators are not a standard, nor are alternative fonts.
- There are legitimate needs for a standardized approach for interchange, accessibility (e.g. screen readers), search, Twitter, et al.
- Evidence of the demand is perhaps demonstrated by the number of simulations, and by the requests to vendors of plain-text apps (such as Twitter) for how to implement it.
- Supporting italics can be implemented without breaking existing documents and should be easily supported in modern Unicode apps.
- The impact on the standard of adding a character for italics (and another for bold, and perhaps a couple of others) is minuscule, as it fits into the VS model.
- The argument that italics is rich text is an ideological one.
However, as with other examples, there are cases where practicality should win out. ? This isn?t a slippery slope. Personally, I think the cost seems very low, both to the standard and to implementers. I don?t see a lot of risk that it will break apps. (At least not those that wouldn?t be broken by VS or other features in the standard.) It will help many apps. I think the benefits to interoperability, accessibility, search, standardization of text are significant. Perhaps the question should be put to twitter, messaging apps, text-to-voice vendors, and others whether it will be useful or not. If the discussion continues I would like to see more of a cost/benefit analysis. Where is the harm? What will the benefit to user communities be? tex -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 04:30:58 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 10:30:58 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> Message-ID: Hello Martin, others... > Blaming the problem on Unicode doesn't seem to be appropriate. I don't consider that there's any problem with plain text users exchanging plain text.? I give Unicode /credit/ for being the foundation of that ability.? Anyone imagining that I'm casting blame is under a misconception. There's plain text data out there stringing math alphanumerics into recognizable words.? It's being stored and shared and indexed.? I have no problem with that; I'm in favor of it. (Everyone, please let's focus on Tex Texin's latest post.? Wish I'd sent this post before his...) Best regards, James Kass From unicode at unicode.org Mon Jan 14 07:19:03 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 14 Jan 2019 14:19:03 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> Message-ID: <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> > On 14 Jan 2019, at 06:08, James Kass via Unicode wrote: > > ?????? ?????????????? 
seems a bit ????????? nowadays, as well. > > (Had to use mark-up for that ?span? of a single letter in order to indicate the proper letter form. But the plain-text display looks crazy with that HTML jive in it.) How about using U+0301 COMBINING ACUTE ACCENT: ??????????? From unicode at unicode.org Mon Jan 14 07:44:52 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 14 Jan 2019 14:44:52 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <20190113214311.GA1281@macbook.localdomain> References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <20190113214311.GA1281@macbook.localdomain> Message-ID: <7141EA4B-238B-4B21-B3E6-B7AB23C7023B@telia.com> > On 13 Jan 2019, at 22:43, Khaled Hosny via Unicode wrote: > > LaTeX with the > ?unicode-math? package will translate ASCII + font switches to the > respective Unicode math alphanumeric characters. Word will do the same. > Even browsers rendering MathML will do the same (though most likely the > MathML source will have the math alphanumeric characters already). For full translation, one probably has to use ConTexT and LuaTeX. Then, along with PDF, one can also generate HTML with MathML. From unicode at unicode.org Mon Jan 14 09:38:26 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 14 Jan 2019 16:38:26 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> Message-ID: On 14/01/2019 04:00, Martin J. D?rst via Unicode wrote: [?] > [?] As Asmus has shown, one of the best ways to understand what > Unicode does with respect to text variants is that style works on > spans of characters (words,...), and is rich text, but thinks that > work on single characters are handled in plain text. Upper-case is > definitely for most part a single-character phenomenon (the recent > Georgian MTAVRULI additions being the exception). Obviously the single-character rule also applies to superscript when used as ordinal indicator or more generally, as abbreviation indicator. Thanks for the hint, it?s all about interoperability and in this case too the point in using preformatted characters is a good one IIUC. Sorry for getting a little off-topic. 
There's also one reply on my to-do list where I'll do even more so; can't help it, given that it's our digital representation that's at stake, and due to past neglect on either side there's still a need to painfully lobby for each character while so many other important issues are out there… Best Regards, Marcel From unicode at unicode.org Mon Jan 14 13:14:34 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 14 Jan 2019 20:14:34 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> Message-ID: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> On 14/01/2019 06:08, James Kass via Unicode wrote: > > Marcel Schneider wrote, > >> There is a crazy typeface out there, misleadingly called 'Courier >> New', as if the foundry didn't anticipate that at some point it >> would be better called "Courier Obsolete". ... > > ?????? ?????????????? seems a bit ????????? nowadays, as well. > > (Had to use mark-up for that 'span' of a single letter in order to > indicate the proper letter form. But the plain-text display looks > crazy with that HTML jive in it.) > I apologize for seeming to question the font name ?????? ???? while targeting only the fact that this typeface is not updated to support the NNBSP. It just looks like the grand name is now misused to make people believe that if **this** great font is unsupporting it, it has a good reason to do so, and we should keep people off using that "exotic whitespace" otherwise than "intended", i.e. for Mongolian. Since fortunately TUS started backing its use in French (2014) and ended up raising this usage to the first place, I can't see why major vendors are both using this obsolete font as the monospace default in main software *and* do not seem to be thinking of updating its coverage. OK, in fact I *can* see a "good" reason, which I've hinted at in the cited ticket, but I won't dump it on the List again and again. Thanks for pointing out the flaw in my wording.
Best regards, Marcel From unicode at unicode.org Mon Jan 14 15:21:13 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 13:21:13 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 15:42:40 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 14 Jan 2019 22:42:40 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: On 14/01/2019 08:26, Julian Bradfield via Unicode wrote: > On 2019-01-13, Marcel Schneider via Unicode wrote: [?] >> These statements make me fear that the font you are using might unsupport >> the NARROW NO-BREAK SPACE U+202F >?<. If you see a question mark between > > It displays as a space. As one would expect - I use fixed width fonts > for plain text. It?s mainly that I suspected you could be using Courier New in the terminal. It?s default for plain text in main browsers, and there are devices whose copy of Courier New shows a .notdef box for U+202F. That?s at least what I ?nderstood from the feedback, and a test in my browser looked likewise. > >> these pointy brackets, please let us know. Because then, You?re unable to >> read interoperably usable French text, too, as you?ll see double punctuation >> (eg "?!") where a single mark is intended, like here?! > > I see "like here !". That?s fine, your font has support for . Thanks for reporting. The reason why I?m anxious to see that checked is that the impact on implementations of as the group separator is being assessed. > French text does not need narrow spacing any more than science does. > When doing typography, fifty centimetres is $50\thinspace\mathrm{cm}$; > in plain text, 50cm does just fine. By ?plain text? you probably mean *draft style*. I?m thinking that because "$50\thinspace\mathrm{cm}$" is not less plain text than "50cm". Indeed, in not understanding that sooner I was an idiot, naively believing that all Unicode List Members are using Unicode terminology. 
Turns out that that cannot be taken for granted, any more than knowing the preferences of French people regarding French text display while not being a Frenchman:

1. Most French people prefer that big punctuation be spaced off from the word it pertains to.

2. Most French people strongly dislike punctuation cut off by a line break, but cannot fix it because: a) the ordinary keyboard layout has no non-breaking spaces; b) the NNBSP readily available on peculiar keyboard layouts is buggy in most e-mail composers, ending up as breakable.

3. A significant part of French people strongly dislike angle quotes that are spaced off too far, as happens when using the ordinary NO-BREAK SPACE (U+00A0).

> Likewise, normal French people writing email write "Quel idiot!", or
> sometimes "Quel idiot !".

Normal people using normal keyboard layouts write with the readily available characters most of the time. This is why (to pick another example) French people abbreviate “numéro” to "nº", while on a British English or an American English keyboard layout we can’t normally expect anything else than "no", or "#" for “Number.” We’re not trying to keep people from writing fast and in draft style. What in the Unicode era every locale is expected to achieve is to enable normal users to get the accurate interoperable representation of their language while typing fast, as opposed to coding in TeX, which is like using InDesign with system spaces instead of Unicode. System spaces are not interoperable, nor is LaTeX \thinspace if it is non-breakable in LaTeX, which it obviously is, since it is used to represent the thin space between a number and a measurement unit. In Unicode as we know it, U+2009 THIN SPACE is breakable, and the worst thing here is that its duplicate encoding U+2008 PUNCTUATION SPACE is breakable too, instead of being non-breakable like U+2007 FIGURE SPACE. That is why there was a need to add U+202F NARROW NO-BREAK SPACE later. (More details in the cited CLDR ticket.)

> If you google that phrase on a few French websites, you'll see that
> some (such as Larousse, whom one might expect to care about such
> things) use no space before punctuation,

Thanks for catching that; the flaw shall be reported with a link to your email. You may also wish to look up this page:

https://communaute.lerobert.com/forum/LE-ROBERT-CORRECTEUR/LE-ROBERT-CORRECTEUR-CORRECTION-D-ORTHOGRAPHE-DICTIONNAIRES-ET-GUIDES/Espace-entre-le-meotet-le-point-d-interrogation/2918628/398261

reading: « Le logiciel Le Robert correcteur justement signale les espaces fines insécables si elles ne sont pas présentes sur le texte et propose la correction. » (“The Le Robert spellchecker does report the lack of narrow no-break spaces and proposes to fix it.”)

> while others (such as some
> random T-shirt company) use an ASCII space.
>
> The Académie Française, which by definition knows more about French
> orthography than you do, uses full ASCII spaces before ? and ! on its
> front page. Also after opening guillemets, which looks even more
> stupid from an Anglophone perspective.

(See point 3 above.) That is a very good point. Indeed this website is reasonably expected to be an example and a template of correct French typesetting. There are several reasons why actually it is not. The main reason is that it is not the work of the A.F. itself, but of webdesigners, webmasters and content managers, who are normal people like for any other website.
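(The mechanical part of the convention is small; here is a minimal sketch, assuming Python. The rule set is deliberately simplified, the colon in particular is commonly treated differently, and this is not a claim about any particular style guide.)

    import re

    NNBSP = '\u202F'   # NARROW NO-BREAK SPACE

    def french_spacing(text: str) -> str:
        """Insert U+202F before high punctuation and inside guillemets (simplified)."""
        # Any run of ordinary or no-break spaces (or nothing) before ; ! ? » becomes one NNBSP.
        text = re.sub(r'[ \u00A0]*([;!?»])', NNBSP + r'\1', text)
        # One NNBSP after an opening guillemet.
        text = re.sub(r'(«)[ \u00A0]*', r'\1' + NNBSP, text)
        return text

    print(french_spacing('Quel idiot!'))    # 'Quel idiot\u202f!'
    print(french_spacing('« Bonjour »'))    # '«\u202fBonjour\u202f»'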
They just haven?t got an appropriate keyboard layout yet, and that is ultimately my fault because in the nineties and later I didn?t care about computers and keyboard layouts. That may sound crazy but it isn?t really. French is needing so a peculiar keyboard layout to get its representation functional, useful and interoperable without slowing down typists, that numerous preconditions and time was needed to design it. Among the preconditions, Unicode did not have the needed non-breakable thin space when keyboarding was on in France. French typesetters were aware of the thin space needed with big punctuation marks (sometimes called tall or double punctuation). The style manual of the Imprimerie Nationale is unambiguous, and where it isn?t, its actual practice is to be followed. That leaves only the colon not with but with . I cannot post a scan or photo of the table at page 149, nor of the examples as they are typeset in the print book, because it?s copyrighted material, but you?re welcome to purchase your copy if you didn?t already. That guide is kind of quoted by the A.F. when it?s up to determine whether capital letters should be diacriticized or not. Philippe Verdy reported in 2015 on this List that in France, the colon too is widely typeset with , and that the Imprimerie Nationale conforms to the specs of its clients. > >> Aiming at extending the subset of environments supporting correct typesetting > > There are many fine programs, including TeX, for doing good > typesetting. Unicode is not about typesetting, it's about information > exchange and preservation. Yah and TeX is converting our code to Unicode, so that we have several formats to choose from when considering exchange and preservation. The point in having an interoperable digital representation of all natural languages is that normal people are not forced to use draft style when just writing their language on a computer. Best regards, Marcel From unicode at unicode.org Mon Jan 14 16:08:23 2019 From: unicode at unicode.org (Tex via Unicode) Date: Mon, 14 Jan 2019 14:08:23 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.n etcom.com> Message-ID: <001201d4ac55$ab716500$02542f00$@xencraft.com> Asmus, I agree 100%. Asking where is the harm was an actual question intended to surface problems. It wasn?t rhetoric for saying there is no harm. Also, it may not be obvious to social media, messaging platforms, that there is a possibility of a solution. Often when a problem exists for a long time, it fades into unconsciousness. The pain is accepted as that is the way it is and has to be. It becomes part of the culture. Asking if there is a pain and whether a solution would be welcomed is consciousness raising. I agree about leading standardization. I thought some legitimate needs were raised. 
The questions were designed to quantify the use case as well as the potential damage. I didn?t think anyone was recommending more math abuse. I thought it was raised as an example of people resorting to them as a solution for a need. Of course they are also an example of playful experimentation. Separately, Regarding messaging platforms, although twitter is one example in the social media space, today there are many business, commercial, and other applications that embed messaging capabilities for their communities and for servicing customers. I wouldn?t dismiss the need just based on twitter?s assessment or on the idea that social media is just for casual or ?fun? use. Clarity of communications can be significant for many organizations. Having the proposed capabilities in plain text rather than requiring all of the overhead of a more rich text solution could be a big win for these apps. tex From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag via Unicode Sent: Monday, January 14, 2019 1:21 PM To: unicode at unicode.org Subject: Re: A last missing link for interoperable representation On 1/14/2019 2:08 AM, Tex via Unicode wrote: Perhaps the question should be put to twitter, messaging apps, text-to-voice vendors, and others whether it will be useful or not. If the discussion continues I would like to see more of a cost/benefit analysis. Where is the harm? What will the benefit to user communities be? The "it does no harm" is never an argument "for" making a change. It's something of a necessary, but not a sufficient condition, in other words. More to the point, if there were platforms (like social media) that felt an urgent need to support styling without a markup language, and could articulate that need in terms of a proposal, then we would have something to discuss. (We might engage them in a discussion of the advisability of supporting "markdown", for example). Short of that, I'm extremely leery of "leading" standardization; that is, encoding things that "might" be used. As for the abuse of math alphabetics. That's happening whether we like it or not, but at this point represents playful experimentation by the exuberant fringe of Unicode users and certainly doesn't need any additional extensions. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 16:43:17 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 22:43:17 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> Message-ID: Hans ?berg wrote, > How about using U+0301 COMBINING ACUTE ACCENT: ??????????? Thought about using a combining accent.? 
Figured it would just display with a dotted circle but neglected to try it out first.? It actually renders perfectly here.? /That's/ good to know.? (smile) From unicode at unicode.org Mon Jan 14 16:58:24 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 14 Jan 2019 14:58:24 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode wrote: > The arguments against italics seem to be: > > ? Unicode is plain text. Italics is rich text. > > ? We haven't had it until now, so we don't need it. > > ? There are many rich text solutions, such as html. > > ? There are ways to indicate or simulate italics in plain text including using underscore or other characters, using characters that look italic (eg math), etc. > > ? Adding Italicization might break existing software > > ? The examples of existing Unicode characters that seem to represent rich text (emoji, interlinear annotation, et al) have justifications. There generally shouldn't be multiple ways of doing things. For example, if you think that searching for certain text in italics is important, then having both HTML italics and Unicode italics are going to cause searches to fail or succeed unexpectedly, unless the underlying software unifies the two systems (an extra complexity). Searching for certain italicized text could be done today in rich text applications, were there actual demand for it. > ? Plain text still has tremendous utility and rich text is not always an option. Where? Twitter has the option of doing rich text, as does any closed system. In fact, Twitter is rich text, in that it hyperlinks web addresses. That Twitter has chosen not to support italics is a choice. If users don't like this, they could go another system, or use third-party tools to transmit rich text over Twitter. The use of underscores or markings for italics would be mostly compatible with human twitterers using the normal interface. Source code is an example of plain text, and yet adding italics into comments would require but a trivial change to editors. If the user audience cared, it would have been done. In fact, I suspect there exist editors and environments where an HTML subset is put into comments and rendered by the editors; certainly active links would be more useful in source code comments than italics. Lastly, the places where I still find massive use of plain text are the places this would hurt the most. GNU Grep's manpage shows no sign that it supports searching under any form of Unicode normalization. Same with GNU Less. Adding italics would just make searching plain text documents more complex for their users. The domain name system would just add them to the ban list, and they'd be used for spoofing in filenames and other less controlled but still sensitive environments. 
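(The search point is easy to demonstrate: the mathematical alphanumerics carry compatibility decompositions, so a folding step such as NFKC maps them back to ordinary letters, while a literal byte-level search does not. A minimal Python sketch; it shows one possible folding, not what grep or less actually implement.)

    import unicodedata

    # "Plain" spelled with MATHEMATICAL ITALIC letters (U+1D443, U+1D459, ...).
    fancy = '\U0001D443\U0001D459\U0001D44E\U0001D456\U0001D45B'
    plain = 'Plain'

    print(plain in fancy)                                  # False: a literal search misses it
    print(plain in unicodedata.normalize('NFKC', fancy))   # True: compatibility folding matches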
-- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Mon Jan 14 17:02:49 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 15 Jan 2019 00:02:49 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> Message-ID: <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> > On 14 Jan 2019, at 23:43, James Kass via Unicode wrote: > > Hans ?berg wrote, > > > How about using U+0301 COMBINING ACUTE ACCENT: ??????????? > > Thought about using a combining accent. Figured it would just display with a dotted circle but neglected to try it out first. It actually renders perfectly here. /That's/ good to know. (smile) It is a bit off here. One can try math, too: the derivative of ??(??) is ???(??). From unicode at unicode.org Mon Jan 14 17:21:02 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 15:21:02 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: <4763e887-33c5-7a82-fb2f-3357791b61bc@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Jan 14 17:37:15 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 Jan 2019 23:37:15 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> Message-ID: <20190114233715.0a46eb16@JRWUBU2> On Tue, 15 Jan 2019 00:02:49 +0100 Hans ?berg via Unicode wrote: > > On 14 Jan 2019, at 23:43, James Kass via Unicode > > wrote: > > > > Hans ?berg wrote, > > > > > How about using U+0301 COMBINING ACUTE ACCENT: ??????????? > > > > Thought about using a combining accent. Figured it would just > > display with a dotted circle but neglected to try it out first. It > > actually renders perfectly here. /That's/ good to know. (smile) > > It is a bit off here. One can try math, too: the derivative of ??(??) > is ???(??). No it isn't. You should be using a spacing character for differentiation. On the other hand, one uses a combining circumflex for Fourier transforms. Richard. From unicode at unicode.org Mon Jan 14 18:02:05 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 16:02:05 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <20190114233715.0a46eb16@JRWUBU2> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> <20190114233715.0a46eb16@JRWUBU2> Message-ID: An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Jan 14 18:05:42 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 16:05:42 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> Message-ID: <4d354d16-3730-00f3-647c-e8c512bd4abf@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 18:17:00 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 16:17:00 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <001201d4ac55$ab716500$02542f00$@xencraft.com> References: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.n etcom.com> <001201d4ac55$ab716500$02542f00$@xencraft.com> Message-ID: <0158d32d-63f4-a120-d3a5-389f206c232c@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 19:09:08 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 Jan 2019 01:09:08 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> Message-ID: <20190115010908.1f20e000@JRWUBU2> On Mon, 14 Jan 2019 06:24:46 +0000 James Kass via Unicode wrote: > Unicode doesn't enforce any spelling or punctuation rules.? Unicode > doesn't tell human beings how to pronounce strings of text or how to > interpret them. These are not statements that are both honest and true. Unicode lays down rules and recommendations which others may then enforce. 
In Indic scripts where LETTER A is not also a consonant, Unicode forbids writing where LETTER AA would do the same job, and most renderers enforce that rule. Similarly, in phonetically ordered LTR scripts, one can't write a dependent vowel as the first character even if it is the leftmost character. There is a subtler rule about not spelling negative numbers with a hyphen-minus - if one does, one may suddenly find a line break just after what is being used as a negative sign. In scripts where Sanskrit grv and gvr may be rendered identically, Unicode tells us what the two code sequences are, and therefore indirectly what the range of pronunciations is for a given spelling. Now, sometimes the enforcers overstep the mark. For example, the USE tells us that when we write Northern Thai /p?ia?/ 'sound of a smack' which visually is , with denoting /ia/, we should write it ?????? . So much for phonetic order! Enforcement can be more subtle. TUS says that Farsi should use U+06CC ARABIC LETTER FARSI YEH instead of U+064A ARABIC LETTER YEH although they are identical in initial and medial positions. In this case, the enforcer will be the spell-checker. Richard. From unicode at unicode.org Mon Jan 14 19:18:24 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 Jan 2019 01:18:24 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> <20190114233715.0a46eb16@JRWUBU2> Message-ID: <20190115011824.670e04b6@JRWUBU2> On Mon, 14 Jan 2019 16:02:05 -0800 Asmus Freytag via Unicode wrote: > On 1/14/2019 3:37 PM, Richard Wordingham via Unicode wrote: > On Tue, 15 Jan 2019 00:02:49 +0100 > Hans ?berg via Unicode wrote: > > On 14 Jan 2019, at 23:43, James Kass via Unicode > wrote: > > Hans ?berg wrote, > > How about using U+0301 COMBINING ACUTE ACCENT: ??????????? > > Thought about using a combining accent. Figured it would just > display with a dotted circle but neglected to try it out first. It > actually renders perfectly here. /That's/ good to know. (smile) > > It is a bit off here. One can try math, too: the derivative of ??(??) > is ???(??). > > No it isn't. You should be using a spacing character for > differentiation. > > Sorry, but there may be different conventions. The dot / double-dot > above is definitely common usage in physics. > > A./ Apologies. It was positioned in the parenthesis, and it looked like a misplaced U+0301. Richard. From unicode at unicode.org Mon Jan 14 19:37:56 2019 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Mon, 14 Jan 2019 20:37:56 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> Message-ID: <42a41a15-43b6-9fbf-4cf9-ca385e377dcf@kli.org> On 1/14/19 4:45 AM, Martin J. D?rst via Unicode wrote: > Hello James, others, > > From the examples below, it looks like a feature request for Twitter > (and/or Facebook). Blaming the problem on Unicode doesn't seem to be > appropriate. I think what people here are doing is not blaming the problem on Unicode, but rather blaming the _solution_ on Unicode, for better or worse. ~mark From unicode at unicode.org Mon Jan 14 19:41:17 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 14 Jan 2019 20:41:17 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: On 1/14/19 5:08 AM, Tex via Unicode wrote: > > This thread has gone on for a bit and I question if there is any more > light that can be shed. > > BTW, I admit to liking Asmus definition for functions that span text > being a definition or criteria for rich text. > > Me too.? There are probably some exceptions or weird corner-cases, but it seems to be a really good encapsulation of the distinction which I had never seen before. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 19:48:45 2019 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Mon, 14 Jan 2019 20:48:45 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> Message-ID: On 1/14/19 4:21 PM, Asmus Freytag via Unicode wrote: > On 1/14/2019 2:08 AM, Tex via Unicode wrote: >> >> Perhaps the question should be put to twitter, messaging apps, >> text-to-voice vendors, and others whether it will be useful or not. >> >> If the discussion continues I would like to see more of a >> cost/benefit analysis. Where is the harm? What will the benefit to >> user communities be? >> > The "it does no harm" is never an argument "for" making a change. It's > something of a necessary, but not a sufficient condition, in other words. > > More to the point, if there were platforms (like social media) that > felt an urgent need to support styling without a markup language, and > could articulate that need in terms of a proposal, then we would have > something to discuss. (We might engage them in a discussion of the > advisability of supporting "markdown", for example). > > Short of that, I'm extremely leery of "leading" standardization; that > is, encoding things that "might" be used. > It is certainly true that Unicode should not be (and wasn't, before emoji) in the business of encoding things that "could be used", but rather, was for encoding things that *were* used.? This, naturally, poses a chicken-and-egg problem which has been complained about by several people in the past (including me).? Still, there are ways to show that things that haven't been encoded are still being "used", as people make shift to do what they can to use the script/notation, like using PUA or characters that aren't QUITE right, but close...? And in fairness, I'd have to say that the use of mathematical italics would count in that regard.? It's hard to dispute that there is a demand for it, just by looking at how people have been trying to do it!? So I'm starting to think this is not really "leading" standardization, but rather following up and, well, standardizing it, replacing ad-hoc attempts with a standard way to do things, just as Unicode is supposed to do. ~mark > As for the abuse of math alphabetics. That's happening whether we like > it or not, but at this point represents playful experimentation by the > exuberant fringe of Unicode users and certainly doesn't need any > additional extensions. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 19:56:37 2019 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Mon, 14 Jan 2019 20:56:37 -0500 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: <9b3c8e3f-5e20-2481-8d12-b7be716f177e@kli.org> In some of this discussion, I'm not sure what is being proposed or forbidden here... I don't know that anyone is advocating removing the "don't use these for words!" warning sticker on the mathematical italics.? The closest-to-sensible suggestions I've heard are things like a VS to italicize a letter, a combining italicizer so to speak (this is actually very similar to the emoji-style vs text-style VS sequences).? *If* the VS is ignored by searches, as apparently it should be and some have reported that it is, then VS-type solutions would NOT be a problem when it comes to searches (and don't go whining about legacy software.? If Unicode had to be backward-compatible with everything we wouldn't have gone beyond ASCII).? So I'm not sure what you mean when you speak of "Unicode italics".? Do you mean using the mathematical italics as we've been seeing?? Or having a whole new plane of italic characters for everything that could conceivably be italicized?? Those would probably both be mistakes, I agree. ~mark On 1/14/19 5:58 PM, David Starner via Unicode wrote: > On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode wrote: >> The arguments against italics seem to be: >> >> ? Unicode is plain text. Italics is rich text. >> >> ? We haven't had it until now, so we don't need it. >> >> ? There are many rich text solutions, such as html. >> >> ? There are ways to indicate or simulate italics in plain text including using underscore or other characters, using characters that look italic (eg math), etc. >> >> ? Adding Italicization might break existing software >> >> ? The examples of existing Unicode characters that seem to represent rich text (emoji, interlinear annotation, et al) have justifications. > There generally shouldn't be multiple ways of doing things. For > example, if you think that searching for certain text in italics is > important, then having both HTML italics and Unicode italics are going > to cause searches to fail or succeed unexpectedly, unless the > underlying software unifies the two systems (an extra complexity). > Searching for certain italicized text could be done today in rich text > applications, were there actual demand for it. > >> ? Plain text still has tremendous utility and rich text is not always an option. > Where? Twitter has the option of doing rich text, as does any closed > system. In fact, Twitter is rich text, in that it hyperlinks web > addresses. That Twitter has chosen not to support italics is a choice. > If users don't like this, they could go another system, or use > third-party tools to transmit rich text over Twitter. The use of > underscores or markings for italics would be mostly > compatible with human twitterers using the normal interface. 
> > Source code is an example of plain text, and yet adding italics into > comments would require but a trivial change to editors. If the user > audience cared, it would have been done. In fact, I suspect there > exist editors and environments where an HTML subset is put into > comments and rendered by the editors; certainly active links would be > more useful in source code comments than italics. > > Lastly, the places where I still find massive use of plain text are > the places this would hurt the most. GNU Grep's manpage shows no sign > that it supports searching under any form of Unicode normalization. > Same with GNU Less. Adding italics would just make searching plain > text documents more complex for their users. The domain name system > would just add them to the ban list, and they'd be used for spoofing > in filenames and other less controlled but still sensitive > environments. > From unicode at unicode.org Mon Jan 14 20:02:20 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 18:02:20 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: <0bcc5124-534c-6049-1854-3f51aa10db19@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 20:02:42 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 14 Jan 2019 21:02:42 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> Message-ID: On 1/13/19 10:00 PM, Martin J. D?rst via Unicode wrote: > On 2019/01/14 01:46, Julian Bradfield via Unicode wrote: >> On 2019-01-12, Richard Wordingham via Unicode wrote: >>> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) >>> And what happens when you capitalise a word for emphasis or to begin a >>> sentence? Is it no longer the same word? >> Indeed. As has been observed up-thread, the casing idea is a dumb one! >> We are, however, stuck with it because of legacy encoding transported >> into Unicode. We aren't stuck with encoding fonts into Unicode. > No, the casing idea isn't actually a dumb one. Well, for me, when I say or said that the "casing idea" is a dumb one, I don't mean how Unicode handled it.? 
Unicode is quite correct in encoding capitals distinctly from lowercase, both for computer-historical reasons and others you mention.? I think the idea of having case in alphabets _in the first place_ was a bad move.? It's a "mistake" that happened centuries ago. ~mark From unicode at unicode.org Mon Jan 14 20:07:48 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 14 Jan 2019 21:07:48 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> Message-ID: (sorry for multiple responses...) On 1/13/19 10:00 PM, Martin J. D?rst via Unicode wrote: > On 2019/01/14 01:46, Julian Bradfield via Unicode wrote: >> On 2019-01-12, Richard Wordingham via Unicode wrote: >>> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) >>> And what happens when you capitalise a word for emphasis or to begin a >>> sentence? Is it no longer the same word? >> Indeed. As has been observed up-thread, the casing idea is a dumb one! >> We are, however, stuck with it because of legacy encoding transported >> into Unicode. We aren't stuck with encoding fonts into Unicode. > No, the casing idea isn't actually a dumb one. As Asmus has shown, one > of the best ways to understand what Unicode does with respect to text > variants is that style works on spans of characters (words,...), and is > rich text, but thinks that work on single characters are handled in > plain text. Upper-case is definitely for most part a single-character > phenomenon (the recent Georgian MTAVRULI additions being the exception). Not just an exception, but an exception that proves the rule.? It's precisely because plain-text distinctions, generally speaking, should be at the letter level as Asmus says that there was so much shouting about MTAVRULI.? That these are exceptional demonstrates the existence of the rule. > But even most adults won't know the rules for what to italicize that > have been brought up in this thread. Even if they have read books that > use italic and bold in ways that have been brought up in this thread, > most readers won't be able to tell you what the rules are. That's left > to copy editors and similar specialist jobs. I don't think there's really a case to be made that italics are or should work the same as capitals, or that they are justified for the same reasons that capitals are justified.? And the use-cases show how people are using them: not necessarily for Chicago Manual of Style mandated purposes, but for emphasis of varying kinds. > There was a time when computers (and printers in particular) were > single-case. There was some discussion about having to abolish case > distinctions to adapt to computers, but fortunately, that wasn't necessary. 
Abolishing case I could see as a hassle, and we have become somewhat dependent on it for other things.? But it was a bad idea to start with. ~mark From unicode at unicode.org Tue Jan 15 00:31:02 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Tue, 15 Jan 2019 06:31:02 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> Message-ID: On 2019/01/15 10:48, Mark E. Shoulson via Unicode wrote: > On 1/14/19 4:21 PM, Asmus Freytag via Unicode wrote: >> Short of that, I'm extremely leery of "leading" standardization; that >> is, encoding things that "might" be used. >> > It is certainly true that Unicode should not be (and wasn't, before > emoji) Just to be precise, as already has been mentioned in this thread, the first batch of 'emoji' was in Unicode from the start (e.g. U+2603 SNOWMAN, there since Unicode 1.1), I think from Zapf Dingbats. The second batch came from Japanese phones. So for the first two batches of emoji, Unicode did not do any "leading" standardization. It was only after that, for later batches, where that happened. > in the business of encoding things that "could be used", but > rather, was for encoding things that *were* used.? This, naturally, > poses a chicken-and-egg problem which has been complained about by > several people in the past (including me).? Still, there are ways to > show that things that haven't been encoded are still being "used", as > people make shift to do what they can to use the script/notation, like > using PUA or characters that aren't QUITE right, but close...? And in > fairness, I'd have to say that the use of mathematical italics would > count in that regard.? It's hard to dispute that there is a demand for > it, just by looking at how people have been trying to do it! "a demand" doesn't quantify the demand at all. My guess is that given the overall volume of Twitter or Facebook communication, the percentage of Math italics (ab)use is really, really low. It's impossible to say that there's no demand, but use cases like "look, I found these characters, aren't they cute" in some corners of some social services is not the same as "we urgently need this, otherwise we can't communicate in our language". Regards, Martin. 
From unicode at unicode.org Tue Jan 15 00:46:31 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Tue, 15 Jan 2019 06:46:31 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: On 2019/01/15 07:58, David Starner via Unicode wrote: > On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode wrote: >> ? Plain text still has tremendous utility and rich text is not always an option. > > Where? Twitter has the option of doing rich text, as does any closed > system. In fact, Twitter is rich text, in that it hyperlinks web > addresses. That Twitter has chosen not to support italics is a choice. > If users don't like this, they could go another system, or use > third-party tools to transmit rich text over Twitter. The use of > underscores or markings for italics would be mostly > compatible with human twitterers using the normal interface. Yes indeed. Some similar services allow styling. One example is Slack, see e.g. https://get.slack.help/hc/en-us/articles/202288908-Format-your-messages. Markdown has been mentioned as an example of how some basic styling options (bold, italic,...) can be implemented. Another choice is using an user interface component (menu,...). The user then doesn't have to care about any 'weird' conventions, even the simplest ones, nor about what happens in the background (most probably HTML), and already is familiar with it from other applications. As for implementation complexity, it's not trivial, but there are quite a lot of components available, in particular for Web technology. It's not rocket science. Actually, in some cases, it is even difficult to get rid of styling on the Web. I recently wanted to print out a map of how to get to a restaurant for a party. The restaurant's Web site was all black background. I copied the address to Google Maps and then tried to print it. Google Maps insists that the first page is just information about the location, so I copied the name of the restaurant from the Web page. What happened was that it still had the black background. So copy-paste on your average Web browser these days doesn't lose styles, even in cases where that would be desirable (because more legible). So rich text technology is already way ahead when it comes to styled text. Do we want to encode background-color variant selectors in Unicode? If yes, how many? [Hint: The last two questions are rhetorical.] Regards, Martin. 
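(As a toy illustration of that "conventions layered over plain text" approach, assuming Python and a made-up two-rule dialect rather than Slack's or any real Markdown syntax: the styling lives in the application's renderer, while the stored text remains ordinary characters.)

    import re

    def render(text: str) -> str:
        """Render *bold* and _italic_ conventions to HTML (toy dialect only)."""
        text = re.sub(r'\*(.+?)\*', r'<b>\1</b>', text)
        text = re.sub(r'_(.+?)_', r'<i>\1</i>', text)
        return text

    print(render('This is *important* and this is _emphasis_.'))
    # -> This is <b>important</b> and this is <i>emphasis</i>.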
From unicode at unicode.org Tue Jan 15 01:07:25 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 14 Jan 2019 23:07:25 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <9b3c8e3f-5e20-2481-8d12-b7be716f177e@kli.org> References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <9b3c8e3f-5e20-2481-8d12-b7be716f177e@kli.org> Message-ID: On Mon, Jan 14, 2019 at 5:58 PM Mark E. Shoulson via Unicode wrote: > *If* the VS is ignored by searches, as apparently it should be and some > have reported that it is, then VS-type solutions would NOT be a problem > when it comes to searches Who is using VS-type solutions? I could not enter except for manually using some sort of \u notations. Languages that need special input support can easily adapt to unusual rules, but English Unicode is weirdly hard to enter, because the QWERTY keyboard is ubiquitous and standard. Smart quotes, non-HYPHEN-MINUS hyphens and dashes, and accents generally need memorizing of obscure entry methods or resort to a character list. Without great support from vendors, a new Unicode italic system only going to be used by the same people who currently use mathematical italics. > (and don't go whining about legacy software. > If Unicode had to be backward-compatible with everything we wouldn't > have gone beyond ASCII). Then where's this plain text that absolutely needs italics? Those legacy software systems are the place where unadorned plain text still lives. Anything on the Web is inherently dealing with rich text. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Tue Jan 15 01:31:21 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 15 Jan 2019 08:31:21 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <0158d32d-63f4-a120-d3a5-389f206c232c@ix.netcom.com> References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.n etcom.com> <001201d4ac55$ab716500$02542f00$@xencraft.com> <0158d32d-63f4-a120-d3a5-389f206c232c@ix.netcom.com> Message-ID: On 15/01/2019 01:17, Asmus Freytag via Unicode wrote: > On 1/14/2019 2:08 PM, Tex via Unicode wrote: >> >> Asmus, >> >> I agree 100%. Asking where is the harm was an actual question intended to surface problems. It wasn?t rhetoric for saying there is no harm. >> > The harm comes when this is imported into rich text environments (like this e-mail inbox). Here, the math abuse and the styled text run may look the same, but I cannot search for things based on what I see. 
I see an English or French word, type it in the search box and it won't be found. I call that 'stealth' text. > > The answer is not necessarily in folding the two, because one of the reasons for having math alphabetics is so you can search for a variable "a" of? certain kind without getting hits on every "a" in the text. Destroying that functionality in an attempt to "solve" the problems created by the alternate facsimile of styled text is also "harm" in some way. > That may end up in a feature request for webmails and e-mail clients, where the user should be given the ability to toggle between what I?d call a ?Bing search mode? and a ?Google search mode.? Google Search has extended equivalence classes that enable it to handle math alphabets like plain ASCII runs, i.e. we may type a search in ASCII and Google finds instances where the text is typeset ?abusing? math alphabets. On the other hand, Bing Search does not have such extended equivalence classes, and brings up variables as they are styled when searching correspondingly. I won?t blame Google of doing ?harm?, and I?d like to position rather on Google?s side as it seems to meet the expectations of a larger part of end-user communities. I won?t blame Microsoft neither, I?m just noting a dividing line between the two vendors about handling math alphabets. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 02:09:00 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 15 Jan 2019 09:09:00 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <0bcc5124-534c-6049-1854-3f51aa10db19@ix.netcom.com> References: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0bcc5124-534c-6049-1854-3f51aa10db19@ix.netcom.com> Message-ID: <8eb1d719-462a-b75c-c84c-2b737213c210@orange.fr> On 15/01/2019 03:02, Asmus Freytag via Unicode wrote: > On 1/14/2019 5:41 PM, Mark E. Shoulson via Unicode wrote: >> On 1/14/19 5:08 AM, Tex via Unicode wrote: >>> >>> This thread has gone on for a bit and I question if there is any more light that can be shed. >>> >>> BTW, I admit to liking Asmus definition for functions that span text being a definition or criteria for rich text. >>> >>> >> Me too.? There are probably some exceptions or weird corner-cases, but it seems to be a really good encapsulation of the distinction which I had never seen before. >> > ** blush ** > > A./ > > I did like it too, and I was really amazed that the issue could be boiled down to such a handy shibboleth. It wasn?t until I?m looking harder that I can?t help any more seeing it as a mere rewording of current practice. That is, if we?re using markup (that typically acts on spans and other elements), it?s rich text; if we?re using characters, it?s plain text. The reason why I changed my mind is that the new shibboleth can be misused to relegate to the realm of rich text some feature of a writing system, like using superscript as ordinal indicators (English "3??", French "2?" [order] or "2??" 
[rank], Italian "1?" or ? in Latin-1 ? "1?", the latter being used in German as a narrow form of "prima" that has special semantics there ["top quality" or "great!"]), only on the basis that it is currently emulated using rich text by declaring that "?" is?or ?should? be?a span with superscript markup, so that we end up with "2e". As I?ve (too) slightly pointed in a previous reply, that is not what we should end up with. Abbreviation indicators in Latin script are a case of a single character solution, albeit multiple characters may be involved in a single instance. We can also have inner uppercase, aka camelcase, that cannot be handled by the titlecase attribute. We?re clearly in the realm of plain text, and any other solution may be called an emulation, or a legacy workaround, but not a Unicode conformant interoperable representation. Also, please note the presence in Unicode, of U+070F SYRIAC ABBREVIATION MARK, a format control? Probably there are also some other format controls in other scripts, performing likely the same job. Remember when a similar solution was suggested for Latin script on this List? Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 03:24:03 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 15 Jan 2019 10:24:03 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: Le lun. 14 janv. 2019 ? 20:25, Marcel Schneider via Unicode < unicode at unicode.org> a ?crit : > On 14/01/2019 06:08, James Kass via Unicode wrote: > > > > Marcel Schneider wrote, > > > >> There is a crazy typeface out there, misleadingly called 'Courier > >> New', as if the foundry didn?t anticipate that at some point it > >> would be better called "Courier Obsolete". ... > > > > ?????? ?????????????? seems a bit ????????? nowadays, as well. > > > > (Had to use mark-up for that ?span? of a single letter in order to > > indicate the proper letter form. But the plain-text display looks > > crazy with that HTML jive in it.) > > > > I apologize for seeming to question the font name ?????? ???? while > targeting only > the fact that this typeface is not updated to support the . It just > looks like the grand name is now misused to make people believe that if > **this** great font is unsupporting , it has a good reason to do so, > and we should keep people off using that ?exotic whitespace? otherwise than > ?intended,? ie for Mongolian. Since fortunately TUS started backing its use > in French (2014) > This is not for Mongolian and French wanted this space since long and it has a use even in English since centuries for fine typography. So no, NNBSP is definitely NOT "exotic whitespace". 
It's just that it was forgotten in the early stages of computing with legacy 8-bit encodings, but it should have been in Unicode since the beginning, as its existence is proven long before the computing age (before ASCII, or even before Baudot and telegraphic systems). It has always been used by typographers; it has centuries of tradition in publishing. And it has always been recommended, and still is today, for French by all book and paper publishers.
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From unicode at unicode.org Tue Jan 15 03:30:34 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 15 Jan 2019 10:30:34 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <20190115011824.670e04b6@JRWUBU2> References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> <20190114233715.0a46eb16@JRWUBU2> <20190115011824.670e04b6@JRWUBU2> Message-ID:
> On 15 Jan 2019, at 02:18, Richard Wordingham via Unicode wrote:
> > On Mon, 14 Jan 2019 16:02:05 -0800 > Asmus Freytag via Unicode wrote:
> >> On 1/14/2019 3:37 PM, Richard Wordingham via Unicode wrote: >> On Tue, 15 Jan 2019 00:02:49 +0100 >> Hans Åberg via Unicode wrote:
>> >> On 14 Jan 2019, at 23:43, James Kass via Unicode >> wrote:
>> >> Hans Åberg wrote,
>> >> How about using U+0301 COMBINING ACUTE ACCENT: ???????????
>> >> Thought about using a combining accent. Figured it would just >> display with a dotted circle but neglected to try it out first. It >> actually renders perfectly here. /That's/ good to know. (smile)
>> >> It is a bit off here. One can try math, too: the derivative of 𝑥(𝑡) >> is 𝑥̇(𝑡).
>> >> No it isn't. You should be using a spacing character for >> differentiation.
>> >> Sorry, but there may be different conventions. The dot / double-dot >> above is definitely common usage in physics. Also in differential geometry, as for curves.
>> A./
> > Apologies. It was positioned in the parenthesis, and it looked like a > misplaced U+0301.
In MacOS, one can drop the combined character into the character table, and see that it is U+0307 COMBINING DOT ABOVE. It comes out right when typeset in ConTeXt.
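For anyone who wants to identify which combining mark a pasted combined character actually contains without reaching for a character table, here is a minimal Python sketch; the sample string is purely illustrative:

    # Minimal sketch: list the code points inside a combined character, so a
    # pasted form like "x" plus a combining dot can be identified as U+0307.
    import unicodedata

    sample = "x\u0307(t)"   # illustrative only: x with COMBINING DOT ABOVE
    for ch in sample:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")
    # U+0078  LATIN SMALL LETTER X
    # U+0307  COMBINING DOT ABOVE
    # ... and so on for the parentheses and the t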
From unicode at unicode.org Tue Jan 15 04:04:33 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Tue, 15 Jan 2019 10:04:33 +0000 (GMT) Subject: A last missing link for interoperable representation References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: On 2019-01-15, Philippe Verdy via Unicode wrote: > This is not for Mongolian and French wanted this space since long and it > has a use even in English since centuries for fine typography. > So no, NNBSP is definitely NOT "exotic whitespace". It's just that it was > forgotten in the early stages of computing with legacy 8-bit encodings but > it should have been in Unicode since the begining as its existence is > proven long before the computing age (before ASCII, or even before Baudot > and telegraphic systems). It has alsway been used by typographs, it has > centuries of tradition in publishing. And it has always been recommended > and still today for French for all books/papers publishers. Do you expect people to encode all the variable justification spaces between words by combining all the (numerous) spaces already available in Unicode? And how about the kerning between letters? If spacing of punctuation is to be encoded instead of left to display algorithms, shouldn't you also encode the kerns instead of leaving them to the font display technology? Oh, and what about dropped initials? They have been used in both manuscripts and typography for many centuries - surely we must encode them? -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Tue Jan 15 05:24:44 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 15 Jan 2019 12:24:44 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: On 15/01/2019 10:24, Philippe Verdy via Unicode wrote: > > Le?lun. 14 janv. 2019 ??20:25, Marcel Schneider via Unicode > a ?crit?: > > On 14/01/2019 06:08, James Kass via Unicode wrote: > > > > Marcel Schneider wrote, > > > >> There is a crazy typeface out there, misleadingly called 'Courier > >> New', as if the foundry didn?t anticipate that at some point it > >> would be better called "Courier Obsolete". ... > > > > ?????? ?????????????? seems a bit ????????? nowadays, as well. 
> > > > (Had to use mark-up for that ?span? of a single letter in order to > > indicate the proper letter form.? But the plain-text display looks > > crazy with that HTML jive in it.) > > > > I apologize for seeming to question the font name ?????? ???? while targeting only > the fact that this typeface is not updated to support the . It just > looks like the grand name is now misused to make people believe that if > **this** great font is unsupporting , it has a good reason to do so, > and we should keep people off using that ?exotic whitespace? otherwise than > ?intended,? ie for Mongolian. Since fortunately TUS started backing its use > in French (2014) > > > This is not for Mongolian and French wanted this space since long and it has a use even in English since centuries for fine typography. > So no, NNBSP is definitely NOT "exotic whitespace". It's just that it was forgotten in the early stages of computing with legacy 8-bit encodings but it should have been in Unicode since the begining as its existence is proven long before the computing age (before ASCII, or even before Baudot and telegraphic systems). It has alsway been used by typographs, it has centuries of tradition in publishing. And it has always been recommended and still today for French for all books/papers publishers. Many thanks for bringing this to the point. So the case is even worse as Unicode deliberately skipped the non-breakable thin space while thinking at encoding the whole range of other typographic spaces, even with duplicate encoding of en and em spaces, and not forgetting those old-fashioned tabular spaces and dash: figure space and dash, and punctuation space. In this particular context and with all that historic practice background, what else than malice (supposedly inspired by an unlawful and exuberant DTP vendor) could drive people not to define the line-breaking property value of U+2008 PUNCTUATION SPACE as "GL", while they did define it so for U+2007 FIGURE SPACE. Here is also the still outdated wording of UAX?#14 wrt NNBSP, Mongolian and French: [?] NARROW NO-BREAK SPACE is used in Mongolian. The MONGOLIAN VOWEL SEPARATOR acts like a NARROW NO-BREAK SPACE in its line breaking behavior. It additionally affects the shaping of certain vowel characters as described in/Section 13.5, Mongolian/, of [Unicode ]. NARROW NO-BREAK SPACE is a narrow version of NO-BREAK SPACE, which has exactly the same line breaking behavior, but with a narrow display width. It is regularly used in Mongolian in certain grammatical contexts (before a particle), where it also influences the shaping of the glyphs for the particle. In Mongolian text, the NARROW NO-BREAK SPACE is typically displayed with one third the width of a normal space character. When NARROW NO-BREAK SPACE occurs in French text, it should be interpreted as an ?espace fine ins?cable?. ?When [?] it should be interpreted as [?]? is a pure insult. NARROW NO-BREAK SPACE *is* exactly at least the French "espace fine ins?cable" *and* the Mongolian whatever-it-is-called-in-Mongolian *and* the group separator, aka triad separator, in *all* locales following the SI and ISO recommendation to group digits with spaces, not with any punctuation. 
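To make that recommendation concrete, here is a minimal Python sketch that groups digits with U+202F NARROW NO-BREAK SPACE; the helper name is just an example, not an established API:

    # Minimal sketch: format an integer with NARROW NO-BREAK SPACE (U+202F)
    # as the group separator, per the SI/ISO practice of grouping with spaces.
    def group_digits_si(n: int) -> str:
        return f"{n:,}".replace(",", "\u202F")

    print(group_digits_si(299792458))   # 299 792 458, separated by U+202F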
As hopefully that misleading section will be edited, here's the link to the quoted version: https://www.unicode.org/reports/tr14/tr14-41.html#DescriptionOfProperties

Also I'd like, or rather I need, to kindly ask the knowledgeable List Members to correct the following statement *if* it is wrong: If the Unicode Standard had been set up in an unbiased way, U+2008 PUNCTUATION SPACE would have been given the line break property value "GL". Perhaps the following would also be true: If the Unicode Standard had been set up in an unbiased way, there would be a NARROW NO-BREAK SPACE encoded in the range U+2000..U+200F.

Thanks in advance to Philippe Verdy and any other knowledgeable List Members for staying or getting in touch and keeping on posting feedback. I don't edit the subject line, nor do I spin off a new thread, given that when I launched this one I sincerely believed that the issues with NARROW NO-BREAK SPACE and with preformatted superscript abbreviation indicators for the interoperable representation of French and numerous other languages (part of which use not only the former as group separator, but also the latter as ordinal indicators) were about to be definitively settled. Turns out they're not. Hopefully as this thread goes on, the sometimes extremely aggressive anti-NNBSP lobbying (and also the more lenient anti-preformatted-superscript lobbying) will come to an end, freeing the way to the real Unicode interoperable digital representation of all of the world's languages.

Best regards, Marcel
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From unicode at unicode.org Tue Jan 15 06:25:06 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 15 Jan 2019 13:25:06 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID:
Note that even if this NNBSP character is not mapped in a font, it should be rendered correctly with all modern renderers. The mapping is necessary only when a font design wants to tune its metrics, because its width varies between 1/8 and 1/6 em: the narrow space is a bit narrower in traditional English typography than in French, so typical English designs set it at about 1/8 em, typical French designs set it at 1/6 em, and neutral fonts may set it somewhere in the middle. The measure in em may however vary with some fonts, notably those using "narrow" or "wide" letters by default (because the font size in em indicates only its height), and in decorated/cursive styles (e.g. fonts with swashes need a higher line gap, so the em-square design may be smaller than for modern simplified styles for display).
But a renderer should have no problem using a default metric for all whitespace characters, which actually don't need any glyph to be drawn: all that is needed is metrics; everything else, including character properties like breaking, is inferred by the renderer independently of the font and of other per-language tuning, or is controlled by styling effects applied on top of the font.

A renderer may expand the kerning/approach if needed, for example to generate "hollow" or "shadow" effects, or to generate synthetic weights, including with "variable" fonts support. Typically the renderer will base the metrics of all missing/unmapped whitespaces on the metrics given to the normal SPACE or NBSP, which are typically both mapped to the same glyph; NNBSP can be synthesized easily using half the advance width of SPACE, and that is fine; renderers can also synthesize all other whitespaces for ideographic usages, or will adapt the rendering if instructed to synthesize a monospaced variant: here there is a choice for NNBSP to be rendered like NBSP (typically for French, as it is normally a bit wider), as a zero-width space (like in English), or contextually, for example zero-width near punctuation and like NBSP between letters/digits.

Fonts only specify defaults that alter the rendering produced by a renderer, but a renderer is not required to use all the information and all the glyphs in a specific font; it has to adapt to the context and choose what is most relevant and which kind of data it recognizes and implements/uses at runtime. The font just provides the best settings according to the font designer, if all features are enabled, but most work is done by the renderer (and fonts are completely unaware of the actual encoding of documents; fonts are only a database containing multiple features/settings, all of them being optional and selectable individually).

If a font behaves incorrectly on your system because it does not map any glyph for NNBSP, don't blame the font or Unicode for this problem; blame the renderer (or the application or OS using it: maybe they are very outdated and were not aware of these features; they are probably based on old versions of Unicode when NNBSP was still not present, even though it had been requested for a very long time, at least for French and even English, before Unicode, and long before Mongolian was encoded, only in Unicode and not in any known supported legacy charset: Mongolian was specified by borrowing the same NNBSP already designed for Latin, because the Mongolian space had no known specific behavior; the encoded whitespaces in Unicode are completely script-neutral, they are generic, and are even BiDi-neutral, all usable with any script).
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From unicode at unicode.org Tue Jan 15 05:32:34 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 15 Jan 2019 11:32:34 +0000 (GMT) Subject: A last missing link for interoperable representation Message-ID: <62f33173.a54b.1685148d5b6.Webtop.229@btinternet.com>
Martin J. Dürst wrote:
> So rich text technology is already way ahead when it comes to styled > text. Do we want to encode background-color variant selectors in > Unicode? If yes, how many?
Yes. You would only need one.
Background colour was a feature of teletext in the United Kingdom from 1976. It was very effective in its application.
In teletext, there were seven choices of foreground colour (red, green, yellow, blue, magenta, cyan, white), the default background was black. The New Background control character caused the background colour to become the same as the current foreground colour in which text was being displayed. One could then change the foreground colour. There was also a Black Background control code. This was necessary because neither text nor graphics could be black in teletext. In teletext those control codes were stateful and applied until a change or to the end of the line of text, whichever came first. So, given that Unicode is starting to encode colour choices for emoji and black is in the set of colours - and that might possibly extend to choosing colour for text - if Unicode were to encode CHANGE BACKGROUND COLOUR then the background colour could become the current foreground colour, even if that chosen foreground colour had just been selected and not actually used to colour text. The implementation in Unicode need not be stateful. > [Hint: The last two questions are rhetorical.] Maybe that was the intention, but the questions were asked and the concept is an interesting possibility for implementation. William Overington Tuesday 15 January 2019 From unicode at unicode.org Tue Jan 15 07:08:39 2019 From: unicode at unicode.org (Victor Gaultney via Unicode) Date: Tue, 15 Jan 2019 13:08:39 +0000 Subject: Encoding italic (was: A last missing link) Message-ID: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> I've been alerted to this thread by a friend, so just rejoined in order to respond. I'm currently doing research into italics. Some of the confusion and disagreement about italics centers around whether it is typographic markup or textual content. Both historically and currently italics can be used for either, but can clearly change the meaning of a word or phrase*.? It also has a different semantic meaning than bold.** It is not just rich text, nor parallel to casing. It works differently, and most like the use of matching punctuation (parentheses, brackets, quotation marks). Italics are sometimes used to indicate stress, although that is only one use. Stress is like a phonetic sound. It is represented in writing systems in different ways. However a writing system text encoding standard relates to the visual symbols and the rules of their behaviour rather than to the sound itself. Italicised text is visually different, and that difference can have a variety of meanings. It would make sense for Unicode to encode the visual difference that marks those meanings (such as stress), just as it does with punctuation. Quotation marks, for example, are visually represented in different ways depending on the language, but Unicode does have characters that are use to indicate that 'this is a quote'. So it makes no sense for Unicode to encode 'stress' as a character, but it *may* make theoretical sense to encode 'italic begin' and 'italic end' characters, just as we do parentheses, brackets, quotation marks, etc. This would allow for the use of italic in non-styled environments (text messages, social media, etc.). BTW - encoding the begin/end of italic would be very different from HTML semantic tags that attempt to encode meaning. Like punctuation, it only encodes the visual distinction, not the meaning. Use of variation selectors, a single character modifier, or combining characters also seem to be less useful options, as they act at the individual character level and are highly impractical. 
They also violate the key concept that italics are a way of marking a span of text as 'special' - not individual letters. Matched punctuation works the same way and is a good fit for italic. Although italic is a deeply Latin script concept, people do want to apply it to non-latin text (with sometimes limited sense and success). Encoding two punctuation characters would allow use across scripts, in the same way that quotation marks are sometimes used. My current research in italic won't get published publicly until 2020, however I gave a talk at ATypI Montreal about the nature of italic (https://www.youtube.com/watch?v=4vlFxed22Sg). I have an unpublished paper on italic but can't share it publicly (due to image rights). Contact me if you would like to see a private copy. Victor Gaultney * David Crystal's famous example is that these two sentences mean different things: 'I've lost my red slippers' and 'I've lost my /red/ slippers' (as opposed to my blue ones). Crystal, David. 1994. The Cambridge encyclopedia of language (Cambridge University Press), p13-14. ** Vachek, Josef, and Philip A Luelsdorff. 1989. Written language revisited (Amsterdam: Benjamins), p45-48. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 09:48:15 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 15 Jan 2019 15:48:15 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <664185367.230185.1547566799485.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <1239322872.230125.1547566618695.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <664185367.230185.1547566799485.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <2cd96c5f.8020.1685232ee39.Webtop.232@btinternet.com> Hi You are the gentleman who kindly made the Gentium typeface open source. Thank you for your generous gift to the world. > Use of variation selectors, a single character modifier, or combining characters also seem to be less useful options, as they act at the individual character level and are highly impractical. They also violate the key concept that italics are a way of marking a span of text as 'special' - not individual letters. Matched punctuation works the same way and is a good fit for italic. Italics works differently from matched punctuation marks in that with italics there is a change to each glyph whereas with matched punctuation there is no change to the glyphs between the matched punctuation marks. That difference leads to the significant difficult that there are thus two competing forces here. One of those forces is what you have stated about the nature of italics. The other of those forces is that Unicode is not stateful. Years ago I encoded some Private Use Area codes for such features as italics, with a start character and an end character to surround a span of text that would then be rendered in italics. As a result of discussion and advice I learned that such characters are not acceptable for encoding into regular Unicode because the effect would be stateful. So yes, the method that I suggested and for which James Kass suggested an enhancement is peculiar when viewed against the theory of the way that italics are used, but neither the method nor the enhanced method is stateful and that is an important feature of them. 
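Before describing such a tool, here is a minimal Python sketch of that non-stateful convention, assuming VS14 (U+FE0D) as the italicizing selector; the function names are merely illustrative:

    # Minimal sketch of the per-character VS14 convention discussed here:
    # insert U+FE0D after each non-space character of a span, or strip it out.
    VS14 = "\uFE0D"

    def italicize_span(text: str) -> str:
        return "".join(ch if ch.isspace() else ch + VS14 for ch in text)

    def strip_vs14(text: str) -> str:
        return text.replace(VS14, "")

    marked = italicize_span("very important")
    print(strip_vs14(marked) == "very important")   # True: fully reversible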
Now it would be possible for a software application program to have a feature for composing plain text where a span of text may be highlighted by a user of the software application program and every character (except perhaps spaces?) within that span of text has, at the click of a button, a VS14 character inserted after it. I remember that when handsetting metal type the same space sorts were used with italics as with roman. There could also be a button that could remove all VS14 characters, if any, from within a highlighted span of text. So, for someone typesetting plain text and viewing plain text the effect could look to be in accordance with how you consider italics should be encoded, though for plain text interchange the encoding would still be by using a VS14 character after each character that one wishes to become displayed italicized. William Overington Tuesday 15 January 2019 From unicode at unicode.org Tue Jan 15 12:22:03 2019 From: unicode at unicode.org (Johannes Bergerhausen via Unicode) Date: Tue, 15 Jan 2019 19:22:03 +0100 Subject: wws dot org Message-ID: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Dear list, I am happy to report that www.worldswritingsystems.org is now online. The web site is a joint venture by ? Institut Designlabor Gutenberg (IDG), Mainz, Germany, ? Atelier National de Recherche Typographique (ANRT), Nancy, France and ? Script Encoding Initiative (SEI), Berkeley, USA. For every known script, we researched and designed a reference glyph. You can sort these 292 scripts by Time, Region, Name, Unicode version and Status. Exactly half of them (146) are already encoded in Unicode. Here you can find more about the project: www.youtube.com/watch?v=CHh2Ww_bdyQ And is a link to see the poster: https://shop.designinmainz.de/produkt/the-worlds-writing-systems-poster/ All the best, Johannes ? Prof. Bergerhausen Hochschule Mainz, School of Design, Germany www.designinmainz.de www.decodeunicode.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 15:40:21 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 15 Jan 2019 21:40:21 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> Message-ID: <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Although there probably isn't really any concerted effort to "keep plain-text mediocre", it can sometimes seem that way. As we've been told repeatedly, just because something has been done over and over again doesn't mean that there's a precedent for it. Using spans of text as a general indicator of rich-text seems reasonable at first blush.? But selected spans can also be copy/pasted (relocated), which is not stylistic at all.? Spans of text can be selected to apply casing, which is often seen as non-stylistic.? In applications such as BabelPad, spans of text can be converted to-and-from various forms of Unicode references and encodings.? Spans of text can be transliterated, moved, or deleted. In short, selecting a span of text only means that the user is going to apply some kind of process to that span. Avant-garde enthusiasts are on the leading edge by definition. That's why they're known as trend setters.? Unicode exists because forward-looking people envisioned it and worked to make it happen. Regardless of one's perception of exuberance, Unicode turned out to be so much more than a fringe benefit. 
From unicode at unicode.org Tue Jan 15 16:16:16 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 15 Jan 2019 23:16:16 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: On 15/01/2019 13:25, Philippe Verdy via Unicode wrote: > > Note that even if this NNBSP character is not mapped in a font, it > should be rendered correctly with all modern renderers (the mapping > is necessary only when a font design wants to tune its metrics, > because its width varies between 1/8 and 1/6 em (the narrow space is > a bit narrower in traditional English typography than in French, so > typical English design set it at about 1/8 em, typical French design > set it at 1/6 em, and neutral fonts may set it somewhere in the > middle); the measure in em may however vary with some fonts (notably > those using "narrow" or "wide" letters by default (because the font > size in em indicates only its height) and in decorated/cursive styles > (e.g. fonts with swashes need a higher line gap, the font design of > the em size may be smaller than for modern simplified styles for > display). > > But a renderer should have no problem using a default metric for all > whitespace characters, that actually don't need any glyph to be > drawn: All what is needed is metrics, everything else, inclusing > character properties like breaking are infered by the renderer > independantly of the font and other per-language tuning, or controled > by styling effects applied on top of the font Indeed, since every Unicode implementation must rely on the character properties, and given keeping this library up-to-date is straightforward and easy, there is really no point in displaying a .notdef box in lieu of whatever whitespace. As a consequence, prior to assessing the impact of the group separator migration from (wrong) to (correct) on implementations and interoperability, Unicode would be well advised to start assessing the impact of implementations (and, of course, the backing vendors) on correct rendering of , and on the related usability and interoperability of the digital representation of those many locales that should rely on . 
> > A renderer may expand the kerning/approach if needed for example to > generate "hollow" or "shadow" effects, or to generate synthetic > weights, including with "variable" fonts support, typically the > renderer will base the metrics of all missing/unmapped whitespaces > on the metrics given to the normal SPACE or NBSP which are > typically both mapped to the same glyph; NNBSP will be synthesized > easily using half the advance width of SPACE, and it's fine; > renderers can also synthesize all other whitespaces for ideographic > usages, or will adapt the rendering if instructed to synthesize a > monospaced variant: here there's a choice for NNBSP to be rendered > like NBSP, typically for French as it is normally a bit wider, or as > a zero-width space like in English, or contextually for example > zero-width near punctuation or NBSP between letters/digits.

In a monospaced font, NNBSP normally has the width of a character cell, but it has been designed for proportional fonts, and there it must not have the width of a digit, as that would annihilate the required effect. The group separator must never have the width of a full digit, not even of digit 1 in variable-width digits, but just a slight gap ensuring correct readability, by the way also after the decimal separator as per ISO 80000. Between punctuation, NNBSP mustn't be zero-width, as it is used in English to separate closing single and double quotation marks when a nested quotation ends the first-level quotation. I don't think that English uses NNBSP elsewhere around punctuation, except with dashes if appropriate according to the applied style manual, but Canadian French does, contrary to an urban legend saying it doesn't. It only prefers not to space off punctuation *if* NNBSP is unavailable. That is another proof of the inappropriateness of the NBSP for the purpose of spacing off tall punctuation marks.

> > Fonts only specify defaults that alter the rendering produced by a > renderer, but a renderer is not required to use all infos and all > glyphs in a specific font, it has to adapt to the context and choose > what is more relevant and which kind of data it recognizes and > implements/uses at runtime. The font just provides the best settings > according to the font designer, if all features are enabled, but most > work is done by the renderer (and fonts are completely unaware of > the actual encoding of documents, fonts are only a database > containing multiple features/settings, all of them being optional > and selectable individually).

Good point, indeed. Currently we are too much concerned with fonts, while actually it's all up to the renderer. Today, as most devices are permanently connected to the internet, a decent rendering engine could as well grab missing glyphs from an online repository, at Google Fonts or at the application vendor's website. All that missing-glyph whining seems completely outdated and very detrimental to the user experience. It is so anachronistic that people shouldn't be surprised about suspicions of intentional bugs for the purpose of unlawful lobbying by messing up the user experience outside of certain DTP applications. The French locale is the most heavily impacted victim of those operating modes.
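Whether a given font actually maps U+202F, and what advance width a renderer might fall back to when it does not, can be checked with a few lines of fontTools; this is only a sketch, and the font file name is an assumption:

    # Minimal sketch: see whether a font maps U+202F NARROW NO-BREAK SPACE
    # and, if not, derive a plausible fallback advance, as a renderer might.
    from fontTools.ttLib import TTFont

    NNBSP = 0x202F
    font = TTFont("DejaVuSans.ttf")        # assumed local file name
    cmap = font.getBestCmap()              # code point -> glyph name
    upem = font["head"].unitsPerEm

    if NNBSP in cmap:
        adv = font["hmtx"][cmap[NNBSP]][0]
        print(f"NNBSP mapped, advance = {adv}/{upem} em units")
    else:
        space_adv = font["hmtx"][cmap[0x0020]][0]
        fallback = min(space_adv // 2, upem // 6)   # half a space, at most 1/6 em
        print(f"NNBSP unmapped, synthesized advance = {fallback}/{upem} em units")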
> > If your fonts behave incorrectly on your system because it does not > map any glyph for NNBSP, don't blame the font or Unicode about this > problem, blame the renderer (or the application or OS using it, may > be they are very outdated and were not aware of these features, theyt > are probably based on old versions of Unicode when NNBSP was still > not present even if it was requested since very long at least for > French and even English, before even Unicode, and long before > Mongolian was then encoded, only in Unicode and not in any known > supported legacy charset: Mongolian was specified by borrowing the > same NNBSP already designed for Latin, because the Mongolian space > had no known specific behavior: the encoded whitespaces in Unicode > are compeltely script-neutral, they are generic, and are even > BiDi-neutral, they are all usable with any script). > Completely agreed. If I blame Unicode it?s for keeping the NNBSP off the Standard during almost a decade, which translates to two decades of delay due to the loss of dynamics past the early rush, and to people who keep bullying the NNBSP 20 years after it was encoded, and despite it is now widely supported. Also the ignorance related to NNBSP is still abysmal despite the very popular style manual of the French Imprimerie Nationale requires it?s use explicitly: EXCLAMATION MARK espace fine ins?cable ! justifying space (quoted/translated from figure p. 149; ISBN 9782743304829). Many thanks to all who took part in this thread ? that is very instructive and has brought up many new insights ? and likewise to those spinning of child threads and sharing material. Keep on the good work and be successful! Best regards, Marcel From unicode at unicode.org Tue Jan 15 17:47:14 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 15 Jan 2019 15:47:14 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: On Tue, Jan 15, 2019 at 1:47 PM James Kass via Unicode wrote: > > Although there probably isn't really any concerted effort to "keep > plain-text mediocre", it can sometimes seem that way. > Dennis Ritchie allegedly replied to requests for new features in C with ?If you want PL/I, you know where to find it.? C is still an austere language, and still well used, with users who want C++ or Java knowing where to find them. If you want all the features of rich text, use rich text. Avant-garde enthusiasts are on the leading edge by definition. That's > why they're known as trend setters. Unicode exists because > forward-looking people envisioned it and worked to make it happen. > Regardless of one's perception of exuberance, Unicode turned out to be > so much more than a fringe benefit. > Unicode exists because large corporations wanted to sell computers to users around the world, and found supporting a million different character sets was costly and buggy, and that users wanted to mix scripts in ways that a single character set didn't support and ISO 2022 and similar solutions just weren't cutting it. That's a clear user story. People can use italics on computers without problem. Twitter has chosen not to support italics on their platform, which users have found hacky work-arounds for. That's not such a clear user story; shouldn't Twitter add support for italics instead of changing every system in the world? 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 19:15:27 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 01:15:27 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: Enabling plain-text doesn't make rich-text poor. People who regard plain-text with derision, disdain, or contempt have every right to hold and share opinions about what plain-text is *for* and in which direction it should be heading.? Such opinions should receive all the consideration they deserve. From unicode at unicode.org Tue Jan 15 21:53:47 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 16 Jan 2019 04:53:47 +0100 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: <4520c27d-c54a-ea00-87f1-ed2b9cc43c04@orange.fr> On 16/01/2019 02:15, James Kass via Unicode wrote: > > Enabling plain-text doesn't make rich-text poor. > > People who regard plain-text with derision, disdain, or contempt have > every right to hold and share opinions about what plain-text is *for* > and in which direction it should be heading. Such opinions should > receive all the consideration they deserve. Perhaps there?s a need to sort out what plain text is thought to be across different user communities. Sometimes ?plain text? is just a synonym for _draft style_, considering that a worker should not need to follow any style guide, because (a) many normal keyboards don?t enable users to do so, and (b) the process is too complicated using mainstream extended keyboard layouts. From this point of view, any demand to key in directly a text in a locale?s accurate digital representation is likely to be considered an unreachable challenge and thus, an offense. But indeed, people are entitled not to screw down their requirements as of what text is supposed to look like. From that POV, draft style is unbearable, and being bound to it is then the actual offense. The first step would then be to beef up that draft style so that it integrates all characters needed for a fully featured representation of a locale?s language, from curly quotes to preformatted superscript. Unicode makes it possible, in the straight line of what was set up in ISO/IEC 6937. The next step is to design appropriate input methods. Today, we can even get back the u?n?d?e?r?l?i?n?e? that we were deprived of, by adding an appropriate dead key or combining diacritic, but that?s still experimental. It already works better, though, than the Unicode Syriac abbreviation control, whose overline is *not* rendered in Chrome on Linux, The same way, Unicode could encode a Latin italic control, or as Victor Gaultney proposes, a Latin italic start control and a Latin italic end control, directing the rendering engine to pick italics instead of drawing a linie along the rest of the word. However, the discussion about Fraktur typefaces in the parent thread made clear that reasoning in terms of roman vs italic is not really interoperable, because in Roman typefaces, italic is polysemic, as it?s used both for foreign words and for stress, while in Fraktur, stress is denoted by spacing out, and foreign words, by using roman. That would require a start and end pair of both Latin foreign word controls and Latin stress controls. 
As we see it from here, that would be even less implemented than the Syriac abbreviation format control. It might be considered Unicode conformant, since it would be part of the interoperable digital representation of Latin script using languages, and its use could be extended to other scripts. But that is *not* what I?m asking for. First, we aren?t writing in Fraktur any more, at least not in France nor in any other language using preformatted superscript abbreviation indicators. And second, if we need a document for full-fleshed publishing, we can use LaTeX or InDesign. What I?m asking for is simply that people are enabled to write in their language in a decent manner and can use that text in any environment without postprocessing *and* without looking downright bad. That might please even those who are looking at draft style with disdain. Best regards, Marcel From unicode at unicode.org Tue Jan 15 22:40:16 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 04:40:16 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> Message-ID: <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Victor Gaultney wrote, > Use of variation selectors, a single character modifier, or combining > characters also seem to be less useful options, as they act at the individual > character level and are highly impractical. They also violate the key concept > that italics are a way of marking a span of text as 'special' - not individual > letters. Matched punctuation works the same way and is a good fit for italic. The VS possibility would double the character count of any strings including them.? That may make it undesirable for groups like Twitter who have limits.? But math (mis)use doesn't affect the character count.? If the VS method were to be used, the math alphanumerics might continue to be used where possible, at least by Twitter users who already employ the math-alphas to make their corpus of legacy data. Using VS arose in the parent thread as a way of avoiding the necessity of adding additional characters to the standard.? (But we don't seem to be running out of available code space.)? The purpose of VS is to preserve variant letter form distinctions in plain-text, which seems to apply to italics.? Further, VS is an existing mechanism which wouldn't be expected to impact searching and so forth on savvy systems.? (An opening/closing pair of control characters also shouldn't impact searching.)? Finally, VS already works in existing technology and there wouldn't be a long down-time waiting for updates to the standard and implementation of same. (Not that we should rush to judgment or "solutions" here, just that an ad-hoc "solution" is possible and could be implemented by third-parties.) Concerns about statefulness in plain-text exist.? Treating "italic" as an opening/closing "punctuation" may help get around such concerns.? IIRC, it was proposed that the Egyptian cartouche be handled that way. Like emoji, people who don't like italics in plain text don't have to use them. 
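For what it's worth, a search implementation that wants to fold both conventions (math alphanumerics and variation selectors) away before matching needs very little machinery; here is a minimal Python sketch, illustrative only and not a description of what any particular engine does:

    # Minimal sketch: fold math-alphanumeric "styled" letters back to ASCII via
    # compatibility normalization and ignore variation selectors, so either
    # convention still matches an ordinary query.
    import unicodedata

    VS = {chr(cp) for cp in range(0xFE00, 0xFE10)} | \
         {chr(cp) for cp in range(0xE0100, 0xE01F0)}

    def fold_for_search(s: str) -> str:
        s = unicodedata.normalize("NFKC", s)
        return "".join(ch for ch in s if ch not in VS).casefold()

    print(fold_for_search("\U0001D413\U0001D404\U0001D417\U0001D413"))  # 'text' (from math bold TEXT)
    print(fold_for_search("i\uFE0Dt\uFE0Da\uFE0Dl\uFE0Di\uFE0Dc"))      # 'italic'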
From unicode at unicode.org Tue Jan 15 23:05:24 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 15 Jan 2019 21:05:24 -0800 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: On Tue, Jan 15, 2019 at 5:17 PM James Kass via Unicode wrote: > Enabling plain-text doesn't make rich-text poor. > Adding italics to Unicode will complicate the implementation of all rich text applications that currently support italics. > People who regard plain-text with derision, disdain, or contempt have > every right to hold and share opinions about what plain-text is *for* > and in which direction it should be heading. Such opinions should > receive all the consideration they deserve. > Really? There's no one here regards plain text with derision, disdain or contempt. I might complain about the people who claim to like plain text yet would only be happy with massive changes to it, though. However, plain text can be used standalone, and it can be used inside programs and other formats. Dismissing the people who use Unicode in ways that aren't plain text is unfair and hurts your case. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 00:17:46 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 06:17:46 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: <05235340-4f56-9c34-67e1-6f9c2e5829d1@gmail.com> Responding to David Starner, > I might complain about the people who claim to like plain text yet would > only be happy with massive changes to it, though. Most movie lovers welcomed talkies. People are free to cling to their rotary phones as long as they like.? They just can't press the pound sign. > However, plain text can be used standalone, and it can be used inside > programs and other formats. That remains true even post-emoji.? How would italics change that? > Dismissing the people who use Unicode in ways that aren't plain text > is unfair and hurts your case. It wasn't my intention to be dismissive, much, so point taken. Discussions like this one exist so that people can express concerns and share ideas towards resolutions. > Adding italics to Unicode will complicate the implementation of all rich > text applications that currently support italics. Would there be any advantages to rich-text apps if italics were added to Unicode?? Is there any cost/benefit data?? You've made an assertion about complication to rich-text apps which I can neither confirm nor refute. One possible advantage would be interoperability.? People snagging snippets of text from web pages or word processors and dropping data into their plain-text windows wouldn't be bamboozled by the unexpected.? If computer text is getting exchanged, isn't it better when it can be done in a standard fashion? From unicode at unicode.org Wed Jan 16 02:15:38 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 16 Jan 2019 09:15:38 +0100 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: <33661316-2594-8183-f268-e3430e36c2ae@orange.fr> On 16/01/2019 06:05, David Starner via Unicode wrote: [?] > [?] There's no one here regards plain text with derision, disdain or contempt. 
There is one sort of so-called plain text that looks unbearable to me. That is the draft-style plain text full of ASCII fallbacks. Especially those where Latin abbreviation indicators that are correctly superscript, are sitting on the baseline. Also those using ASCII space or Latin-1 non-breakable space to space off French punctuation, and where those marks are then cut off by line breaks, or torn apart by justification when such plain text is the backbone of rich text on the web (where remains unhacked, unlike in word processors where it?s fixed-width, and even then it?s ugly). > [?] Dismissing the people who use Unicode in ways that aren't plain text is unfair [?]. Is this statement applying the restrictive house policy about what is ?ordinary (plain) text? as it is found in TUS? I?m asking the question because even if this statement is a mark of support and empathy, I?m uncomfortable with the idea that there seems to be a subset of Unicode that despite being plain text by definition, cannot be used in every plain text string. Please feel free to post your definition of "plain text". I feel that it will add to the collection. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 02:57:15 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 08:57:15 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: Julian Bradfield wrote, > Oh, and what about dropped initials? They have been used in both > manuscripts and typography for many centuries - surely we must encode > them? Naa-aah, we just hack the full width presentation forms for that. ?rop ?aps in ?lain ?ext (Whether they actually drop depends on the font, though.) From unicode at unicode.org Wed Jan 16 03:12:23 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 16 Jan 2019 01:12:23 -0800 Subject: Encoding italic In-Reply-To: <05235340-4f56-9c34-67e1-6f9c2e5829d1@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> <05235340-4f56-9c34-67e1-6f9c2e5829d1@gmail.com> Message-ID: On Tue, Jan 15, 2019 at 10:19 PM James Kass via Unicode wrote: > Would there be any advantages to rich-text apps if italics were added to > Unicode? Is there any cost/benefit data? You've made an assertion > about complication to rich-text apps which I can neither confirm nor refute. It's trivial; virtually all rich-text apps support italics or specifically don't support italics. Suddenly they have to unify italics from the plain text with the higher level italics, or they have to exclude italics from the input data. > One possible advantage would be interoperability. 
People snagging > snippets of text from web pages or word processors and dropping data > into their plain-text windows wouldn't be bamboozled by the unexpected. > If computer text is getting exchanged, isn't it better when it can be > done in a standard fashion? Bamboozled by the unexpected? I think the expectations of those who have plain-text windows (who are still watching silents, in a sense) is that pasting data into them will not copy italics. As for more common users, a quick websearch shows many examples of people frustrated that they cut and paste something and details like bold and italics were carried along. This also establishes that current systems already allow rich text to be cut-and-pasted in a platform-specific manner. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Wed Jan 16 03:30:40 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 09:30:40 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <0af51bfc-5c40-8b7e-3c55-d560b1aedd27@gmail.com> I wrote, > The VS possibility would double the character count of any strings > including them. A kind list member has pointed out privately that the above is mistaken.? Twitter character counts aren't actually character counts.? Each math-alpha counts as two characters as do the VS characters.? So a string with VS characters interspersed would actually be triple rather than double. (I've also been advised that a lot of the math-alpha on Twitter involves fraktur, script, and double struck characters.? As was pointed out to me, that practice would probably continue even if Twitter enabled italic and bold styling as a feature.? Again, I do not personally know how widespread the practice is.) From unicode at unicode.org Wed Jan 16 05:23:59 2019 From: unicode at unicode.org (Victor Gaultney via Unicode) Date: Wed, 16 Jan 2019 11:23:59 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: James Kass wrote: > Concerns about statefulness in plain-text exist.? Treating "italic" as > an opening/closing "punctuation" may help get around such concerns. > IIRC, it was proposed that the Egyptian cartouche be handled that way. I do appreciate the technical issues surrounding statefulness and user expectation when they select, copy, and paste. However that has always been an issue. The Latin script (and many others) already has 'states', and that is reflected in the encoding of the markers that indicate the beginning and end of those states (parens, quotes, etc.). In the Latin script those markers are visually represented as separate glyphs, although sometimes enterprising font makers will use OpenType or Graphite to adjust those glyphs in context. Encoding 'begin italic' and 'end italic' would introduce difficulties when partial strings are moved, etc. But that's no different than with current punctuation. If you select the second half of a string that includes an end quote character you end up with a mismatched pair, with the same problems of interpretation as selecting the second half of a string including an 'end italic' character. Apps have to deal with it, and do, as in code editors. 
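To illustrate how little machinery that takes, here is a minimal Python sketch; the 'begin italic' and 'end italic' code points used below are purely hypothetical placeholders (nothing of the kind is encoded today), with Private Use values standing in for them:

    # Minimal sketch of how an app could tolerate unbalanced begin/end italic
    # marks, much as it tolerates an unmatched closing quotation mark.
    BEGIN_IT = "\uE000"   # hypothetical 'begin italic' (PUA stand-in)
    END_IT   = "\uE001"   # hypothetical 'end italic' (PUA stand-in)

    def render_to_html(s: str) -> str:
        out, open_run = [], False
        for ch in s:
            if ch == BEGIN_IT:
                if not open_run:
                    out.append("<i>")
                    open_run = True
            elif ch == END_IT:
                if open_run:
                    out.append("</i>")
                    open_run = False
                # a stray end mark is simply ignored
            else:
                out.append(ch)
        if open_run:
            out.append("</i>")   # close an unterminated run at end of text
        return "".join(out)

    print(render_to_html("lost my " + BEGIN_IT + "red" + END_IT + " slippers"))
    # lost my <i>red</i> slippers
    print(render_to_html("red" + END_IT + " slippers"))   # red slippers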
Apps (and font makers) can also choose how to deal with presenting strings of text that are marked as italic. They can choose to present visual symbols to indicate begin/end, such as /this/. Or they can present it using the italic variant of the font, if available. Yes, that brings up the issue of what to do if no italic counterpart is there. But that's already an issue with people using math characters for pseudo-italic. I'd guess that far, far more fonts in the world have italic counterparts than contain math chars, and the trend toward always having roman/italic matched pairs is something I've established in my research interviews.

Treating italic like punctuation is a win for a lot of people:
- Users get their italic content preserved in plain text
- Those who develop plain text apps (social media in particular) don't have to build a whole markup/markdown layer into their apps
- Misuse of math chars for pseudo-italic would likely disappear
- The text runs between markers remain intact, so they need no special treatment in searching, selecting, etc.
- It finally, and conclusively, would end the decades of the mess in HTML that surrounds <i> and <em>.

My main point in suggesting that Unicode needs these characters is that italic has been used to indicate specific meaning - this text is somehow special - for over 400 years, and that content should be preserved in plain text.
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From unicode at unicode.org Wed Jan 16 06:13:26 2019 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Wed, 16 Jan 2019 12:13:26 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <0af51bfc-5c40-8b7e-3c55-d560b1aedd27@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <0af51bfc-5c40-8b7e-3c55-d560b1aedd27@gmail.com> Message-ID:
> On 16 Jan 2019, at 09:30, James Kass via Unicode wrote:
> > > I wrote,
> > > The VS possibility would double the character count of any strings > > including them.
> > A kind list member has pointed out privately that the above is mistaken. Twitter character counts aren't actually character counts. Each math-alpha counts as two characters as do the VS characters. So a string with VS characters interspersed would actually be triple rather than double.

Odd! I have just briefly experimented with Twitter and it appears that any character ≥ U+1100 has a count of 2 and any character < U+1100 has a count of 1. I remember many years ago Twitter was incorrectly counting in UTF-16 encoding units, thus giving a count of 1 for BMP characters and a count of 2 for astral characters. That problem was fixed long ago.

André Schappo

> (I've also been advised that a lot of the math-alpha on Twitter involves fraktur, script, and double struck characters. As was pointed out to me, that practice would probably continue even if Twitter enabled italic and bold styling as a feature. Again, I do not personally know how widespread the practice is.) >
From unicode at unicode.org Wed Jan 16 06:16:17 2019 From: unicode at unicode.org (Andrew Cunningham via Unicode) Date: Wed, 16 Jan 2019 23:16:17 +1100 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID:
Hi Victor, an off list reply. The contents are just random thoughts sparked by an interesting conversation.
On Wed, 16 Jan 2019 at 22:44, Victor Gaultney via Unicode < unicode at unicode.org> wrote: > > - It finally, and conclusively, would end the decades of the mess in HTML > that surrounds <i> and <em>. > I am not sure that would fix the issue; more likely it would compound the issue, making it even more blurry what the semantic purpose is. HTML5 makes both <i> and <em> semantic ... and by definition the style of the elements is not necessarily italic. <em> for instance would be script-dependent, and <i> may be partially script-dependent when another appropriate semantic tag is missing. A character/encoding-level distinction is just going to compound the mess. And then there are all the other script-specific typographic / typesetting conventions that should also be considered. > My main point in suggesting that Unicode needs these characters is that > italic has been used to indicate specific meaning - this text is somehow > special - for over 400 years, and that content should be preserved in plain > text. > > > Underlining, bold text, interletter spacing, colour change, font style change all are used to apply meaning in various ways. Not sure why italic is special in this sense. Additionally, without encoding the meaning of italic, all you know is that it is italic, not what convention of semantic meaning lies behind it. And I am curious about your thoughts: if we distinguish italic in Unicode and encode some way of specifying italic text, wouldn't it make more sense to do away with italic fonts altogether and just roll the italic glyphs into the regular font? In theory, changing italic from a stylistic choice, as it currently is, to an encoding/character-level semantic is a paradigm shift. We don't have separate fonts for variation selectors or any other mechanism in Unicode, and it would seem to make sense to roll character glyph variation into a single font. And potentially exclude italicisation from being a viable axis in a variable font. Just speculation on my part. To clarify, I am neither for nor against encoding italics. But so far there doesn't seem to be a robust case for it. But if it were introduced I would prefer a system that was more inclusive of all scripts, giving proper analysis of typesetting and typographic conventions in each script and well-founded decisions on which should be encoded. Cherry-picking one feature relevant to a small set of scripts seems to be a problematic path. I have enough trouble with ordered and unordered lists and list markers in HTML without expanding the italics mess in HTML. -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 08:33:39 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 16 Jan 2019 15:33:39 +0100 Subject: wws dot org In-Reply-To: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Message-ID: <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> On 15/01/2019 19:22, Johannes Bergerhausen via Unicode wrote: > Dear list, > > I am happy to report that www.worldswritingsystems.org is now online. > > The web site is a joint venture by > > - Institut Designlabor Gutenberg (IDG), Mainz, Germany, > - Atelier National de Recherche Typographique (ANRT), Nancy, France and > - Script Encoding Initiative (SEI), Berkeley, USA. > > For every known script, we researched and designed a reference glyph. > > You can sort these 292 scripts by Time, Region, Name, Unicode version and Status.
> Exactly half of them (146) are already encoded in Unicode. So to date, Unicode has only made half its way, and for every single script in the Standard there is another script out there that remains still unsupported. First things first. When I first replied in the first thread of this year I already warned: >>> Having said that, still unsupported minority languages are top priority. I didn't guess that I opened a Pandora's box whose content would lead us far away from the only useful goal deeply embedded in the concept of Unicode: support all of the world's writing systems. Instead, we're discussing how to enable social media users to tune ephemeral messages even more, to attract even more of the scarce attention of overwhelmed co-users before being buried in the mass of a vanishing timeline. I sought feedback about using Unicode to get back the underlining feature known from the typewriter era. But like some other hints I provided, it went unpicked… Sadly it's uninteresting, no cherries. Also, if Unicode had to wait until enough characters are picked for adoption prior to encoding the missing scripts, I'm afraid the job won't ever be done… The industry is welcome to help speed up the process. Thanks to Johannes Bergerhausen for setting up and sharing this resource. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 10:23:40 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 16 Jan 2019 08:23:40 -0800 Subject: wws dot org In-Reply-To: <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 11:30:15 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Wed, 16 Jan 2019 17:30:15 +0000 (GMT) Subject: New ideas (from: wws dot org) In-Reply-To: References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> Message-ID: <74897b08.2613.16857b6a9e8.Webtop.231@btinternet.com> Asmus Freytag wrote as follows: > PS: of course, if a contemplated change, such as the one alluded to, > should be ill advised, its negative effects could have wide ranging > impacts...but that's not the topic here. If you object to encoding italics, please say so and, if possible, please provide some reasons. I am on moderated posts because one of my ideas, on which I am continuing to research, is permanently banned from being discussed in the Unicode mailing list. Every post that I send is screened by a moderator, and not every post gets through to the mailing list. There have been developments in my research project, such as the definition of an encoding space for the particular purpose, and I am proposing to express access to those items using U+FFF7 and a sequence of tag digits so that the items can be unambiguously encoded in Unicode plain text. That could be very useful for the future, yet I cannot even post about it to the Unicode mailing list because the topic is permanently banned. One character to implement a potentially very useful invention, and its encoding cannot be discussed in the Unicode mailing list. So many people will not even know that the suggestion has ever been made, and so they can neither think it a good idea nor make helpful comments or otherwise, because it cannot even be discussed.
So the encoding of italics, on which topic my posts have thus far been allowed through, is regarded as only mildly controversial in relation to my research project. I have sent a copy of this email to various people as well as to the Unicode mailing list, and it may well be that this post will not be allowed through to the Unicode mailing list due to the ban. So if you reply to this email, please do not send a copy to the Unicode mailing list unless you receive a copy listed as from me via Unicode rather than just listed as from me: although I would like to see discussion on the Unicode mailing list of the possibility of encoding U+FFF7 as a base character for these items, I do not wish such a discussion in the Unicode mailing list to happen by error. William Overington Wednesday 16 January 2019 From unicode at unicode.org Wed Jan 16 11:38:09 2019 From: unicode at unicode.org (Phake Nick via Unicode) Date: Thu, 17 Jan 2019 01:38:09 +0800 Subject: wws dot org In-Reply-To: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Message-ID: Feedback after briefly reading the East Asia section of the website:
1. I am pretty sure the "Kaida" script is not living anymore, according to the Wikipedia description.
2. Hentaigana refers to all alternative forms of kana that were used before modern standardization; I don't think they are still actively used now.
3. The meaning of "Old Hanzi" is not clear. If it is the same definition as the one stated in this blog: http://babelstone.blogspot.com/2007/07/old-hanzi.html , then it does not refer to a single script but instead refers to all historical ways to write Hanzi, including Oracle Bone script, Bronze script, and (Small) Seal script and such. And the list already separately includes Oracle Bone script, Bronze script and Seal script, which apparently makes this "Old Hanzi" entry redundant.
On Wed, 16 Jan 2019 at 02:25, Johannes Bergerhausen via Unicode wrote: > Dear list, > > I am happy to report that www.worldswritingsystems.org is now online. > > The web site is a joint venture by > > - Institut Designlabor Gutenberg (IDG), Mainz, Germany, > - Atelier National de Recherche Typographique (ANRT), Nancy, France and > - Script Encoding Initiative (SEI), Berkeley, USA. > > For every known script, we researched and designed a reference glyph. > > You can sort these 292 scripts by Time, Region, Name, Unicode version and > Status. > Exactly half of them (146) are already encoded in Unicode. > > Here you can find more about the project: > www.youtube.com/watch?v=CHh2Ww_bdyQ > > And is a link to see the poster: > https://shop.designinmainz.de/produkt/the-worlds-writing-systems-poster/ > > All the best, > Johannes > > > > > - Prof. Bergerhausen > > Hochschule Mainz, School of Design, Germany > > www.designinmainz.de > > www.decodeunicode.org > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Wed Jan 16 12:07:42 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Wed, 16 Jan 2019 10:07:42 -0800 Subject: New ideas (from: wws dot org) In-Reply-To: <74897b08.2613.16857b6a9e8.Webtop.231@btinternet.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> <74897b08.2613.16857b6a9e8.Webtop.231@btinternet.com> Message-ID: <2e461d85-3ba3-adf8-2435-4bdc25c5af24@ix.netcom.com> On 1/16/2019 9:30 AM, wjgo_10009 at btinternet.com wrote: > Asmus Freytag wrote as follows: > >> ?PS: of course, if a contemplated change, such as the one alluded to, >> should be ill advised, its negative effects could have wide ranging >> impacts...but that's not the topic here. > > If you object to encoding italics please say so and if possible please > provide some reasons. It's not the topic of this thread. Let's keep the discussion in one place. A./ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 14:53:05 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 16 Jan 2019 20:53:05 +0000 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: <20190116205305.213b335d@JRWUBU2> On Tue, 15 Jan 2019 13:25:06 +0100 Philippe Verdy via Unicode wrote: > If your fonts behave incorrectly on your system because it does not > map any glyph for NNBSP, don't blame the font or Unicode about this > problem, blame the renderer (or the application or OS using it, may > be they are very outdated and were not aware of these features, theyt > are probably based on old versions of Unicode when NNBSP was still > not present even if it was requested since very long at least for > French and even English, before even Unicode, and long before > Mongolian was then encoded, only in Unicode and not in any known > supported legacy charset: Mongolian was specified by borrowing the > same NNBSP already designed for Latin, because the Mongolian space > had no known specific behavior: the encoded whitespaces in Unicode > are compeltely script-neutral, they are generic, and are even > BiDi-neutral, they are all usable with any script). The concept of this codepoint started for Mongolian, but was generalised before the character was approved. Now, I understand that all claims about character properties that cannot be captured in the UCD should be dismissed as baseless, but if we believed the text of TUS we would find that NNBSP has some interesting properties with application only to Mongolian: 1) It has a shaping effect on following character. 2) It has zero width at the start of a line. 3) When the line-breaking algorithm does not provide enough line-breaking opportunities, it changes its line-breaking property from GL to BB. Or is property (3) appropriate for French? Richard. 
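For readers who do not carry UAX #14 around in their heads, the practical difference between those two line-break classes can be sketched as follows. This is a deliberately drastic simplification written only to illustrate the GL-versus-BB behaviour mentioned above, not the real line-breaking algorithm.

NNBSP = "\u202F"  # U+202F NARROW NO-BREAK SPACE

def break_opportunities(text, nnbsp_class="GL"):
    """Indices where a break is allowed under a toy two-rule model:
    after an ordinary space, and (only if NNBSP is treated as BB)
    immediately before a NNBSP. GL permits no break on either side."""
    allowed = []
    for i in range(1, len(text)):
        if text[i - 1] == " ":
            allowed.append(i)
        elif text[i] == NNBSP and nnbsp_class == "BB":
            allowed.append(i)
    return allowed

s = "environ 100" + NNBSP + "%"
print(break_opportunities(s, "GL"))  # [8]     - "100 %" can never be split
print(break_opportunities(s, "BB"))  # [8, 11] - the NNBSP plus suffix may wrap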
From unicode at unicode.org Wed Jan 16 21:38:46 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 Jan 2019 03:38:46 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: Victor Gaultney wrote, > Treating italic like punctuation is a win for a lot of people: Italic Unicode encoding is a win for a lot of people regardless of approach.? Each of the listed wins remains essentially true whether treated as punctuation, encoded atomically, or selected with VS. > My main point in suggesting that Unicode needs these characters is that > italic has been used to indicate specific meaning - this text is somehow > special - for over 400 years, and that content should be preserved in plain > text. ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf ) "Plain text must contain enough information to permit the text to be rendered legibly, and nothing more." The argument is that italic information can be stripped yet still be read.? A persuasive argument towards encoding would need to negate that; it would have to be shown that removing italic information results in a loss of meaning. The decision makers at Unicode are familiar with italic use conventions such as those shown in "The Chicago Manual of Style" (first published in 1906).? The question of plain-text italics has arisen before on this list and has been quickly dismissed. Unicode began with the idea of standardizing existing code pages for the exchange of computer text using a unique double-byte encoding rather than relying on code page switching.? Latin was "grandfathered" into the standard.? Nobody ever submitted a formal proposal for Basic Latin.? There was no outreach to establish contact with the user community -- the actual people who used the script as opposed to the "computer nerds" who grew up with ANSI limitations and subsequent ISO code pages.? Because that's how Unicode rolled back then.? Unicode did what it was supposed to do WRT Basic Latin. When someone points out that italics are used for disambiguation as well as stress, the replies are consistent. "That's not what plain-text is for."? "That's not how plain-text works."? "That's just styling and so should be done in rich-text." "Since we do that in rich-text already, there's no reason to provide for it in plain-text."? "You can already hack it in plain-text by enclosing the string with slashes."? And so it goes. But if variant letter form information is stripped from a string like "Jackie Brown", the primary indication that the string represents either a person's name or a Tarantino flick title is also stripped.? "Thorstein Veblen" is either a dead economist or the name of a fictional yacht in the Travis McGee series.? And so forth. Computer text tradition aside, nobody seems to offer any legitimate reason why such information isn't worthy of being preservable in plain-text.? Perhaps there isn't one. I'm not qualified to assess the impact of italic Unicode inclusion on the rich-text world as mentioned by David Starner.? Maybe another list member will offer additional insight or a second opinion. 
From unicode at unicode.org Wed Jan 16 21:51:57 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 04:51:57 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: <20190116205305.213b335d@JRWUBU2> References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: On 16/01/2019 21:53, Richard Wordingham via Unicode wrote: > > On Tue, 15 Jan 2019 13:25:06 +0100 > Philippe Verdy via Unicode wrote: > >> If your fonts behave incorrectly on your system because it does not >> map any glyph for NNBSP, don't blame the font or Unicode about this >> problem, blame the renderer (or the application or OS using it, may >> be they are very outdated and were not aware of these features, theyt >> are probably based on old versions of Unicode when NNBSP was still >> not present even if it was requested since very long at least for >> French and even English, before even Unicode, and long before >> Mongolian was then encoded, only in Unicode and not in any known >> supported legacy charset: Mongolian was specified by borrowing the >> same NNBSP already designed for Latin, because the Mongolian space >> had no known specific behavior: the encoded whitespaces in Unicode >> are compeltely script-neutral, they are generic, and are even >> BiDi-neutral, they are all usable with any script). > > The concept of this codepoint started for Mongolian, but was generalised > before the character was approved. Indeed it was proposed as MONGOLIAN SPACE at block start, which was consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much more. When Unicode argued in favor of a unification with an existing space character, this was pointed out as impracticable, and the need of a specific Mongolian space for the purpose of appending suffixes was underscored. Only in London in September 1998 was it agreed that “The Mongolian Space is retained but moved to the general punctuation block and renamed ‘Narrow No Break Space’”. However, unlike for the Mongolian Combination Symbols sequencing a question and exclamation mark both ways, a concrete rationale as to how useful the NNBSP could be in other scripts doesn't seem to have been put on the table when the move to General Punctuation was decided. > > Now, I understand that all claims about character properties that cannot > be captured in the UCD should be dismissed as baseless, but if we > believed the text of TUS we would find that NNBSP has some interesting > properties with application only to Mongolian: As a side note: the relevant text of TUS doesn't predate version 11 (2018). > > 1) It has a shaping effect on following character. > 2) It has zero width at the start of a line. > 3) When the line-breaking algorithm does not provide enough > line-breaking opportunities, it changes its line-breaking property > from GL to BB.
I don't believe that these additions to TUS are in any way able to fix the many issues with NNBSP in Mongolian, causing so much headache and ending up in a unanimous desire to replace it with a new MONGOLIAN SUFFIX CONNECTOR. Indeed some suffixes are as long as 7 letters, e.g. ?????????; see https://lists.w3.org/Archives/Public/public-i18n-mongolian/2015JulSep/att-0036/DS05_Mongolian_NNBSP_Connected_Suffixes.pdf > > Or is property (3) appropriate for French? No, it isn't. It only introduces new flaws for a character that, despite being encoded for Mongolian with specific handling intended, was readily ripped off for use in French, as Philippe Verdy reported; to that extent, it is actually an encoding error in Mongolian that brought the long-missing narrow non-breakable thin space into the UCS, into the block where it really belongs, and where it would have been encoded from the beginning had there been no desire to keep it proprietary. That is the hidden (almost occult) fact that stances like “The NNBSP can be used to represent the narrow space occurring around punctuation characters in French typography, which is called an ‘espace fine insécable.’” (TUS) and “When NARROW NO-BREAK SPACE occurs in French text, it should be interpreted as an ‘espace fine insécable’.” (UAX #14) are stemming from. The underlying meaning as I understand it now is like: “The non-breakable thin space is usually a vendor-specific layout control in DTP applications; it's also available via a TeX command. However, if you are interested in an interoperable representation, here's a Unicode character you can use instead.” Due to the way NNBSP made its delayed way into Unicode, font support was reported, as late as almost exactly two years ago, to be extremely scarce, as this analysis of the first 47 fonts on Windows 10 shows: https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf Surprisingly for me, Courier New has NNBSP. We must have been using old copies. I'm really glad that this famous and widely used typeface has been updated. Please disregard my previous posting about Courier New unsupporting NNBSP. I'll need to use a font manager to output a complete list with respect to NNBSP support. I'm utterly worried about the fate of the non-breaking thin space in Unicode, and I wonder why the French and Canadian French people present at setup – either on the Unicode side or on the JTC1/SC2/WG2 side – didn't get this character encoded in the initial rush. Did they really sell themselves and their locales to DTP lobbyists? Or were they tricked out? Also, at least one French typographer was extremely upset about Unicode not gathering feedback from typographers. That blame is partly wrong, since at least one typographer was and still is present in WG2, and even if not a Frenchman (but knowing French), as an Anglophone he might have been aware of the most outstanding use case of NNBSP with English (both British and American) quotation marks when a nested quotation starts or ends a quotation, where _“ ‘_ or _‘ “_ and _’ ”_ or _” ’_ are preferred over the unspaced compounds (_“‘_ or _‘“_ and _’”_ or _”’_), at least with proportional fonts. And not to forget the SI-conformant (and later ISO 80000-conformant) use of a thin space (non-breakable of course) for the purpose of grouping digits into triads, both before *and after* the decimal separator. Thanks to Richard Wordingham for catching this. It's a very good point.
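To illustrate that last point, here is a small sketch (plain Python, no locale machinery assumed) that groups digits into triads with U+202F NARROW NO-BREAK SPACE on both sides of the decimal separator, in the SI / ISO 80000 style described above.

NNBSP = "\u202F"  # U+202F NARROW NO-BREAK SPACE

def group_digits(number_text, decimal_sep=","):
    """Insert NNBSP every three digits, before and after the separator."""
    int_part, _, frac_part = number_text.partition(decimal_sep)
    # Group the integer part into triads from the right ...
    grouped_int = NNBSP.join(
        [int_part[max(0, i - 3):i] for i in range(len(int_part), 0, -3)][::-1]
    )
    # ... and the fractional part into triads from the left.
    grouped_frac = NNBSP.join(
        frac_part[i:i + 3] for i in range(0, len(frac_part), 3)
    )
    return grouped_int + (decimal_sep + grouped_frac if frac_part else "")

print(group_digits("299792,458"))        # 299 792,458 (spaces are U+202F)
print(group_digits("31415926,5358979"))  # 31 415 926,535 897 9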
Best regards, Marcel From unicode at unicode.org Wed Jan 16 23:45:32 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 16 Jan 2019 21:45:32 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <86c92c68-921b-e012-fe35-1c14aabc2031@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 00:27:21 2019 From: unicode at unicode.org (Martin J. Dürst via Unicode) Date: Thu, 17 Jan 2019 06:27:21 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> On 2019/01/17 12:38, James Kass via Unicode wrote: > ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf ) > > "Plain text must contain enough information to permit the text to be > rendered legibly, and nothing more." > > The argument is that italic information can be stripped yet still be > read. A persuasive argument towards encoding would need to negate that; > it would have to be shown that removing italic information results in a > loss of meaning. Well, yes. But please be aware of the fact that characters and text are human inventions grown and developed in many cultures over many centuries. It's not something where a single sentence will make all the subsequent decisions easy. So even if you can find examples where the presence or absence of styling clearly makes a semantic difference, this may still not be enough. It's only when it's often or overwhelmingly (as opposed to occasionally) the case that a styling difference makes a semantic difference that this would start to become a real argument for plain-text encoding of italics (or other styling information). To give a similar example, books about typography may discuss the different shapes of 'a' and 'g' in various fonts (often, the roman variant uses one shape (e.g. the 'g' with two circles), and the italic uses the other (e.g. the 'g' with a hook towards the bottom right)). But just because these shapes are semantically different in this context doesn't mean that they need to be distinguished at the plain-text level. (There are variants for IPA that are restricted to specific shapes, namely 'ɑ' and 'ɡ', but that's a separate issue.) > The decision makers at Unicode are familiar with italic use conventions > such as those shown in "The Chicago Manual of Style" (first published in > 1906). The question of plain-text italics has arisen before on this > list and has been quickly dismissed. > > Unicode began with the idea of standardizing existing code pages for the > exchange of computer text using a unique double-byte encoding rather > than relying on code page switching. Latin was "grandfathered" into the > standard. Nobody ever submitted a formal proposal for Basic Latin. > There was no outreach to establish contact with the user community -- > the actual people who used the script as opposed to the "computer nerds" > who grew up with ANSI limitations and subsequent ISO code pages. Because > that's how Unicode rolled back then. Unicode did what it was supposed > to do WRT Basic Latin. I think most Unicode specialists have chosen to ignore this thread by this point.
In their defense, I would like to point out that among the people who started Unicode, there were definitely many people who were very familiar with styling needs. As a simple example, Apple was interested in styled text from the very beginning. Others were very familiar with electronic publishing systems. There were also members from the library community, who had their own requirements and character encoding standards. And many must have known TeX and other kinds of typesetting and publishing software. GML and then SGML were developed by IBM. Based on these data points, and knowing many of the people involved, my description would be that decisions about what to encode as characters (plain text) and what to deal with on a higher layer (rich text) were taken with a wide and deep background, in a gradually forming industry consensus. That doesn't mean that explicit proposals were made for all these decisions. But it means that even where these decisions were made implicitly (at least on the level of the Consortium and the ISO/IEC and national standards body committees), they were made with a full and rich understanding of user needs and technology choices. This led to the layering we have now: case distinctions at the character level, but style distinctions at the rich-text level. Any good technology has layers, and it makes a lot of sense to keep established layers unless some serious problem is discovered. The fact that Twitter (currently) doesn't allow styled text, and that there is a small number of people who (mis)use Math alphabets for writing italics and the like on Twitter, doesn't look like a serious problem to me. > When someone points out that italics are used for disambiguation as well > as stress, the replies are consistent. > > "That's not what plain-text is for." "That's not how plain-text > works." "That's just styling and so should be done in rich-text." > "Since we do that in rich-text already, there's no reason to provide for > it in plain-text." "You can already hack it in plain-text by enclosing > the string with slashes." And so it goes. As such, these answers might indeed not look very convincing. But they are given in the overall framework of text representation in today's technology (see above). And please note that the end user doesn't ask for "italics in plain text"; they ask for "italics on Twitter" or some such. If you ask for "italics in plain text", then to people understanding the whole technology stack, that very much sounds as if you grew up with ASCII and similar plain-text limitations, continuing to be a computer nerd who hasn't yet seen or understood rich text. > But if variant letter form information is stripped from a string like > "Jackie Brown", the primary indication that the string represents either > a person's name or a Tarantino flick title is also stripped. "Thorstein > Veblen" is either a dead economist or the name of a fictional yacht in > the Travis McGee series. And so forth. In probably around 99% or more of the cases, the semantic distinction would be obvious from the context. Also, for probably at least 90% of the readership, the style distinction alone wouldn't induce a semantic distinction, because most of the readers are not familiar with these conventions. (If you doubt that, please go out on the street and ask people what italics are used for, and count how many of them mention film titles or ship names.)
(And just while we are at it, it would still not be clear which of several potential people named "Jackie Brown" or "Thorstein Veblen" would be meant.) > Computer text tradition aside, nobody seems to offer any legitimate > reason why such information isn't worthy of being preservable in > plain-text.? Perhaps there isn't one. See above. > I'm not qualified to assess the impact of italic Unicode inclusion on > the rich-text world as mentioned by David Starner.? Maybe another list > member will offer additional insight or a second opinion. I'd definitely second David Starner on this point. The more options one has to represent one and the same thing (italic styling in this thread), the more complex and error-prone the technology gets. Regards, Martin. From unicode at unicode.org Thu Jan 17 00:36:13 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 16 Jan 2019 22:36:13 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: On Wed, Jan 16, 2019 at 7:41 PM James Kass via Unicode wrote: > Computer text tradition aside, nobody seems to offer any legitimate > reason why such information isn't worthy of being preservable in > plain-text. Perhaps there isn't one. > Worthy of being preservable? Again, if you want rich text, you know where to find it. Maybe italics could have been encoded in plain text, even as late as 1991. But more than a quarter century on, everything supports italics with a few rare exceptions. You're changing everything at a very low level for a handful of systems. On the other hand, tradition matters. Again, at the bottom of this email I'm drafting is "*B* *I* *U* | *A*? tT?|?"; that is, bold, italics, underline, text color, text size, and extra options, like font choice and lists. Even non-computer geeks are familiar with that distinction. What's the advantage of moving one feature into Unicode and breaking the symmetry? On the other hand, most people won't enter anything into a tweet they can't enter from their keyboard, and if they had to, would resort to cut and paste. The only people Unicode italics could help without change are people who already can use mathematical italics. If you don't have buy-in from systems makers, people will continue to lack practical access to italics in plain text systems. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 01:47:46 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 08:47:46 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> On 17/01/2019 07:36, David Starner via Unicode wrote: [?] > On the other hand, most people won't enter anything into a tweet they can't enter from their keyboard, and if they had to, would resort to cut and paste. The only people Unicode italics could help without change are people who already can use mathematical italics. If you don't have buy-in from systems makers, people will continue to lack practical access to italics in plain text systems. Yes that is the point here, and that?s why I wasn?t proposing anything else than we can input right from the current keyboard layout. 
For italic plain text we would need a second keyboard layout or some corresponding feature, and switch back and forth between the two. It?s feasible, at least for a wide subset of Latin locales, but it?s an action similar to changing the type wheel or the ball-head. Now thankfully the word is out. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 01:54:27 2019 From: unicode at unicode.org (Johannes Bergerhausen via Unicode) Date: Thu, 17 Jan 2019 08:54:27 +0100 Subject: wws dot org In-Reply-To: References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Message-ID: Thanks for the input. I?ll discuss it with Deborah Anderson. We can make possible changes with the next update of the site to Unicode 12. > Am 16.01.2019 um 18:38 schrieb Phake Nick : > > Feedback after briefly reading the East Asia section of the website: > 1. I am pretty sure the "Kaida" script is not living anymore, according to Wikipedia description > 2. Hentaigana refers to all alternative form of kana that're used before modern standardization, I don't think they're still used actively now. > 3. The meaning of the "Old Hanzi" is not clear. If it is the same definition as the one stated in this blog: http://babelstone.blogspot.com/2007/07/old-hanzi.html , then it is not referring to a single script and instead refer to all historical ways to write Hanzi, including Oracle Bone script, Bronze script, and (Small) Seal script and such. And the list have already separately include oracle bone script, bronze script and seal script, which apparently make this "old hanzi" entry redundant. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 02:51:48 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 Jan 2019 08:51:48 +0000 Subject: Encoding italic In-Reply-To: <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> Message-ID: <30bf5809-e89f-9b93-c0b9-021b96e98d48@gmail.com> On 2019-01-17 6:27 AM, Martin J. D?rst replied: > ... > So even if you can find examples where the presence or absence of > styling clearly makes a semantic difference, this may or will not be > enough. It's only when it's often or overwhelmingly (as opposed to > occasionally) the case that a styling difference makes a semantic > difference that this would start to become a real argument for plain > text encoding of italics (or other styling information). (also from PDF chapter 2,) "Plain text is public, standardized, and universally readable." The UCS is universal, which implies that even edge cases, such as failed or experimental historical orthographies, are preserved in plain text. > ... > I think most Unicode specialists have chosen to ignore this thread by > this point. Those not switched off by the thread title may well be exhausted and pressed for time because of the UTC meeting. > ... > Based by these data points, and knowing many of the people involved, my > description would be that decisions about what to encode as characters > (plain text) and what to deal with on a higher layer (rich text) were > taken with a wide and deep background, in a gradually forming industry > consensus. 
(IMO) All of which had to deal with the existing font size limitations of 256 characters and the need to reserve many of those for other textual symbols as well as box drawing characters.? Cause and effect.? The computer fonts weren't designed that way *because* there was a technical notion to create "layers".? It's the other way around.? (If I'm not mistaken.) >> ..."Jackie Brown"... > ... > Also, for probably at least 90% of > the readership, the style distinction alone wouldn't induce a semantic > distinction, because most of the readers are not familiar with these > conventions. Proper spelling and punctuation seem to be dwindling in popularity, as well.? There's a percentage unable to make a semantic distinction between 'your' and 'you?re'. > (If you doubt that, please go out on the street and ask people what > italics are used for, and count how many of them mention film titles or > ship names.) Or the em-dash, en-dash, Mandaic letter ash, or Gurmukhi sign yakash.? Sure, most street people have other interests. > (And just while we are at it, it would still not be clear which of > several potential people named "Jackie Brown" or "Thorstein Veblen" > would be meant.) Isn't that outside the scope of italics?? (winks) From unicode at unicode.org Thu Jan 17 02:58:57 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 Jan 2019 08:58:57 +0000 Subject: NNBSP In-Reply-To: References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: <20190117085857.33e703e5@JRWUBU2> On Thu, 17 Jan 2019 04:51:57 +0100 Marcel Schneider via Unicode wrote: > Also, at least one French typographer was extremely upset > about Unicode not gathering feedback from typographers. > That blame is partly wrong since at least one typographer > was and still is present in WG2, and even if not being a > Frenchman (but knowing French), as an Anglophone he might > have been aware of the most outstanding use case of NNBSP > with English (both British and American) quotation marks > when a nested quotation starts or ends a quotation, where > _???_ or _???_ and _???_ or _???_ are preferred over the > unspaced compounds (_??_ or _??_ and _??_ or _??_), at > least with proportional fonts. There's an alternative view that these rules should be captured by the font and avoid the need for a spacing character. There is an example in the OpenType documentation of the GPOS table where punctuation characters are moved rightwards for French. This alternative conception hits the problem that mass market Microsoft products don't select font behaviour by language, unlike LibreOffice and Firefox. (The downside is that automatic font selection may then favour a font that declares support for the language, which gets silly when most fonts only support that language and don't declare support.) Another spacing mess occurs with the Thai repetition mark U+0E46 THAI CHARACTER MAIYAMOK, which is supposed to be separated from the duplicated word by a space. I'm not sure whether this space should expand for justification any more often than inter-letter spacing. 
Some fonts have taken to including the preceding space in the character's glyph, which messes up interoperability. An explicit space looks ugly when the font includes the space in the repetition mark, and the lack of an explicit space looks illiterate when the font excludes the leading space. Richard. From unicode at unicode.org Thu Jan 17 03:05:03 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 10:05:03 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: Courier New was lacking NNBSP on Windows 7. It is including it on Windows 10. The tests I referred to were made 2 years ago. I confess that I was so disappointed to see Courier New unsupporting NNBSP a decade after encoding, while many relevant people in the industry were surely aware of its role and importance for French (at least those keeping a branch office in France), that I gave it up. Turns out that foundries are delaying support until the usage is backed by TUS, which happened in 2014, timely for Windows 10. (I?m lacking hints about Windows 8 and 8.1.) Superscripts are a handy parallel showcasing a similar process. As long as preformatted superscripts are outlawed by TUS for use in the digital representation of abbreviation indicators, vendors keep disturbing their glyphs with what one could start calling an intentional metrics disorder (IMD). One can also rank the vendors on the basis of the intensity of IMD in preformatted superscripts, but this is not the appropriate thread, and anyhow this List is not the place. A comment on CLDR ticket #11653 is better. [?] > Due to the way made its delayed way into Unicode, font > support was reported as late as almost exactly two years ago to > be extremely scarce, this analysis of the first 47 fonts on > Windows 10 shows: > > https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf > > Surprisingly for me, Courier New has NNBSP. We must have been > using old copies. I?m really glad that this famous and widely > used typeface has been unpdated. Please disregard my previous > posting about Courier New unsupporting NNBSP. [?] Marcel From unicode at unicode.org Thu Jan 17 04:51:35 2019 From: unicode at unicode.org (Victor Gaultney via Unicode) Date: Thu, 17 Jan 2019 10:51:35 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> Message-ID: <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> ( I appreciate that UTC meetings are going on - I too will be traveling a bit over the next couple of weeks, so may not respond quickly. ) Support for marking 'italic' in plain text - however it's done - would certainly require changes in text processing. That would also be the case for some of the other span-like issues others have mentioned. 
However a clear model for how to handle that could solve all the issues at once. Italic would only be one application of that model, and only applicable to certain scripts. Other scripts might have parallel issues. BTW - I'm speaking only about span-like things that encode content, not the additional level of rich-text presentation. If however, we say that this "does not adequately consider the harm done to the text-processing model that underlies Unicode", then that exposes a weakness in that model. That may be a weakness that we have to accept for a variety of reasons (technical difficulty, burden on developers, UI impact, cost, maturity). We then have to honestly admit that the current model cannot always unambiguously encode text content in English and many other languages. It is impossible to express Crystal's distinction between 'red slippers' and '/red/ slippers' in plain text without using other characters in non-standardized ways. Here I am using my favourite technique for this - /slashes/. There are other uses of italic that indicate difference in actual meaning, many that go back centuries, and for which other span-like punctuation like quotes aren't used. Examples: - Titles of books, films, compositions, works of art: 'Daredevil' - the Marvel comics character vs. '/Daredevil/' - the Netflix series. - Internal voice, such as a character's private thoughts within a narrative: 'She pulled out a knife. /What are you doing? How did you find out.../' - Change of author/speaker, as in editorial comments: '/The following should be considered.../' - Heavy stress in speech, which is different than Crystal's distinction: 'Come here /this instant/' - Examples: 'The phrase /I could care less/...' (quotes are sometimes used for this one) Is it important to preserve these distinctions in plain text? The text seems 'readable' without them, but that requires some knowledge of context. And without some sort of other marking, as I've done, some of the meaning is lost. This is why italics within text have always been considered an editorial decision, not a typesetting one. In a similar way, we really don't need to include diacritics when encoding French. In all but a few rare cases, French is perfectly 'readable' without accents - the content can usually be inferred from context. Yet we would never consider unaccented French to be correct. More evidence for italics as an important element within encoded text comes from current use. A couple of years ago I collected every tweet that referred to italics for a month. People frequently complained that they were not able to express themselves fully without italics, and resorted to 40 different techniques to try and mark words and phrases as 'italic'. In the current model, plain text cannot fully preserve important distinctions in content. Maybe we just need to admit and accept that. But maybe an enhancement to the text processing model would enable more complete encoding of content, both for italics in Latin script and for other features in other scripts. As for how the UIs of the world would need to change: Until there is a way to encode italic in plain text there's no motivation for people to even experiment and innovate. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Jan 17 04:53:41 2019 From: unicode at unicode.org (Victor Gaultney via Unicode) Date: Thu, 17 Jan 2019 10:53:41 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: Andrew Cunningham wrote: > Underlying, bold text, interletter spacing, colour change, font style > change all are used to apply meaning in various ways. Not sure why > italic is special in this sense. Italic is uniquely different from these in that the meaning has been well-established in our writing system for centuries, and is consistently applied. The only one close to being this is use of interletter spacing for distinction, particularly in the German and Czech tradition. Of course, a model that can encode span-like features such as italic could then support other types of distinction. However the meaning of that distinction within the writing system must be clear. IOW people do use colour change to add meaning, but the meaning is not consistent, and so preserving it in plain text is relatively pointless. Even Bold doesn't have a consistent meaning other than - but that's a separate conversation. > And I am curious on your thoughts, if we distinguish italic in > Unicode, encode some way of spacifying italic text, wouldn't it make > more sense to do away with italic fonts all together? and just roll > the italic glyphs into the regular font? That's actually being done now. OpenType variation fonts allow a variety of styles within a single 'font', although I personally feel using that for italic is misguided. The reality is that the most commonly used Latin fonts - OS core ones - all have italic counterparts, so app creators only have to switch to using that counterpart for that span. And if the font has no italic counterpart then a fallback mechanism can kick in - just like is done when a font doesn't have a glyph to represent a character. > In theory changing italic from a stylistic choice as it currently is > to a encoding/character level semantic is a paradigmn shift. Yes it would be - but it could be a beneficial shift, and one that more completely reflects distinctions in the Latin script that go back over 400 years. > But it it were introduced I would prefer a system that was more > inclusive of all scripts, giving proper analysis of typeseting and > typographic conventions in each script and well founded decisions on > which should be encoded. Cherry picking one feature relevant to a > small set of scripts seems to be a problematic path. The core issue here is not really italic in Latin - that's only one case. An adjusted text model that supports span-like text features, could also unlock benefits for other scripts that have consistent span-like features. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Jan 17 05:21:56 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 17 Jan 2019 12:21:56 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: On Thu, 17 Jan 2019 at 05:01, Marcel Schneider via Unicode < unicode at unicode.org> wrote: > On 16/01/2019 21:53, Richard Wordingham via Unicode wrote: > > > > On Tue, 15 Jan 2019 13:25:06 +0100 > > Philippe Verdy via Unicode wrote: > > > >> If your fonts behave incorrectly on your system because it does not > >> map any glyph for NNBSP, don't blame the font or Unicode about this > >> problem, blame the renderer (or the application or OS using it, may > >> be they are very outdated and were not aware of these features, theyt > >> are probably based on old versions of Unicode when NNBSP was still > >> not present even if it was requested since very long at least for > >> French and even English, before even Unicode, and long before > >> Mongolian was then encoded, only in Unicode and not in any known > >> supported legacy charset: Mongolian was specified by borrowing the > >> same NNBSP already designed for Latin, because the Mongolian space > >> had no known specific behavior: the encoded whitespaces in Unicode > >> are compeltely script-neutral, they are generic, and are even > >> BiDi-neutral, they are all usable with any script). > > > > The concept of this codepoint started for Mongolian, but was generalised > > before the character was approved. > > Indeed it was proposed as MONGOLIAN SPACE at block start, which was > consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much > more. But the French "espace fine insécable" was requested long, long before Mongolian was discussed for encoding in the UCS. The problem is that the initial rush for French was made in a period when Unicode and ISO were competing and not in sync, so no agreement could be found until there was a decision to merge the efforts. The early rush was in ISO, which was still not using a character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, the focus being only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA, including the existing ISO 8859-*, GBK, and some national or de facto standards (Russia, Thailand, Japan, Korea). This early rush did not involve typographers (well, there was Adobe at this time, but it was still using another, unrelated technology).
Font standards did not exist yet and competing ones went incompatible ways; everything was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, which needed no character encoding at all, but only encoded glyphs!). If publishers had been involved, they would have revealed that they all needed various whitespaces for correct typography (i.e. layout). Typographers themselves did not care about whitespaces because they had no value for them (no glyph to sell). Adobe's publishing software was then completely proprietary (just like Microsoft's and others' such as Lotus and WordPerfect...). Years ago I was working for the French press, and they absolutely required us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, dictionaries. It was even mandatory to enter these [FINE] in the composed text, and they trained their typists or ad sellers to use it (that character was not "sold" in classified ads; it was necessary for correct layout, notably in narrow columns, and not using it confused the readers, notably around the ":" colon). It had to be non-breaking, non-expanding under justification, narrower than digits and even narrower than the standard non-justified whitespace, and it was consistently used as a digit grouping separator. But at that time the most common OSes did not support it natively because there was no vendor charset supporting it (and in fact most OSes were still unable to render proportional fonts everywhere and were frequently limited to 8-bit encodings: DOS, Windows, Unix(es), and even Linux at its early start). So an intermediate solution was needed. The US chose not to use the non-breakable thin space at all, because it was not needed for basic Latin in English, but also because of the huge prevalence of 7-bit ASCII for everything (with its own national symbol for the "$", competing with other ISO 646 variants). There were tons of legacy applications developed over decades that did not support anything else, and interoperability in the US was available only with ASCII; everything else was unreliable. If you remember the early years when the Internet started to develop outside the US, you remember the nightmare of non-interoperable 8-bit charsets and the famous "mojibake" we saw everywhere. Then the competition between ISO and Unicode lasted too long. But it was considered "too late" for French to change anything, and Windows, used in so many places by so many users, promoted the use of the Windows-1252 charset (which had a few updates before it was definitively frozen: there was no place for NNBSP in it). Typographers and publishers were upset: to use the NNBSP they still needed to use proprietary *document* encodings. The W3C did not help much either (it took long to finally adopt the UCS as a mandatory component of HTML; before that, it depended only on the old IANA charset database, promoting only the work of vendors and a few ISO standards). France itself wanted to keep its own national variant of ISO 646 (inherited from telegraphic systems), but it was finally abandoned: everybody was already using Windows-1252 or ISO 8859-1 (even early Unix adopters, which used a preliminary version made by Digital/DEC, then promoted by X11), or otherwise used Adobe proprietary encodings.
Unix itself had no standard (so many different variants including with other OSes for industrial or accounting systems, made notably by IBM,, which created so many variants, almost one for each submarket, multiple ones in the same country, each time split into mutliple variants between those based on ASCII, and those based on EBCDIC...) The truth is that publishers were forgotten, because their commercial market was much narrower: each publisher then used its own internal conventions. Even libaries used their own classifications. There was no attempt to unifify the needs for publishers (working at document level) and data processors (including OSes). This effort started only very late, when W3C finally started to work seriously on fixing HTML, and make it more or less interoperable with SGML (promoted by publishers). But at national level, there were still lot of other competing standards (let's remember teletext, including the Minitel terminal and Antiope for TV). People at home did not have access to any system capable of rendering proportionaly fonts. All early computers for personal use were based on fixed-width 8-bit fonts (including in Japan). China and Korea were still not technology advanced as they are today (there were some efforts but they were costly and there was little return at that time). The adoption of the UCS was extremely long, and it is still not competely finished even if now its support is mandatory in all new computiong standards and their revisions. The last segment where it still resists is the mobile phone industry (how can the SMS be so restricted and so much non-interoperable, and inefficient?) So French has a long tradition for its "fine", its support was demanded since long but constantly ignored by vendors making "the" standard. Publishers themselves resisted against the adoption of the web as a publishing platform: they prefered their legacy solutions as well, and did not care much about interoperability, so they did not pressure enough the standard makers to adopt the "fine". The same happened in US. There was no "commercial" incentive to adopt it and littel money coming from that sector (that has since suffered a lot from the loss of advertizing revenue, the competition of online publishers, the explosion of paper cost, but as well from the huge piracy level made on the Internet that reduced their sales and then their effective measured audience; the same is happening now on the TV and radio market; and on the Internet the adverizing market has been concentrated a lot and its revenues are less and less balanced; photographs and reporters have difficulties now to live from their work). And there's little incentive now for creating quality products: so many products are developed and distributed very fast, and not enough people care about quality, or won't pay for it. The old good practives of typographs and publishers are most often ignored, they look "exotic" or "old-fashioned", and so many people say now these are "not needed" (just like they'll say that supporting multiple languages is not necessary) -------------- next part -------------- An HTML attachment was scrubbed... 
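[To make the requirement described in the message above concrete, here is a minimal Python sketch of the two plain-text uses of the French "fine": inserting U+202F NARROW NO-BREAK SPACE before the tall punctuation marks and using the same character as the group separator in numbers. The function names, the regular expressions and the punctuation set are assumptions made for this illustration only, not the behaviour of any press system, layout engine or CLDR data.]

    import re

    NNBSP = "\u202F"  # NARROW NO-BREAK SPACE, the "espace fine insécable"

    def add_french_fine(text: str) -> str:
        """Insert a narrow no-break space before ; : ! ? and closing
        guillemets, and after opening guillemets (simplistic sketch)."""
        text = re.sub(r"\s*([;:!?\u00BB])", NNBSP + r"\1", text)
        text = re.sub(r"(\u00AB)\s*", r"\1" + NNBSP, text)
        return text

    def group_digits(number: int) -> str:
        """Format an integer with the narrow no-break space as group separator."""
        return f"{number:,}".replace(",", NNBSP)

    print(add_french_fine("Combien ?"))   # 'Combien\u202f?'
    print(group_digits(1234567))          # '1\u202f234\u202f567'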
URL: From unicode at unicode.org Thu Jan 17 05:31:29 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 12:31:29 +0100 Subject: NNBSP In-Reply-To: <20190117085857.33e703e5@JRWUBU2> References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <20190117085857.33e703e5@JRWUBU2> Message-ID: <6bd3d417-28c8-7133-6d6a-606ea3a590f8@orange.fr> On 17/01/2019 09:58, Richard Wordingham wrote: > > On Thu, 17 Jan 2019 04:51:57 +0100 > Marcel Schneider via Unicode wrote: > >> Also, at least one French typographer was extremely upset >> about Unicode not gathering feedback from typographers. >> That blame is partly wrong since at least one typographer >> was and still is present in WG2, and even if not being a >> Frenchman (but knowing French), as an Anglophone he might >> have been aware of the most outstanding use case of NNBSP >> with English (both British and American) quotation marks >> when a nested quotation starts or ends a quotation, where >> _“ ‘_ or _’ ”_ and _‘ “_ or _” ’_ are preferred over the >> unspaced compounds (_“‘_ or _’”_ and _‘“_ or _”’_), at >> least with proportional fonts. > > There's an alternative view that these rules should be captured by the > font and avoid the need for a spacing character. There is an example > in the OpenType documentation of the GPOS table where punctuation > characters are moved rightwards for French.

Thanks, I didn't know that this is already implemented. Sometimes one can read in discussions that the issue is relegated to the font level. That always looked utopian to me, all the more as people bringing in former typewriting expertise are trained to type spaces, and I always believed that it's a way for helpless keyboard layout designers to hand the job over. Turns out there is more to it.

But the high-end solution notwithstanding, the use of an extra space character is recommended practice: https://www.businesswritingblog.com/business_writing/2014/02/rules-for-single-quotation-marks.html

The source sums up in an overview: “_The Associated Press Stylebook_ recommends a thin space, whereas _The Gregg Reference Manual_ promotes a full space between the quotation marks. _The Chicago Manual of Style_ says no space is necessary but adds that a space or a thin space can be inserted as ‘a typographical nicety.’” The author cites three other manuals in which she could not retrieve anything about the topic.

We note that all three style guides seem completely unconcerned with non-breakability. Not so the author of the blog post: “[…] If your software moves the double quotation mark to the next line of type, use a nonbreaking space between the two marks to keep them together.” Certainly she would recommend using a NARROW NO-BREAK SPACE if only we had it on the keyboard or if the software provided a handy shortcut by default.

> > This alternative conception hits the problem that mass market Microsoft > products don't select font behaviour by language, unlike LibreOffice > and Firefox. 
(The downside is that automatic font selection may then > favour a font that declares support for the language, which gets silly > when most fonts only support that language and don't declare support.) Another drawback is that most environments don?t provide OpenType support, and that the whole scheme depends on language tags that could easily got lost, and that the issue as being particular to French would quickly boil down to dismiss support as not cost-effective, arguing that *if* some individual locale has special requirements for punctuation layout, its writers are welcome to pick an appropriate space from the UCS and key it in as desired. The same is also observed about Mongolian. Today, the preferred approach for appending suffixes is to encode a Mongolian Suffix Connector to make sure the renderer will use correct shaping, and to leave the space to the writer?s discretion. That looks indeed much better than to impose a hard space that unveiled itself as cumbersome in practice, and that is reported to often get in the way of a usable text layout. The problems related to NNBSP as encountered in Mongolian are completely absent when NNBSP is used with French punctuation or as the regular group separator in numbers. Hence I?m sure that everybody on this List agrees in discouraging changes made to the character properties of NNBSP, such as switching the line breaking class (as "GL" is non-tailorable), or changing general category to Cf, which could be detrimental to French. However we need to admit that NNBSP is basically not a Latin but a Mongolian space, despite being readily attracted into Western typography. A similar disturbance takes place in word processors, where except in Microsoft Word 2013, the NBSP is not justifying as intended and as it is on the web. It?s being hacked and hijacked despite being a bad compromise, for the purpose of French punctuation spacing. That tailoring is in turn very detrimental to Polish users, among others, who need a justifying no-break space for the purpose of prepending one-letter prepositions. Fortunately a Polish user found and shared a workaround using the string , the latter being still used in lieu of WORD JOINER as long as Word keeps unsupporting latest TUS (an issue that raised concern at Microsoft when it was reported, and will probably be fixed or has already been fixed meanwhile). > > Another spacing mess occurs with the Thai repetition mark U+0E46 THAI > CHARACTER MAIYAMOK, which is supposed to be separated from the > duplicated word by a space. I'm not sure whether this space should > expand for justification any more often than inter-letter spacing. Some > fonts have taken to including the preceding space in the character's > glyph, which messes up interoperability. An explicit space looks ugly > when the font includes the space in the repetition mark, and the lack of > an explicit space looks illiterate when the font excludes the leading > space. It seems to me that these disturbances are a case of underspecification. TUS treats U+0E46 thai character maiyamok [1] on a single line in the Thai section, while other marks are given more detailed descriptions. That wouldn?t be problematic per se as long as things are obvious. Obviously here they are not, but no attempt is made on Unicode level to fix them, the less as the encoding proposal, if it could be retrieved, probably would show that it didn?t provide any more details (otherwise Unicode would have implemented them I figure out). 
I suspect that the same holds true for French: Nobody among the relevant people at the forefront cared about making demands and specifying, so TUS authors (who anyway were ?falling like flies?) couldn?t help leaving French alone ? possibly at the discretion of a trend to lock up this key behavior inside proprietary text rendering systems (including proprietary OTF typefaces). That isn?t really what Unicode is about, the less as Latin script typically has scarce OpenType support at reach. It?s just understandable in front of disinterest and unconcernedness. At the other end, Vietnamese typographers didn?t wait for an invitation but started an ?intense lobbying? on their own behalf to get precomposed letters into the Unicode standard a long while before v1.0. Marcel [1] That?s what a copy-pasted snippet from TUS ends up as, despite my kind request about whether to set the character names in the plain text backbone to all-caps and to rather apply a resizing style. From unicode at unicode.org Thu Jan 17 05:50:08 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Thu, 17 Jan 2019 11:50:08 +0000 Subject: Encoding italic In-Reply-To: <30bf5809-e89f-9b93-c0b9-021b96e98d48@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> <30bf5809-e89f-9b93-c0b9-021b96e98d48@gmail.com> Message-ID: <26f42a06-d636-ad62-2176-189fba605bdd@it.aoyama.ac.jp> On 2019/01/17 17:51, James Kass via Unicode wrote: > > On 2019-01-17 6:27 AM, Martin J. D?rst replied: > > ... > > Based by these data points, and knowing many of the people involved, my > > description would be that decisions about what to encode as characters > > (plain text) and what to deal with on a higher layer (rich text) were > > taken with a wide and deep background, in a gradually forming industry > > consensus. > > (IMO) All of which had to deal with the existing font size limitations > of 256 characters and the need to reserve many of those for other > textual symbols as well as box drawing characters.? Cause and effect. > The computer fonts weren't designed that way *because* there was a > technical notion to create "layers".? It's the other way around.? (If > I'm not mistaken.) Most probably not. I think Asmus has already alluded to it, but in good typography, roman and italic fonts are considered separate. They are often used in sets, but it's not impossible e.g. to cut a new italic to an existing roman or the other way round. This predates any 8-bit/256 characters limitations. Also, Unicode from the start knew that it had to deal with more than 256 characters, not only for East Asia, and so I don't think such size limits were a major issue when designing Unicode. On the other hand, the idea that all Unicode characters (or a significant and as yet undetermined subset of them) would need italic,... variants definitely will have let do shooting down such ideas, in particular because Unicode started as strictly 16-bit. Regards, Martin. 
From unicode at unicode.org Thu Jan 17 06:40:51 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 Jan 2019 12:40:51 +0000 Subject: Encoding italic In-Reply-To: <26f42a06-d636-ad62-2176-189fba605bdd@it.aoyama.ac.jp> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> <30bf5809-e89f-9b93-c0b9-021b96e98d48@gmail.com> <26f42a06-d636-ad62-2176-189fba605bdd@it.aoyama.ac.jp> Message-ID: <0c239e30-b522-0cbe-effb-cf11e23bb18a@gmail.com> On 2019-01-17 11:50 AM, Martin J. D?rst wrote: > Most probably not. I think Asmus has already alluded to it, but in good > typography, roman and italic fonts are considered separate. So are Latin and Cyrillic fonts.? So are American English and Polish fonts, for that matter, even though they're both Latin based.? Times New Roman and Times New Roman Italic might be two separate font /files/ on computers, but they are the same type face. The point I was trying to make WRT 256-glyph fonts is that they pre-date Unicode and I believe much of the "layering" is based on artifacts from that era. Lead fonts were glyph based.? The technical concept of character came later. From unicode at unicode.org Thu Jan 17 07:36:32 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 14:36:32 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: On 17/01/2019 12:21, Philippe Verdy via Unicode wrote: > > [quoted mail] > > But the French "espace fine ins?cable" was requested long long before Mongolian was discussed for encodinc in the UCS. Then we should be able to read its encoding proposal in the UTC document registry, but Google Search seems unable to retrieve it, so there is a big risk that no such proposal does exist, despite the registry goes back until 1990. The only thing that searches have brought up to me is that the part of UAX?#14 that I?ve quoted in the parent thread has been added by a Unicode Technical Director not mentioned in the author field, and that he did it on request from two gentlemen whose first names only are cited. I?m sure their full names are Martin J. D?rst and Patrick Andries, but I may be wrong. I apologize for the comment I?ve made in my e?mail. Still it would be good to learn why the French use of NNBSP is sort of taken with a grain of salt, while all involved parties were knowing that this NNBSP was (as it still is) the only Unicode character ever encoded able to represent the so-long-asked-for ?espace fine ins?cable.? There is also another question I?m asking since a while: Why the character U+2008 PUNCTUATION SPACE wasn?t given the line break property value "GL" like its sibling U+2007 FIGURE SPACE? This addition to UAX #14 is dated as soon as ?2007-08-08?. Why was the Core Specification not updated in sync, but only a 7 years later? And was Unicode aware that this whitespace is hated by the industry to such an extent that a major vendor denied support in a major font at a major release of a major OS? 
Or did they wait in vain that Martin and Patrick come knocking at their door to beg for font support? Regards, Marcel > The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. Tge early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea). > This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!) > > If publishers had been involded, they would have revealed that they all needed various whitespaces for correct typography (i.e. layout). Typographs themselves did not care about whitespaces because they had no value for them (no glyph to sell). Adobe's publishing software were then completely proprietary (jsut like Microsoft and others like Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, dictionnaries. It was even mandatory to enter these [FINE] in the composed text and they trained their typists or ads sellers to use it (that character was not "sold" in classified ads, it was necessary for correct layout, notably in narrow columns, not using it confused the readers (notably for the ":" colon): it had to be non-breaking, non-expanding by justification, narrower than digits and even narrower than standard non-justified whitespace, > and was consistently used as a decimal grouping separator. > > But at that time the most common OSes did not support it natively because there was no vendor charset supporting it (and in fact most OSes were still unable to render proportional fonts everywhere and were frequently limited to 8-bit encodings (DOS, Windows, Unix(es), and even Linux at its early start). So intermediate solution was needed. Us chose not to use at all the non-breakable thin space because in English it was not needed for basic Latin, but also because of the huge prevalence of 7-bit ASCII for everything (but including its own national symbol for the "$", competing with other ISO 646 variants). There were tons of legacy applications developed ince decenials that did not support anything else and interoperability in US was available ony with ASCII, everything else was unreliable. > > If you remember the early years when the Internet started to develop outside US, you remember the nightmare of non-interoperable 8-bit charsets and the famous "mojibake" we saw everywhere. Then the competition between ISO and Unicode lasted too long. 
But it was considered "too late" for French to change anything (and Windows used in so many places by som many users promoted the use of the Windows-1252 charset (which had a few updates before it was frozen definitely: there was no place for NNBSP in it). Typographers and publishers were upset: to use the NNBSP they still needed to use proprietary *document* encodings. The W3C did not help much too (it was long to finally adopt the UCS as a mandatory component for HTML, before that it depended only on the old IANA charset database promoting only the work of vendors and a few ISO standards). > > France itself wanted to keep its own national variant of ISO 646 (inherited from telegraphic systems), but it was finally abandoned: everybody was already using windows 1252 or ISO 8859-1 (even early Unix adopters which used a preliminary version made by Digital/DEC, then promoted by X11), or otherwise used Adobe proprietary encodings. Unix itself had no standard (so many different variants including with other OSes for industrial or accounting systems, made notably by IBM,, which created so many variants, almost one for each submarket, multiple ones in the same country, each time split into mutliple variants between those based on ASCII, and those based on EBCDIC...) > > The truth is that publishers were forgotten, because their commercial market was much narrower: each publisher then used its own internal conventions. Even libaries used their own classifications. There was no attempt to unifify the needs for publishers (working at document level) and data processors (including OSes). This effort started only very late, when W3C finally started to work seriously on fixing HTML, and make it more or less interoperable with SGML (promoted by publishers). But at national level, there were still lot of other competing standards (let's remember teletext, including the Minitel terminal and Antiope for TV). People at home did not have access to any system capable of rendering proportionaly fonts. All early computers for personal use were based on fixed-width 8-bit fonts (including in Japan). China and Korea were still not technology advanced as they are today (there were some efforts but they were costly and there was little return at that time). > > The adoption of the UCS was extremely long, and it is still not competely finished even if now its support is mandatory in all new computiong standards and their revisions. The last segment where it still resists is the mobile phone industry (how can the SMS be so restricted and so much non-interoperable, and inefficient?) > > So French has a long tradition for its "fine", its support was demanded since long but constantly ignored by vendors making "the" standard. Publishers themselves resisted against the adoption of the web as a publishing platform: they prefered their legacy solutions as well, and did not care much about interoperability, so they did not pressure enough the standard makers to adopt the "fine". The same happened in US. 
There was no "commercial" incentive to adopt it and littel money coming from that sector (that has since suffered a lot from the loss of advertizing revenue, the competition of online publishers, the explosion of paper cost, but as well from the huge piracy level made on the Internet that reduced their sales and then their effective measured audience; the same is happening now on the TV and radio market; and on the Internet the adverizing market has been concentrated a lot and its revenues are less and less balanced; photographs and reporters have difficulties > now to live from their work). > > And there's little incentive now for creating quality products: so many products are developed and distributed very fast, and not enough people care about quality, or won't pay for it. The old good practives of typographs and publishers are most often ignored, they look "exotic" or "old-fashioned", and so many people say now these are "not needed" (just like they'll say that supporting multiple languages is not necessary) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 07:40:22 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 17 Jan 2019 14:40:22 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: If encoding italics means reencoding the normal linguistic usage, it is no ! We already have the nightmares caused by partial encoding of Latin and Greek (als a few Hebrew characters) for maths notations or IPA notations, but they are restricted to a well delimited scope of use and subset, and at least they have relevant scientific sources and auditors for what is needed in serious publications (Anyway these subsets may continue to evolve but very slowly). We could have exceptions added for chemical or electrical notations, if there are standard bodies supporting them. But for linguistic usage, there's no universal agreement and no single authority. Characters are added according to common use (by statistic survey, or because there are some national standard promoting them and sometimes making their use mandatory with defined meanings, sometimes legally binding). For everything else, languages are not constrained and users around the world invent their own letterforms, styles: there' no limit at all and if we start accepting such reencoding, the situation would in fact be worse in terms of interoperability ,because noone can support zillions variants if they are not explicitly encoded separately as surrounding styles, or scoping characters if needed (using contextual characters, possibly variant selectors if these variants are most often isolated). But italics encoded as varaint selectors would just pollute everything; and anyway "italic" is not a single universal convention and does not apply erqually to all scripts). The semantics attached to italic styles also varies from document to documents, and the sema semantics also have different typographic conventions depending on authors, and there's no agreed meaning bout the distinctions they encode. For this reason "italique/oblique/cursive/handwriting..." 
should remain in styles (note also that even the italic transform can be variable, it could also be later a subject of user preferences where people may want to adjust the degree or slanting, according to their reading preferences, or its orientation if they are left-handed to match how they write themselves, or if the writer is a native RTL writer; the context of use (in BiDi) may also adject this slanting orientation, e.g. inserting some Latin in Arabic could present the Latin italic letters slanted backward, to better match the slanting of Arabic itself and avoid collisions of Latin and Arabic glyphs at BiDi boundaries... One can still propose a contextual control character, but it would still be insufficient for correctly representing the many stylistic variants possible: we have better languages to do that now, and CSS (or even HTML) is better for it (including for accessibility requirements: note that there's no way to translate corretly these italics to Braille readers for example; Braille or audio readers attempt to infer an heuristic to reduce the number of contextual words or symbols they need to insert between each character, but using VSn characters would complicate that: they are already processing the standard HTML/CSS conventions to do that much more simply). direct native encoding of italic characters for lingusitic use would fail if it only covers English: it would worsen the language coverage if people are then said to remove the essential diacritics common in their language, only because of the partial coverage of their alphabet. I don't think this is worth the effort (and it would in fact cause lot of maintenance and would severely complicate the addition of new missing letters; and let's not forget the case of common ligatures, correct typograhpic features like kerning which would no longer be supported and would render ugly text if many new kerning pairs are missing in fonts, many fonts used today would no longer work properly, we would have a reduction of stylistic options and less fonts usable, and we would fall into the trap of proprietary solutions with a single provider; it would be too difficult or any font designer to start defining a usable font sellable on various market: these fonts would be reduced to niches, and would no longer find a way to be economically defined and maintained at reasonable cost. Consider the problems orthogonally: even if you use CSS/HTML styles in document encoding (rather than the plain text character encoding) you can also supply the additional semantics clearly in that document, and also encode the intent of the author, or supply enough info to permit alternate renderings (for accessibility, or for technical reasons such as small font sizes on devices will low resolution, or for people with limited vision capabilities). the same will apply to color (whose meaning is not clear, except in specific notations supported by wellknown authorities, or by a long tradition shared by many authors and kept in archives or important text corpus, such as litterature, legal, and publications that have fallen to the public domain after their ini?tial publisher disappeared and their proprietary assets were dissolved: the original documents remain as reliable sources sharable by many and which can guide the development of reuse using them as an established convention that many can now reuse without explaining them too much). 
we can repeat this argument to the other common styles : monospaced, bold, doublestruck, hollow, shadowed, 3D-like, underlining/striking/upperlining, generic subscripts and superscripts (I don't like the partial encoding of Latin letters in subscript/superscript working only for basic modern English, this is an abuse of what was defined mostly for jsut a few wellknown abbreviation or notations that have a long multilingual tradition): authors have much more freedom of creation using separate styles, encoding in an upper-layer protocol. However we can admit that for use in documents not intended to be rendered visually, but used technically, we would need some contextual control characters (just like those for BiDi when HTML/CSS is not usable): these are just needed for compatibility with technical contraints, provided that there's an application support for that and such application is not vendor-specific but sponsored by a wellknown standard (which should then be explicited in Unicode, probably by character properties, just like additional properties used for CJK characters specifying the dictionnary sources). That referenced standard should be open, readable at least by all (even if it is not republishable), and the standard body should have an open contact with the community, and regular meetings to solve incoming issues by defining some policies or the best practices, or the current "state of the art" (if research is still continuing), as well as some rules for making the transition and maintaining a good level of compatibility if this standard evolves or switches to another supported standard. Le jeu. 17 janv. 2019 ? 04:51, James Kass via Unicode a ?crit : > > Victor Gaultney wrote, > > > Treating italic like punctuation is a win for a lot of people: > > Italic Unicode encoding is a win for a lot of people regardless of > approach. Each of the listed wins remains essentially true whether > treated as punctuation, encoded atomically, or selected with VS. > > > My main point in suggesting that Unicode needs these characters is that > > italic has been used to indicate specific meaning - this text is somehow > > special - for over 400 years, and that content should be preserved in > plain > > text. > > ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf ) > > "Plain text must contain enough information to permit the text to be > rendered legibly, and nothing more." > > The argument is that italic information can be stripped yet still be > read. A persuasive argument towards encoding would need to negate that; > it would have to be shown that removing italic information results in a > loss of meaning. > > The decision makers at Unicode are familiar with italic use conventions > such as those shown in "The Chicago Manual of Style" (first published in > 1906). The question of plain-text italics has arisen before on this > list and has been quickly dismissed. > > Unicode began with the idea of standardizing existing code pages for the > exchange of computer text using a unique double-byte encoding rather > than relying on code page switching. Latin was "grandfathered" into the > standard. Nobody ever submitted a formal proposal for Basic Latin. > There was no outreach to establish contact with the user community -- > the actual people who used the script as opposed to the "computer nerds" > who grew up with ANSI limitations and subsequent ISO code pages. > Because that's how Unicode rolled back then. Unicode did what it was > supposed to do WRT Basic Latin. 
> > When someone points out that italics are used for disambiguation as well > as stress, the replies are consistent. > > "That's not what plain-text is for." "That's not how plain-text > works." "That's just styling and so should be done in rich-text." > "Since we do that in rich-text already, there's no reason to provide for > it in plain-text." "You can already hack it in plain-text by enclosing > the string with slashes." And so it goes. > > But if variant letter form information is stripped from a string like > "Jackie Brown", the primary indication that the string represents either > a person's name or a Tarantino flick title is also stripped. "Thorstein > Veblen" is either a dead economist or the name of a fictional yacht in > the Travis McGee series. And so forth. > > Computer text tradition aside, nobody seems to offer any legitimate > reason why such information isn't worthy of being preservable in > plain-text. Perhaps there isn't one. > > I'm not qualified to assess the impact of italic Unicode inclusion on > the rich-text world as mentioned by David Starner. Maybe another list > member will offer additional insight or a second opinion. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 07:57:23 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 14:57:23 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: <5908f7a7-864b-2902-e41f-d77f99b7bbaa@orange.fr>

On 17/01/2019 14:36, I wrote:
> […]
> The only thing that searches have brought up

It was actually the best thing. Here's an even more surprising hit:

    B. In the rules, allow these characters to bridge both alphabetic and numeric words, with:
       * Replace MidLetter by (MidLetter | MidNumLet)
       * Replace MidNum by (MidNum | MidNumLet)
    -------------------------
    4. In addition, the following are also sometimes used, or could be used, as numeric separators (we don't give much guidance as to the best choice in the standard):

       0020  SPACE
       00A0  NO-BREAK SPACE
       2007  FIGURE SPACE
       2008  PUNCTUATION SPACE
       2009  THIN SPACE
       202F  NARROW NO-BREAK SPACE

    If we had good reason to believe that if one of these only really occurred between digits in a single number, then we could add it. I don't have enough information to feel like a proposal for that is warranted, but others may. Short of that, we should at least document in the notes that some implementations may want to tailor MidNum to add some of these.

I fail to understand what hack is going on. Why didn't Unicode wish to sort out which one of these is the group separator?

1. SPACE: is breakable, hence exit.
2. NO-BREAK SPACE: is justifying, hence exit.
3. FIGURE SPACE: has the full width of a digit, too wide, hence exit.
4. PUNCTUATION SPACE: has been left breakable against all reason and evidence and consistency, hence exit?
5. THIN SPACE: is part of the breakable spaces series, hence exit.
6. NARROW NO-BREAK SPACE: is okay.

CLDR has been OK to fix this for French for release 34. 
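[As a concrete illustration of the tailoring suggested in that excerpt, here is a small Python sketch of a number matcher that accepts U+202F NARROW NO-BREAK SPACE as the group separator, the choice argued for just above. It is an assumption-laden example for illustration only, not CLDR or ICU code.]

    import re

    NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

    # A French-style number: groups of three digits separated by a narrow
    # no-break space, with an optional decimal part introduced by a comma.
    FRENCH_NUMBER = re.compile(rf"\d{{1,3}}(?:{NNBSP}\d{{3}})*(?:,\d+)?")

    def find_numbers(text: str) -> list:
        """Return the NNBSP-grouped numbers found in a plain-text string."""
        return FRENCH_NUMBER.findall(text)

    sample = "Prix\u202f: 1\u202f234\u202f567,89 EUR"
    print(find_numbers(sample))  # ['1\u202f234\u202f567,89']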
At present survey?35 all is questioned again, must be assessed, may impact implementations, while all other locales using space are still impacted by bad display using NO-BREAK SPACE. I know we have another public Mail List for that, but I feel it?s important to submit this to a larger community for consideration and eventually, for feedback. Thanks. Regards, Marcel P.S. For completeness: http://unicode.org/L2/L2007/07370-punct.html And also wrt my previous post: https://www.unicode.org/L2/L2007/07209-whistler-uax14.txt -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 11:35:49 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 18:35:49 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: On 17/01/2019 12:21, Philippe Verdy via Unicode wrote: > > [quoted mail] > > But the French "espace fine ins?cable" was requested long long before Mongolian was discussed for encodinc in the UCS. The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. Tge early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea). > This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!) Thank you for this insight. It is a still untold part of the history of Unicode. It seems that there was little incentive to involve typographers because they have no computer science training, and because they were feared as trying to enforce requirements that Unicode were neither able nor willing to meet, such as distinct code points for italics, bold, small caps? Among the grievances, Unicode is blamed for confusing Greek psili and dasia with comma shapes, and for misinterpreting Latin letter forms such as the u with descender taken for a turned h, and double u mistaken for a turned m, errors that subsequently misled font designers to apply misplaced serifs. 
Things were done in a hassle and a hurry, under the Damokles sword of a hostile ISO messing and menacing to unleash an unusable standard if Unicode wasn?t quicker. > > If publishers had been involded, they would have revealed that they all needed various whitespaces for correct typography (i.e. layout). Typographs themselves did not care about whitespaces because they had no value for them (no glyph to sell). Nevertheless the whole range of traditional space forms was admitted, despite they were going to be of limited usability. And they were given properties. Or can?t the misdefinition of PUNCTUATION SPACE be backtracked to that era? > Adobe's publishing software were then completely proprietary (jsut like Microsoft and others like Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, dictionnaries. It was even mandatory to enter these [FINE] in the composed text and they trained their typists or ads sellers to use it (that character was not "sold" in classified ads, it was necessary for correct layout, notably in narrow columns, not using it confused the readers (notably for the ":" colon): it had to be non-breaking, non-expanding by justification, narrower than digits and even narrower than standard non-justified whitespace, and was consistently used as a decimal grouping separator. No doubt they were confident that when an UCS is set up, such an important character wouldn?t be skipped. So confident that they never guessed that they had a key role in reviewing, in providing feedback, in lobbying. Too bad that we?re still so few people today, corporate vetters included, despite much things are still going wrong. > > But at that time the most common OSes did not support it natively because there was no vendor charset supporting it (and in fact most OSes were still unable to render proportional fonts everywhere and were frequently limited to 8-bit encodings (DOS, Windows, Unix(es), and even Linux at its early start). Was there a lack of foresightedness? Turns out that today as those characters are needed, they aren?t ready. Not even the NNBSP. Perhaps it?s the poetic ?justice of time? that since Unicode is on, the Vietnamese are the foremost, and the French the hindmost. [I?m alluding to the early lobbying of Vietnam for a comprehensive set of precomposed letters, while French wasn?t even granted to come into the benefit of the NNBSP ? that according to PRI #308 [1] is today the only known use of NNBSP outside Mongolian ? and a handful ordinal indicators (possibly along with the rest of the alphabet, except q). [1] ?The only other widely noted use for U+202F NNBSP is for representation of the thin non-breaking space (/espace fine ins?cable/) regularly seen next to certain punctuation marks in French style typography.? > So intermediate solution was needed. Us chose not to use at all the non-breakable thin space because in English it was not needed for basic Latin, but also because of the huge prevalence of 7-bit ASCII for everything (but including its own national symbol for the "$", competing with other ISO 646 variants). There were tons of legacy applications developed ince decenials that did not support anything else and interoperability in US was available ony with ASCII, everything else was unreliable. 
Probably because it wouldn?t have made much sense as long as people are unwilling to key in anything more, due to the requirement of maintaining a duplicate Alt key. > > If you remember the early years when the Internet started to develop outside US, you remember the nightmare of non-interoperable 8-bit charsets and the famous "mojibake" we saw everywhere. We can still have mojibake in Windows terminal, at least on Windows 7, and when Latin-1 is coded in UTF-8 and rendered while CP1252 is default. > Then the competition between ISO and Unicode lasted too long. But it was considered "too late" for French to change anything (and Windows used in so many places by som many users promoted the use of the Windows-1252 charset (which had a few updates before it was frozen definitely: there was no place for NNBSP in it). In the wake it could have been relegated to history. What was the plot in keeping bothering end-users with an unusable legacy encoding? > Typographers and publishers were upset: to use the NNBSP they still needed to use proprietary *document* encodings. They still needed? Why didn?t they just refuse to buy it? That would have changed the vendors? minds, I guess. > The W3C did not help much too (it was long to finally adopt the UCS as a mandatory component for HTML, before that it depended only on the old IANA charset database promoting only the work of vendors and a few ISO standards). The W3C hasn?t even defined a named entity for &nnbsp;, like the have done for ‌ Who instructed them to obstruct? > > France itself wanted to keep its own national variant of ISO 646 (inherited from telegraphic systems), but it was finally abandoned: everybody was already using windows 1252 or ISO 8859-1 (even early Unix adopters which used a preliminary version made by Digital/DEC, then promoted by X11), or otherwise used Adobe proprietary encodings. Unix itself had no standard (so many different variants including with other OSes for industrial or accounting systems, made notably by IBM,, which created so many variants, almost one for each submarket, multiple ones in the same country, each time split into mutliple variants between those based on ASCII, and those based on EBCDIC...) Was that the era when the industry wasn?t ready for 16-bit computing? What a nightmare, indeed? But today the problem is that despite that?s all over and pass?, part of the industry seems to keep bullying the NNBSP as if they didn?t want French and other languages to use it right now. > > The truth is that publishers were forgotten, because their commercial market was much narrower: each publisher then used its own internal conventions. Even libaries used their own classifications. There was no attempt to unifify the needs for publishers (working at document level) and data processors (including OSes). This effort started only very late, when W3C finally started to work seriously on fixing HTML, and make it more or less interoperable with SGML (promoted by publishers). Forgetting the publishers is really bad. Now the point is that NNBSP is not only relevant to publishers, but to every single end-user trying to write in French. > But at national level, there were still lot of other competing standards (let's remember teletext, including the Minitel terminal and Antiope for TV). People at home did not have access to any system capable of rendering proportionaly fonts. All early computers for personal use were based on fixed-width 8-bit fonts (including in Japan). 
China and Korea were still not technology advanced as they are today (there were some efforts but they were costly and there was little return at that time). Proportional fonts at home started likely with the Macintosh, IIRC. > > The adoption of the UCS was extremely long, and it is still not competely finished even if now its support is mandatory in all new computiong standards and their revisions. The last segment where it still resists is the mobile phone industry (how can the SMS be so restricted and so much non-interoperable, and inefficient?) I thought that is a limitation proper to the type of cellphone I?m using. > > So French has a long tradition for its "fine", its support was demanded since long but constantly ignored by vendors making "the" standard. So here we have it. The need for NNBSP was ignored by UTC? I?m already fearing that UTC instructed CLDR TC to roll back the NNBSP instead of completing its implementation. Not every company has a principled house policy about doing no evil. All my suspicions about lawless lobbying and malicious marketing are hereby confirmed. That?s driving me mad. I need to stop posting to this list, and mind my business. > Publishers themselves resisted against the adoption of the web as a publishing platform: they prefered their legacy solutions as well, and did not care much about interoperability, so they did not pressure enough the standard makers to adopt the "fine". The same happened in US. There was no "commercial" incentive to adopt it and littel money coming from that sector (that has since suffered a lot from the loss of advertizing revenue, the competition of online publishers, the explosion of paper cost, but as well from the huge piracy level made on the Internet that reduced their sales and then their effective measured audience; the same is happening now on the TV and radio market; and on the Internet the adverizing market has been concentrated a lot and its revenues are less and less balanced; photographs and reporters have difficulties now to live from their work). > > And there's little incentive now for creating quality products: so many products are developed and distributed very fast, and not enough people care about quality, or won't pay for it. The old good practives of typographs and publishers are most often ignored, they look "exotic" or "old-fashioned", and so many people say now these are "not needed" (just like they'll say that supporting multiple languages is not necessary) If the users you?re referring to don?t deserve the right to type in their language?s interoperable representation, there?s no hope. You?re talking about a fringe that is generating part of the information feed on social media. The overwhelming majority of end-users are full of good will, and are very learned people. Like education is set up against illiteracy, fighting in-typography is a matter of training. There?s a mass of fine blogs out there. What may remain to do is only adding to it. Many thanks to Philippe Verdy for this valuable feedback. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Jan 17 12:50:55 2019 From: unicode at unicode.org (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?= via Unicode) Date: Thu, 17 Jan 2019 19:50:55 +0100 Subject: wws dot org In-Reply-To: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Message-ID: <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 13:06:48 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 17 Jan 2019 11:06:48 -0800 Subject: NNBSP In-Reply-To: References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 13:11:40 2019 From: unicode at unicode.org (=?utf-8?B?5qKB5rW3IExpYW5nIEhhaQ==?= via Unicode) Date: Thu, 17 Jan 2019 11:11:40 -0800 Subject: NNBSP In-Reply-To: <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> Message-ID: [Just a quick note to everyone that, I?ve just subscribed to this public list, and will look into this ongoing Mongolian-related discussion once I?ve mentally recovered from this week?s UTC stress. :)] Best, ?? Liang Hai https://lianghai.github.io > On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode wrote: > > On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote: >>> [quoted mail] >>> >>> But the French "espace fine ins?cable" was requested long long before Mongolian was discussed for encodinc in the UCS. The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. Tge early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea). >>> This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). 
Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!) >> >> Thank you for this insight. It is a still untold part of the history of Unicode. > This historical summary does not square in key points with my own recollection (I was there). I would therefore not rely on it as if gospel truth. > > In particular, one of the key technologies that brought industry partners to cooperate around Unicode was font technology, in particular the development of the TrueType Standard. I find it not credible that no typographers were part of that project :). > > Covering existing character sets (National, International and Industry) was an (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion). > > The statement: "there was initially no desire to encode all the languages and scripts" is categorically false. > > (Incidentally, Unicode does not "encode languages" - no character encoding does). > > What has some resemblance of truth is that the understanding of how best to encode whitespace evolved over time. For a long time, there was a confusion whether spaces of different width were simply digital representations of various metal blanks used in hot metal typography to lay out text. As the placement of these was largely handled by the typesetter, not the author, it was felt that they would be better modeled by variable spacing applied mechanically during layout, such as applying indents or justification. > > Gradually it became better understood that there was a second use for these: there are situations where some elements of running text have a gap of a specific width between them, such as a figure space, which is better treated like a character under authors or numeric formatting control than something that gets automatically inserted during layout and rendering. > > Other spaces were found best modeled with a minimal width, subject to expansion during layout if needed. > > > > There is a wide range of typographical quality in printed publication. The late '70s and '80s saw many books published by direct photomechanical reproduction of typescripts. These represent perhaps the bottom end of the quality scale: they did not implement many fine typographical details and their prevalence among technical literature may have impeded the understanding of what character encoding support would be needed for true fine typography. At the same time, Donald Knuth was refining TeX to restore high quality digital typography, initially for mathematics. > > However, TeX did not have an underlying character encoding; it was using a completely different model mediating between source data and final output. (And it did not know anything about typography for other writing systems). > > Therefore, it is not surprising that it took a while and a few false starts to get the encoding model correct for space characters. > > Hopefully, well complete our understanding and resolve the remaining issues. > > A./ > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
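[As a small illustration of the point made above about the figure space being under the author's or numeric-formatting control rather than inserted by the layout engine, the sketch below pads amounts with U+2007 FIGURE SPACE so that a column of figures stays aligned even in proportional type. The helper name and the fixed column width are assumptions of this example.]

    FIGURE_SPACE = "\u2007"  # FIGURE SPACE: digit-wide and non-breaking

    def pad_amount(amount: str, width: int = 8) -> str:
        """Right-align a numeric string by padding with FIGURE SPACE, so a
        column of figures lines up even when set in a proportional font."""
        return FIGURE_SPACE * max(0, width - len(amount)) + amount

    for value in ("7", "42", "1250"):
        print(pad_amount(value))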
URL: From unicode at unicode.org Thu Jan 17 14:21:03 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 Jan 2019 20:21:03 +0000 Subject: NNBSP In-Reply-To: References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: <20190117202103.30162bf4@JRWUBU2> On Thu, 17 Jan 2019 18:35:49 +0100 Marcel Schneider via Unicode wrote: > Among the grievances, Unicode is blamed for confusing Greek psili and > dasia with comma shapes, and for misinterpreting Latin letter forms > such as the u with descender taken for a turned h, and double u > mistaken for a turned m, errors that subsequently misled font > designers to apply misplaced serifs. And I suppose that the influence was so great that it travelled back in time to 1976, affecting the typography of the Pelican book 'Phonetics' as reprinted in 1976. Those IPA characters originated in a tradition where new characters had been derived by rotating other characters so as to avoid having to have new type cut. Misplaced serifs appear to be original. Richard. From unicode at unicode.org Thu Jan 17 16:56:27 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 23:56:27 +0100 Subject: NNBSP In-Reply-To: <20190117202103.30162bf4@JRWUBU2> References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <20190117202103.30162bf4@JRWUBU2> Message-ID: <19e42ead-284a-6527-2546-9b5fe2baf9cd@orange.fr> On 17/01/2019 21:21, Richard Wordingham via Unicode wrote: > > On Thu, 17 Jan 2019 18:35:49 +0100 > Marcel Schneider via Unicode wrote: > > >> Among the grievances, Unicode is blamed for confusing Greek psili and >> dasia with comma shapes, and for misinterpreting Latin letter forms >> such as the u with descender taken for a turned h, and double u >> mistaken for a turned m, errors that subsequently misled font >> designers to apply misplaced serifs. > > And I suppose that the influence was so great that it travelled back in > time to 1976, affecting the typography of the Pelican book 'Phonetics' > as reprinted in 1976. > > Those IPA characters originated in a tradition where new characters had > been derived by rotating other characters so as to avoid having to have > new type cut. Misplaced serifs appear to be original. I merely reported what had been brought up by O. Randier [1]. Thanks for shedding the right light on these issues. The paper comes out diminished, definitely full of errors. This confirms a trend to criticize Unicode instead of cooperating. That would be enough of an explanation why UTC is not ready to make any gifts to French, neither up to now nor after my interoperable-representation-whining. The most that French could get was granted to the Canadian French and only thanks to Patrick Andries and to Martin J. D?rst. I?d like to thank these gentlemen and Ken Whistler who lent an ear and was ready to add a mention in UAX #14. 
That is probably the most that the French language can expect, because it ultimately may not deserve any more, due to French wrongdoing throughout history and fresh in memory after the terrorist attack against Greenpeace. The moral strength needed for a lobbying effort was gone, and the most that people could do was to be upset when NNBSP stayed missing, as Philippe Verdy reported, but not take any action. It wasn't until the Canadian French Patrick Andries asked for a small concession based on what falls off from Mongolian, ending up in General Punctuation due to the foresight of the UTC, that Unicode started supporting French, in a merciful gesture granted through the service door in the backyard. Now I'm likely to be scared into silence, deeply ashamed. (But I'm committed to keep on the job.)

Nothing happens, or does not happen, without a good reason. Finding out that reason is key to recovery. If we want to get what we need, we must do our homework first. Thanks for helping bring it to the point.

Kind regards,

Marcel

[1] http://www.cairn.info/article.php?ID_REVUE=DN&ID_NUMPUBLIE=DN_063&ID_ARTICLE=DN_063_0089&FRM=B#pa29

From unicode at unicode.org Thu Jan 17 17:44:50 2019 From: unicode at unicode.org (=?utf-8?B?IkouwqBTLiBDaG9pIg==?= via Unicode) Date: Thu, 17 Jan 2019 18:44:50 -0500 Subject: Loose character-name matching Message-ID: <60797095-B703-4770-8F85-F045DDED4431@icloud.com>

I'm implementing a Unicode names library. I'm confused about loose character-name matching, even after rereading The Unicode Standard § 4.8, UAX #34 § 4, and UAX #44 § 5.9.2, as well as [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt), [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and the [meeting in which those two items were resolved](https://www.unicode.org/L2/L2014/14026.htm).

In particular, I'm confused by the claim in The Unicode Standard § 4.8 saying, "Because Unicode character names do not contain any underscore ('_') characters, a common strategy is to replace any hyphen-minus or space in a character name by a single '_' when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages. Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching." I'm also confused by the relationship between UAX34-R3 and UAX44-LM2.

To make these issues concrete, let's say that my library provides a function called getCharacter that takes a name argument, tries to find a loosely matching character, and then returns it (or a null value if there is no currently loosely matching character). So then what should the following expressions return?

getCharacter("HANGUL-JUNGSEONG-O-E")
getCharacter("HANGUL_JUNGSEONG_O_E")
getCharacter("HANGUL_JUNGSEONG_O_E_")
getCharacter("HANGUL_JUNGSEONG_O__E")
getCharacter("HANGUL_JUNGSEONG_O_-E")
getCharacter("HANGUL JUNGSEONGCHARACTERO E")
getCharacter("HANGUL JUNGSEONG CHARACTER OE")
getCharacter("TIBETAN_LETTER_A")
getCharacter("TIBETAN_LETTER__A")
getCharacter("TIBETAN_LETTER _A")
getCharacter("TIBETAN_LETTER_-A")

Thanks,
J. S. Choi
-------------- next part --------------
An HTML attachment was scrubbed...
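For concreteness, a rough Python sketch of the folding that UAX44-LM2 seems to call for (an illustration of the rule as read above, not a conformance-tested implementation; the function names are invented, and the U+1180 HANGUL JUNGSEONG O-E exception is only flagged in a comment rather than handled):

    import unicodedata

    def loose_fold(name):
        # Fold per the spirit of UAX44-LM2: uppercase, drop whitespace and
        # underscores, and drop *medial* hyphens.  A hyphen counts as medial
        # only when the characters on both sides, in the original spelling,
        # are neither spaces/underscores nor string boundaries.
        # (UAX #44 carves out U+1180 HANGUL JUNGSEONG O-E as the one name
        # whose hyphen must be kept; this sketch does not special-case it.)
        s = name.upper()
        out = []
        for i, ch in enumerate(s):
            if ch in " _":
                continue
            if ch == "-":
                medial = (0 < i < len(s) - 1
                          and s[i - 1] not in " _"
                          and s[i + 1] not in " _")
                if medial:
                    continue
            out.append(ch)
        return "".join(out)

    def get_character(name):
        # Illustrative only: linear scan over all code points; a real
        # library would precompute an index of folded names.
        target = loose_fold(name)
        for cp in range(0x110000):
            try:
                if loose_fold(unicodedata.name(chr(cp))) == target:
                    return chr(cp)
            except ValueError:
                continue
        return None

    # loose_fold("TIBETAN LETTER A")  -> "TIBETANLETTERA"
    # loose_fold("TIBETAN LETTER -A") -> "TIBETANLETTER-A"
    # (the second hyphen is kept: it follows a space, so it is not medial)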
URL: From unicode at unicode.org Thu Jan 17 18:04:11 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 18 Jan 2019 00:04:11 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <50524911-3be8-e307-2249-c7a7eb47f6ca@gmail.com> For web searching, using the math-string ?????????????? ???????????? as the keywords finds John Maynard Keynes in web pages.? Tested this in both Google and DuckDuckGo.? Seems like search engines are accomodating actual user practices.? This suggests that social media data is possibly already being processed for the benefit of the users (and future historians) by software people who care about such things. From unicode at unicode.org Fri Jan 18 02:11:49 2019 From: unicode at unicode.org (Johannes Bergerhausen via Unicode) Date: Fri, 18 Jan 2019 09:11:49 +0100 Subject: wws dot org In-Reply-To: <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> Message-ID: <3277BE48-2A5A-4060-B6F5-1A0B3566CC78@bergerhausen.com> Thanks a lot for this input! We?ll check this with Deborah Anderson from SEI Berkeley. The update of the web site to Unicode 12.0 will be an opportunity to make some corrections. All the best, Johannes > Am 17.01.2019 um 19:50 schrieb Fr?d?ric Grosshans : > > Thanks for this nice website ! > > Some feedback: > > Given the number of scripts in this period, I think that splitting 10c-19c in two (or even three) would be a good idea > A finer unicode status would be nice > > Coptic is listed as European, while, I think it is Africac, (even if a member of the LGC (LAtin-Greek-Cyrillic) family since, to my knowledge, it has only be used in Africa for African llanguages (Coptic and Old Nubian). > Coptic still used for religious purpose today. Why to you write it dead in the 14th century ? > Khitan Small Script: According to Wikipedia, it ?was invented in about 924 or 925 CE?, not 920 (that is the date of the Khitan Large Script > Cyrillic I think its birth date is 890s, slightly more precice than the 10c you write > You include two well known Tolkienian scripts (Cirth and Tengwar), but you omit the third (first ?) one, the Sarati (see e.g. http://at.mansbjorkman.net/sarati.htm andhttps://en.wikipedia.org > On a side note, you the site considers visible speech as a living-script, which surprised be. This information is indeed in the Wikipedia infobox and implied by its ?HMA status? on the Berkeley SEI page, but the text of the wikipedia page says ?However, although heavily promoted [...] in 1880, after a period of a dozen years or so in which it was applied to the education of the deaf, Visible Speech was found to be more cumbersome [...] compared to other methods, and eventually faded from use.? > > My (cursory) research failed to show a more recent date than this for the system than this ?dosen of year or so [past 1880]? . Is there any indication of the system to be used later? (say, any date in the 20th century) > > > All the best, > > > Fr?d?ric > Le 15/01/2019 ? 19:22, Johannes Bergerhausen via Unicode a ?crit : >> Dear list, >> >> I am happy to report that www.worldswritingsystems.org is now online. >> >> The web site is a joint venture by >> >> ? Institut Designlabor Gutenberg (IDG), Mainz, Germany, >> ? Atelier National de Recherche Typographique (ANRT), Nancy, France and >> ? 
Script Encoding Initiative (SEI), Berkeley, USA. >> >> For every known script, we researched and designed a reference glyph. >> >> You can sort these 292 scripts by Time, Region, Name, Unicode version and Status. >> Exactly half of them (146) are already encoded in Unicode. >> >> Here you can find more about the project: >> www.youtube.com/watch?v=CHh2Ww_bdyQ >> >> And is a link to see the poster: >> https://shop.designinmainz.de/produkt/the-worlds-writing-systems-poster/ >> >> All the best, >> Johannes >> >> >> >> >> ? Prof. Bergerhausen >> Hochschule Mainz, School of Design, Germany >> www.designinmainz.de >> www.decodeunicode.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 06:56:05 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 18 Jan 2019 07:56:05 -0500 Subject: wws dot org In-Reply-To: <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> Message-ID: <568fbd33-e203-2459-f3db-d5d1d986f673@kli.org> On 1/17/19 1:50 PM, Fr?d?ric Grosshans via Unicode wrote: > > On a side note, you the site considers visible speech as a > living-script, which surprised be. This information is indeed in the > Wikipedia infobox and implied by its ?HMA status? on the Berkeley SEI > page, but the text of the wikipedia page says ?However, although > heavily promoted [...] in 1880, after a period of a dozen years or so > in which it was applied to the education of the deaf, Visible Speech > was found to be more cumbersome [...] compared to other methods,and > eventually faded from use.? > > My (cursory) research failed to show a more recent date than this for > the system than this ?dosen of year or so [past 1880]? . Is there any > indication of the system to be used later? (say, any date in the 20th > century) > I just got email a few days ago from someone who wants to use it on an album cover... But on the whole I think you are correct; I have not seen much use or even study of it (outside of my own and a very few others) in recent times.? And I *still* have to submit a proposal for it to be included in Unicode. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 09:27:17 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 18 Jan 2019 16:27:17 +0100 Subject: NNBSP In-Reply-To: References: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> Message-ID: <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> On 17/01/2019 20:11, ?? Liang Hai via Unicode wrote: > [Just a quick note to everyone that, I?ve just subscribed to this public list, and will look into this ongoing Mongolian-related discussion once I?ve mentally recovered from this week?s UTC stress. :)] Welcome to Unicode Public. Hopefully this discussion helps sort things out so that we?ll know both what to do wrt Mongolian and what to do wrt French. 
On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode wrote:

> On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:
>> [On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:]
>>>
>>> [quoted mail]
>>>
>>> But the French "espace fine insécable" was requested long long before Mongolian was discussed for encoding in the UCS. The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. The early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea).
>>> This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!)
>> Thank you for this insight. It is a still untold part of the history of Unicode.
> This historical summary does *not* square in key points with my own recollection (I was there). I would therefore not rely on it as if gospel truth.
>
> In particular, one of the key technologies that _brought industry partners to cooperate around Unicode_ was font technology, in particular the development of the /TrueType/ standard. I find it not credible that no typographers were part of that project :).
>
It is probably part of the (unintentional) fake blames spread by the cited author's paper. My apologies for not sufficiently assessing the reliability of my sources. I'd already identified a number of errors but wasn't savvy enough to see the other one reported by Richard Wordingham. Now the paper ends up as a mere libel. It doesn't mention the lack of NNBSP; instead it piles up a bunch of gratuitous calumnies. Should that be the prevailing mood of average French professionals with respect to Unicode (indeed Patrick Andries is the only French tech writer on Unicode I found whose work is acclaimed; the others are either disliked or silent, or libellers), then I understand only better why a significant majority of UTC is hating French.

Francophobia is also palpable in Canada, beyond any technical reasons, especially in the IT industry. Hence the position of UTC is far from isolated. If ethical and personal considerations inflect decision-making, they should consistently be an integral part of discussions here. In that vein, I'd mention that at the time Unicode was developed, there was a global hatred against France, which originated in French colonial and foreign politics since WWII, and was revived a few years ago by the French government sinking the 𝑅𝑎𝑖𝑛𝑏𝑜𝑤 𝑊𝑎𝑟𝑟𝑖𝑜𝑟 and killing the crew's photographer, in the port of Auckland. That crime triggered a peak of anger.
> > Covering existing character sets (National, International and Industry) was _an_ (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion).
>
I'd take this as a touchstone to infer that there were actual data files including standard typographic spaces as encoded in U+2000..U+2006, and electronic table layout using these: "U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period." Is that correct?

> > The statement: "there was initially no desire to encode all the languages and scripts" is categorically false.
>
Though Unicode was designed as being limited to 65,000 characters, and it was stated that historic scripts were out of scope: only living scripts should be encoded, for interchange.

> > (Incidentally, Unicode does not "encode languages" - no character encoding does).
>
In an often used sense every "language" has its "alphabet", although one does not currently refer to Latin as multiple scripts.

> > What has some resemblance of truth is that the understanding of how best to encode whitespace evolved over time. For a long time, there was a confusion whether spaces of different width were simply digital representations of various metal blanks used in hot metal typography to lay out text. As the placement of these was largely handled by the typesetter, not the author, it was felt that they would be better modeled by variable spacing applied mechanically during layout, such as applying indents or justification.
>
Indeed it is stated that the multiple typographic spaces that made it into the Standard were not used in electronic typesetting and layout.

> > Gradually it became better understood that there was a second use for these: there are situations where some elements of running text have a gap of a specific width between them, such as a figure space, which is better treated like a character under author or numeric-formatting control than something that gets automatically inserted during layout and rendering.
>
There seems to be a confusion about the figure space. What is this space really for?

* The Unicode Standard hints that it was used to fill up empty positions in numeric tables.
* The Unicode Line Break Algorithm UAX #14 understands that it is the group separator, although as such it is neither SI- nor ISO 80000-conformant, nor is it implemented in CLDR. (Fortunately it is not, given it isn't SI/ISO compliant, but it would have been a better pick than NBSP, because unlike NBSP, it is not justifying.)

As you were there, did you see or hear how it happened that, well, FIGURE SPACE (U+2007) was declared non-breakable, and how it happened that at the same time, PUNCTUATION SPACE (U+2008) was not declared non-breakable? Hint: Was it understood (certainly it was) that a non-breakable PUNCTUATION SPACE would have been the "espace fine insécable" (narrow no-break space) that the French users of character sets were languishing after?

> Other spaces were found best modeled with a minimal width, subject to expansion during layout if needed.
>
> There is a wide range of typographical quality in printed publication. The late '70s and '80s saw many books published by direct photomechanical reproduction of typescripts.
These represent perhaps the bottom end of the quality scale: they did not implement many fine typographical details and their prevalence among technical literature may have impeded the understanding of what character encoding support would be needed for true fine typography.
>
By that time, electronic typewriters became widespread, featuring interchangeable fonts (on type wheels), proportional advance width (for use with appropriate fonts), and bold weight (by double-typing with a tiny offset). Additionally, some models had an input buffer with a linear LCD display, mitigating the expense of correction ribbon as typewriters became more and more popular. With ordinary typewriter spacing, the narrow space was not in demand, but with proportional advance width that could have changed. Do you remember the ratio of fixed width to proportional width in the photomechanically reproduced printed matter you are referring to? How were typewriters with proportional width shaping the perception of typography in general, and of whitespace in particular, among the authors of Unicode?

*Fine typography:* There is a current misunderstanding of "fine typography" with respect to the NARROW NO-BREAK SPACE. The use of this character **is not** part of fine typography. It is simply part of the ordinary digital representation of the French language. To declare NNBSP as belonging to "fine typography" is to make it optional. In French and in languages grouping digits with spaces, *NNBSP* is not optional, it *is mandatory*. In the current state of Unicode, NNBSP is the only usable space for the purpose of grouping digits and of spacing off French punctuation (except some old-style French layout of the colon). That space would be *PSP* (PUNCTUATION SPACE) **if** Unicode had made it non-breakable. In that case, the *MONGOLIAN SPACE (MSP) would eventually have been encoded, or rather the *MONGOLIAN SUFFIX CONNECTOR (MSC), for the purpose of particular shaping. If the *MONGOLIAN SPACE had actually been encoded, it would be tailorable ad libitum, and Unicode could change its properties as desired (referring to a proposed change of General category of NNBSP from Zs to Cf, and/or of line-breaking class from GL to BB, IIRC).
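To make the "ordinary digital representation" point concrete, here is a small Python sketch of the spacing in question, inserting a narrow no-break space before the French high punctuation marks and inside guillemets (house rules vary; this is only an illustration, not a normative rule set, and the full NBSP before the colon reflects just one common convention):

    import re

    NNBSP = "\u202F"   # NARROW NO-BREAK SPACE
    NBSP = "\u00A0"    # NO-BREAK SPACE

    def space_french_punctuation(text):
        # narrow no-break space before ; ! ? and a closing guillemet,
        # and after an opening guillemet
        text = re.sub(r"\s*([;!?»])", NNBSP + r"\1", text)
        text = re.sub(r"(«)\s*", r"\1" + NNBSP, text)
        # the colon is often set off with a full, justifying no-break space
        text = re.sub(r"\s*(:)", NBSP + r"\1", text)
        return text

    print(space_french_punctuation("Il a dit « Bonjour ! » ; vraiment ?"))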
> > Hopefully, well complete our understanding and resolve the remaining issues. > > A./ > That is a great promise. Hopefully you are being backed by UTC in making it! Best regards, Marcel P. S.: The name of the Greenpeace flagship has been typeset in italics thanks to Andrew West?s online utility, [1] in respectfulness towards the organization, and with implicit reference to parent and sibling threads. //Please don?t interpret this gesture as backing demands for Unicode representation of italics.// We?re (at least I?m) actually trying to understand more in detail why UTC is struggling against NNBSP as a space (thinking at changing its Gc to Cf), while at encoding time, UTC prompted Mongolian OPs to refrain from requesting a dedicated Mongolian Space rather than shifting the new space into General Punctuation for other scripts? joint convenience. Admittedly, French has been the only script to make extensive use of it [2] ? a highly partial impression given many many other locales are using a space to group digits, and that space is then mandatorily NNBSP; anything else being highly unprofessional. So we?ll look even harder at the new TUS text wrt NNBSP in Mongolian, that Richard Wordingham draw our attention to, and we?d like to understand the role of UTC acting in favor or against NNBSP, possibly with various antagonistic components within UTC. [1] http://www.babelstone.co.uk/Unicode/text.html [2] http://www.unicode.org/review/pri308/pri308-background.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 09:51:18 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 18 Jan 2019 10:51:18 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote: > > Encoding 'begin italic' and 'end italic' would introduce difficulties > when partial strings are moved, etc. But that's no different than with > current punctuation. If you select the second half of a string that > includes an end quote character you end up with a mismatched pair, > with the same problems of interpretation as selecting the second half > of a string including an 'end italic' character. Apps have to deal > with it, and do, as in code editors. > It kinda IS different.? If you paste in half a string, you get a mismatched or unmatched paren or quote or something.? A typo, but a transient one.? It looks bad where it is, but everything else is unaffected.? It's no worse than hitting an extra key by mistake. If you paste in a "begin italic" and miss the "end italic", though, then *all* your text from that point on is affected!? (Or maybe "all until a newline" or some other stopgap ending, but that's just damage-control, not damage-prevention.)? Suddenly, letters and symbols five words/lines/paragraphs/pages look different, the pagination is all altered (by far more than merely a single extra punctuation mark, since italic fonts generally are narrower than roman).? It's a disaster. No.? This kind of statefulness really is beyond what Unicode is designed to cope with.? Bidi controls are (almost?) the sole exception, and even they cause their share of headaches.? Encoding separate _text_ italics/bold is IMO also a disastrous idea, but I'm not putting out reasons for that now.? 
The only really feasible suggestion I've heard is using a VS in some fashion. (Maybe let it affect whole words instead of individual characters?? Makes for fewer noisy VSs, but introduces a whole other host of limitations (how to italicize part of a word, how to italicize non-letters...) and is also just damage-control, though stronger.) > Apps (and font makers) can also choose how to deal with presenting > strings of text that are marked as italic. They can choose to present > visual symbols to indicate begin/end, such as /this/. Or they can > present it using the italic variant of the font, if available. > At which point, you have invented markdown.? Instead of making Unicode declare it, just push for vendors everywhere to recognize /such notation/ as italics (OK, I know, you want dedicated characters for it which can't be confused for anything else.) > - Those who develop plain text apps (social media in particular) don't > have to build in a whole markup/markdown layer into their apps > With the complexity of writing an social media app, a markup layer is really the least of the concerns when it comes to simplifying. > > - Misuse of math chars for pseudo-italic would likely disappear > > - The text runs between markers remain intact, so they need no special > treatment in searching, selecting, etc. > > - It finally, and conclusively, would end the decades of the mess in > HTML that surrounds and . > Adding _another_ solution to something will *never* "conclusively end" anything.? On a good day, you can hope it will swamp the others, but they'll remain at least in legacy.? More likely, it will just add one more way to be confused and another side to the mess.? (People have pointed out here about the difficulties of distinguishing or not-distinguishing between HTML-level and putative plain-text italics.? And yes, that is an issue, and one that already exists with styling that can change case and such.? As with anything, the question is not whether there are going to be problems, but how those problems weigh against potential benefits.? That's an open question.) > My main point in suggesting that Unicode needs these characters is > that italic has been used to indicate specific meaning - this text is > somehow special - for over 400 years, and that content should be > preserved in plain text. > There is something to this: people have been *emphasizing* text in some fashion or another for ages.? There is room to call this plain text. ~mark From unicode at unicode.org Fri Jan 18 09:58:59 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 18 Jan 2019 10:58:59 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <59a8901e-3f7e-25c6-96da-b3d290df96e4@kli.org> On 1/16/19 7:16 AM, Andrew Cunningham via Unicode wrote: > HI Victor, an off list reply. The contents are just random thoughts > sparked by an interesting conversation. > > On Wed, 16 Jan 2019 at 22:44, Victor Gaultney via Unicode > > wrote: > > > - It finally, and conclusively, would end the decades of the mess > in HTML that surrounds and . > > > I am not sure that would fix the issue, more likely compound the issue > making it even more blurry what the semantic purpose is. HTML5 make > both and semantic ... and by the definition the style of the > elements is not necessarily italic. 
for instance would be script > dependant, may be partially script dependant when another > appropriate semantic tag is missing. A character/encoding level > distinction is just going to compound the mess. A good point, too.? While italics are being used sort of as an example, what the "evidence" really is for (and by evidence I mean what I alluded to at the end of my last post, over centuries of writing) is that people like to *emphasize* things from time to time.? It's really more the semantic side of "this text should be read louder."? So not so much "italic marker" but "emphasis marker." But... that ignores some other points made here, about specific meanings attached to italics (or underlining, in some settings), like distinguishing book or movie titles (or vessel names) from common or proper nouns.? Is it better to lump those with emphasis as "italic", or better to distinguish them semantically, as "emphasis marker" vs "title marker"?? And if we did the latter, would ordinary folks know or care to make that distinction?? I tend to doubt it. > My main point in suggesting that Unicode needs these characters is > that italic has been used to indicate specific meaning - this text > is somehow special - for over 400 years, and that content should > be preserved in plain text. > > > Underlying, bold text, interletter spacing, colour change, font style > change all are used to apply meaning in various ways. Not sure why > italic is special in this sense. Additionally without encoding the > meaning of italic, all you know is that it is italic, not what > convention of semantic meaning lies behind it. Um... yeah.? That's what I meant, also. > > And I am curious on your thoughts, if we distinguish italic in > Unicode, encode some way of spacifying italic text, wouldn't it make > more sense to do away with italic fonts all together? and just roll > the italic glyphs into the regular font? Eh.? Fonts are not really relevant to this.? Unicode already has more characters than you can put into a single font.? It's just as sensible, still, to have italic fonts and switch to them, just like you have to switch to your Thai font when you hit Thai text that your default font doesn't support.? (However, this knocks out the simplicity of using OpenType to handle it, as has been suggested.) ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 10:12:58 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 18 Jan 2019 11:12:58 -0500 Subject: Encoding italic In-Reply-To: <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> Message-ID: <85665441-b4cf-d911-c2c4-98b994546267@kli.org> On 1/17/19 1:27 AM, Martin J. D?rst via Unicode wrote: > > This lead to the layering we have now: Case distinctions at the > character level, but style distinctions at the rich text level. Any good > technology has layers, and it makes a lot of sense to keep established > layers unless some serious problem is discovered. The fact that Twitter > (currently) doesn't allow styled text and that there is a small number > of people who (mis)use Math alphabets for writing italics,... on Twitter > doesn't look like a serious problem to me. How small a number?? How big?? I don't know either.? 
To mention Second Life again, which is pretty strongly defensible as a plain-text environment (with some exceptions, as for hyperlinks), I note that the viewers for it (and the servers?) don't seem to support Unicode characters outside of the BMP.? Which leads the flip-side of the "gappy" mathematical alphabets: you can say SOME things in italic or fraktur or double-struck... but only if they have the correct few letters that happen to be in the BMP already. Obviously, this can and should be blamed on incomplete Unicode support by the software vendors, but it still matters in the same way that "incomplete" markup support (i.e. none) matters to Twitter users: people make do with what they have, and will (mis)use even the few characters they can, though that leads to odd situations (see earlier list of display names.) ~mark From unicode at unicode.org Fri Jan 18 10:44:00 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Fri, 18 Jan 2019 16:44:00 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> Message-ID: <3f2ae1a9.c1d3.16861d90bc2.Webtop.229@btinternet.com> Mark E. Shoulson wrote: > ?, since italic fonts generally are narrower than roman). I remember reading years ago that that was why italic type was invented in the first place in the fifteenth century, so that more text could be got into small format books that could conveniently be carried around. That is, used for all of the text of a book. So not invented for expressing emphasis. The only modern use of all italics text that I can remember seeing in printed books is when poems are typeset in italics. William Overington Friday 18 January 2019 From unicode at unicode.org Fri Jan 18 12:02:45 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 18 Jan 2019 10:02:45 -0800 Subject: NNBSP In-Reply-To: <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> References: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> Message-ID: <6acfa6d0-3f10-db86-eb63-50160c6276b7@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 12:20:22 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 18 Jan 2019 10:20:22 -0800 Subject: NNBSP In-Reply-To: <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> References: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> Message-ID: <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Jan 18 13:09:48 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 18 Jan 2019 11:09:48 -0800 Subject: NNBSP In-Reply-To: <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> References: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> Message-ID: <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 13:18:10 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 18 Jan 2019 11:18:10 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> Message-ID: <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 13:33:14 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 18 Jan 2019 20:33:14 +0100 Subject: NNBSP In-Reply-To: <6acfa6d0-3f10-db86-eb63-50160c6276b7@ix.netcom.com> References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <6acfa6d0-3f10-db86-eb63-50160c6276b7@ix.netcom.com> Message-ID: <623def57-deca-e316-47de-a303484faa72@orange.fr> On 18/01/2019 19:02, Asmus Freytag via Unicode wrote: > On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote: >> ....I understand only better why a significant majority of UTC is hating French. >> >> Francophobia is also palpable in Canada, beyond any technical reasons, especially in the IT industry. Hence the position of UTC is far from isolated. If ethic and personal considerations inflect decision-making, they should consistently be an integral part of discussions here. In that vein, I?d mention that by the time when Unicode was developed, there was a global hatred against France, that originated in French colonial and foreign politics since WWII, and was revived a few years ago by the French government sinking ????????????????????????????? and killing the crew?s photographer, in the port of Auckland. That crime triggered a peak of anger. > > Again, my recollections do *not support* any issues of _Francophobia_. > > The Unicode Technical committee has always had French people on board, from the beginning, and I have witnessed no issues where they took up a different technical position based on language. Quite the opposite, the UTC generally appreciates when someone can provide native insights into the requirements for supporting a given language. How best to realize these requirements then becomes a joint effort. > > If anything, the Unicode Consortium saw itself from the beginning in contrast to an IT culture for which internationalization at times was still something of an afterthought. > > Given all that, I find your suggestions and? 
implications deeply hurtful and hope you will find a way to avoid a repetition in the future. > > May I suggest that trying to rake over the past and apportion blame is generally less productive than _moving forward _and addressing the outstanding problems. > It is my last-resort track that I?m deeply convinced of. But I?m thankfully eased by not needing to discuss it here further. To point a well-founded behavior is not to blame. You?ll note that I carefully founded how UTC was right in doing so if they did. I wasn?t aware that I was hurtful. You tell me, so I apologize. Please note, though, based on my past e?mail, that I see UTC as a compound of multiple, sometimes antagonistic tendencies. Just an example to help understand what I mean: When Karl Pentzlin proposed to encode a missing French abbreviation indicator, a typographer was directed to argue (on behalf of his employer IIUC) that this would be a case of encoding all scripts in bold and italic. The OP protested that it wasn?t, but he was overheard. That example raises much concern, the more as we were told on this List that decision makers in UTC are refusing to join in open and public discussions here, are only ?duelling ballot comments.? Now since regardless of being right in doing so, they did not at all, I?m plunged again into disarray. May I quote Germaine Tillion, a French ethnologue: It?s important to understand what happens to us; to understand is to exist. ? Originally, ?to exist? meant ?to stand out.? That is still somewhat implied in the strong sense of ?to exist.? Understanding does also help to overcome. That?s why I wrote one e?mail before: Nothing happens, or does not happen, without a good reason. Finding out what reason is key to recoverage. If we want to get what we need, we must do our homework first. Thanks for helping bring it to the point. Kind regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 14:41:38 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 18 Jan 2019 21:41:38 +0100 Subject: NNBSP In-Reply-To: <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> Message-ID: <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> On 18/01/2019 19:20, Asmus Freytag via Unicode wrote: > On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote: >>> >>> Covering existing character sets (National, International and Industry) was _an_ (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion). >>> >> I?d take this as a touchstone to infer that there were actual data files including standard typographic spaces as encoded in U+2000..U+2006, and electronic table layout using these: ?U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period.? >> Is that correct? 
> May I remind you that the beginnings of Unicode predate the development of the world wide web. By 1993 the web had developed to where it was possible to easily access material written in different scripts and languages, and by today it is certainly possible to "sample" material to check for character usage.
>
> When Unicode was first developed, it was best to work from the definition of character sets and to assume that anything encoded in a given set was also used somewhere. Several corporations had assembled supersets of character sets that their products were supporting. The most extensive was a collection from IBM. (I'm blanking out on the name for this.)
>
> These collections, which often covered international standard character sets as well, were some of the prime inputs into the early drafts of Unicode. With the merger with ISO 10646 some characters from that effort, but not in the early Unicode drafts, were also added.
>
> The code points from U+2000..U+2008 are part of that early collection.
>
> Note that prior to Unicode, no character set standard described in detail how characters were to be used (with the exception, perhaps, of control functions). Mostly, it was assumed that users knew what these characters were and the function of the character set was just to give a passive enumeration.
>
> Unicode's character property model changed all that - but that meant that properties for all of the characters had to be determined long after they were first encoded in the original sources, and with only scant hints of the identity of what these were intended to be. (Often, the only hint was a character name and a rather poor bitmapped image.)
>
> If you want to know the "legacy" behavior for these characters, it is more useful, therefore, to see how they have been supported in existing software, and how they have been used in documents since then. That gives you a baseline for understanding whether any change or clarification of the properties of one of these code points will break "existing practice".
>
> Breaking existing practice should be a dealbreaker, no matter how well-intentioned a change is. The only exception is where existing implementations are de facto useless, because of glaring inconsistencies or other issues. In such exceptional cases, deprecating some interpretations of a character may be a net win.
>
> However, if there's a consensus interpretation of a given character, then you can't just go in and change it, even if it would make that character work "better" for a given circumstance: you simply don't know (unless you research widely) how people have used that character in documents that work for them. Breaking those documents retroactively is not acceptable.
>
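The kind of survey suggested above is easy enough to approximate; a minimal sketch in Python (the file list and the selection of characters are only an example, not an established methodology):

    import sys
    from collections import Counter

    # Count occurrences of the fixed-width and no-break spaces discussed in
    # this thread across a set of text files, as a first look at "existing
    # practice" before proposing any property change.
    SPACES = ["\u00A0", "\u2007", "\u2008", "\u2009", "\u200A", "\u202F"]

    def survey(paths):
        counts = Counter()
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            for sp in SPACES:
                counts["U+%04X" % ord(sp)] += text.count(sp)
        return counts

    if __name__ == "__main__":
        for cp, n in sorted(survey(sys.argv[1:]).items()):
            print(cp, n)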
In version 11.0 the erroneous part is still uncorrected: ?If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1?.?? Note that TUS has typeset this with the precomposed U+00BE, not with plain digits and fraction slash. If U+2008 PUNCTUATION SPACE is used as intended, changing its line break property from A to GL does not break any implementation nor document. As of possible misuse of the character in ways other than intended, generally there is no point in using as breakable space a space that is actually just a thin variant of U+2007 FIGURE SPACE. Hence the question, again: Why was PUNCTUATION SPACE not declared as non-breakable? Marcel That sample also raises concern, as it showcases how much is done or not done, as appropriate, to keep NNBSP off the usage in Latin script. To what avail? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 15:03:55 2019 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Fri, 18 Jan 2019 21:03:55 +0000 Subject: NNBSP In-Reply-To: <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> Message-ID: I've been lurking on this thread a little. This discussion has gone ?all over the place?, however I?d like to point out that part of the reason NBSP has been used for thousands separators is because that it exists in all of those legacy codepages that were mentioned predating Unicode. Whether or not NNBSP provides a better typographical experience, there are a lot of legacy applications, and even web services, that depend on legacy codepages. NNBSP may be best for layout, but I doubt that making it work perfectly for thousand separators is going to be some sort of magic bullet that solves any problems that NBSP provides. If folks started always using NNBSP, there are a lot of legacy applications that are going to start giving you ? in the middle of your numbers.? Here?s a partial ?dir > out.txt? after changing my number thousands separator to NNBSP in French on Windows (for example). 13/01/2019 09:48 15?360 AcXtrnal.dll 13/01/2019 09:46 54?784 AdaptiveCards.dll 13/01/2019 09:46 67?584 AddressParser.dll 13/01/2019 09:47 24?064 adhapi.dll 13/01/2019 09:47 97?792 adhsvc.dll 10/04/2013 08:32 154?624 AdjustCalendarDate.exe 10/04/2013 08:32 1?190?912 AdjustCalendarDate.pdb 13/01/2019 10:47 534?016 AdmTmpl.dll 13/01/2019 09:48 58?368 adprovider.dll 13/01/2019 10:47 136?704 adrclient.dll 13/01/2019 09:48 248?832 adsldp.dll 13/01/2019 09:46 251?392 adsldpc.dll 13/01/2019 09:48 101?376 adsmsext.dll 13/01/2019 09:48 350?208 adsnt.dll 13/01/2019 09:46 849?920 adtschema.dll 13/01/2019 09:45 146?944 AdvancedEmojiDS.dll There are lots of web services that still don?t expect UTF-8 (I know, bad on them), and many legacy applications that don?t have proper UTF-8 or Unicode support (I know, they should be updated). It doesn?t seem to me that changing French thousands separator to NNBSP solves all of the perceived problems. 
-Shawn ???? ????? http://blogs.msdn.com/shawnste -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 16:05:21 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 18 Jan 2019 23:05:21 +0100 Subject: NNBSP In-Reply-To: <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> Message-ID: On 18/01/2019 20:09, Asmus Freytag via Unicode wrote: > > Marcel, > > about your many detailed *technical* questions about the history of character properties, I am afraid I have no specific recollection. > Other List Members are welcome to join in, many of whom are aware of how things happened. My questions are meant to be rather simple. Summing up the premium ones: 1. Why does UTC ignore the need of a non-breakable thin space? 2. Why did UTC not declare PUNCTUATION SPACE non-breakable? A less important information would be how extensively typewriters with proportional advance width were used to write books ready for print. Another question you do answer below: > French is not the only language that uses a space to group figures. In fact, I grew up with thousands separators being spaces, but in much of the existing publications or documents there was certainly a full (ordinary) space being used. Not surprisingly, because in those years documents were typewritten and even many books were simply reproduced from typescript. > > When it comes to figures, there are two different types of spaces. > > One is a space that has the same width a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up. > That is exactly how I understood hot-metal typesetting of tables. What surprises me is why computerized layout does work the same way instead of using tabulations and appropriate tab stops (left, right, centered, decimal [with all decimal separators lining up vertically). > > In lists like that, you can get away with not using a narrow thousands separator, because the overall context of the list indicates which digits belong together and form a number. Having a narrow space may still look nicer, but complicates the space fill between the symbol and the digits. > It does not, provided that all numbers have thousands separators, even if filling with spaces. It looks nicer because it?s more legible. > > Now for numbers in running text using an ordinary space has multiple drawbacks. It's definitely less readable and, in digital representation, if you use 0020 you don't communicate that this is part of a single number that's best not broken across lines. > Right. > > The problem Unicode had is that it did not properly understand which of the two types of "numeric" spaces was represented by "figure space". (I remember that we had discussions on that during the early years, but that they were not really resolved and that we moved on to other issues, of which many were demanding attention). 
> You were discussing whether the thousands separator should have the width of a digit or the width of a period? Consistently with many other choices, the solution would have been to encode them both as non-breakable, the more as both were at hand, leaving the choice to the end-user. Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow? ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The allaged reason is security. Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure out, not in book print > > If you want to do the right thing you need: > > (1) have a solution that works as intended for ALL language using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time. > That is how CLDR works. But as soon as that was set up, I started lobbying for support of all relevant locales at once: https://unicode.org/cldr/trac/ticket/11423 https://unicode.org/pipermail/cldr-users/2018-September/000842.html https://unicode.org/pipermail/cldr-users/2018-September/000843.html and https://unicode.org/cldr/trac/ticket/11423#comment:2 > Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differently acceptability of fallback renderings...). > No I don?t but people may wish to read German Wikipedia: https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen Shared in ticket #11423: https://unicode.org/cldr/trac/ticket/11423#comment:15 > (2) have a solution that works for lining figures as well as separators. > > (3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly that it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts, you can hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography. > There is no such problem except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves the NNBSP for what it is consistently used outside Mongolian: a non-breakable thin space, kind of a belated avatar of what PUNCTUATION SPACE should have been since the beginning. > > Perhaps you see why this issue has languished for so long: getting it right is not a simple matter. > Still it is as simple as not skipping PUNCTUATION SPACE when FIGURE SPACE was made non-breakable. Now we ended up with a mutated Mongolian Space that does not work properly for Mongolian, but does for French and other Latin script using languages. It would even more if TUS was blunter, urging all foundries to update their whole catalogue soon. Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 16:25:06 2019
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 18 Jan 2019 23:25:06 +0100
Subject: NNBSP
In-Reply-To: 
References: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr>
Message-ID: 

On 18/01/2019 22:03, Shawn Steele via Unicode wrote:
>
> I've been lurking on this thread a little.
>
> This discussion has gone "all over the place", however I'd like to point out that part of the reason NBSP has been used for thousands separators is because it exists in all of those legacy codepages that were mentioned predating Unicode.
>
> Whether or not NNBSP provides a better typographical experience, there are a lot of legacy applications, and even web services, that depend on legacy codepages. NNBSP may be best for layout, but I doubt that making it work perfectly for thousands separators is going to be some sort of magic bullet that solves the problems that NBSP presents.
>
> If folks started always using NNBSP, there are a lot of legacy applications that are going to start giving you ? in the middle of your numbers.
>
> Here's a partial "dir > out.txt" after changing my number thousands separator to NNBSP in French on Windows (for example).
>
> 13/01/2019  09:48            15?360 AcXtrnal.dll
> 13/01/2019  09:46            54?784 AdaptiveCards.dll
> 13/01/2019  09:46            67?584 AddressParser.dll
> 13/01/2019  09:47            24?064 adhapi.dll
> 13/01/2019  09:47            97?792 adhsvc.dll
> 10/04/2013  08:32           154?624 AdjustCalendarDate.exe
> 10/04/2013  08:32         1?190?912 AdjustCalendarDate.pdb
> 13/01/2019  10:47           534?016 AdmTmpl.dll
> 13/01/2019  09:48            58?368 adprovider.dll
> 13/01/2019  10:47           136?704 adrclient.dll
> 13/01/2019  09:48           248?832 adsldp.dll
> 13/01/2019  09:46           251?392 adsldpc.dll
> 13/01/2019  09:48           101?376 adsmsext.dll
> 13/01/2019  09:48           350?208 adsnt.dll
> 13/01/2019  09:46           849?920 adtschema.dll
> 13/01/2019  09:45           146?944 AdvancedEmojiDS.dll
>
> There are lots of web services that still don't expect UTF-8 (I know, bad on them), and many legacy applications that don't have proper UTF-8 or Unicode support (I know, they should be updated). It doesn't seem to me that changing the French thousands separator to NNBSP solves all of the perceived problems.
>
Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP. What are all these expected to do while localized with scripts outside Windows code pages?

Also, when you need those apps, just tailor your French accordingly. That should not impact all other users out there interested in a civilized layout, which we cannot get with NBSP (since NBSP is a justifying space, numbers are torn apart in justified layout), nor with FIGURE SPACE as recommended in UAX #14, because it's too wide and has no other benefit. BTW, FIGURE SPACE would end up as the same question mark in the Windows terminal, I guess, based on the above.

As long as Segoe UI has NNBSP support, no worries; that's what CLDR data is for. Any legacy program can always use downgraded data; you can even replace NBSP if the expected output is plain ASCII.
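As a rough sketch of such a downgrade (the file paths are purely illustrative, and whether to fall back to NBSP or to a plain ASCII space is the consumer's choice, not anything CLDR prescribes):

    # Sketch of "downgrading" locale data for legacy consumers: replace
    # U+202F NARROW NO-BREAK SPACE by U+00A0 NO-BREAK SPACE, or by a plain
    # ASCII space if the target really cannot go beyond ASCII.
    from pathlib import Path

    NNBSP = "\u202F"
    NBSP = "\u00A0"

    def downgrade(text, ascii_only=False):
        return text.replace(NNBSP, " " if ascii_only else NBSP)

    src = Path("common/main/fr.xml")   # a local copy of the CLDR file
    dst = Path("compat/fr.xml")        # the downgraded copy
    if src.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(downgrade(src.read_text(encoding="utf-8")), encoding="utf-8")

    # Why the downgrade matters at all: a legacy code page has no mapping for
    # NNBSP, so it degrades to '?', exactly as in the directory listing above.
    print("15\u202F360".encode("cp1252", errors="replace"))   # b'15?360'
    print("15\u00A0360".encode("cp1252", errors="replace"))   # b'15\xa0360'

Running the replacement once over a copy of the data keeps the original intact for Unicode-capable consumers.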
Downgrading is straightforward; the reverse is not true, and that is why vetters are working so hard during CLDR surveys. CLDR data is meant to be high-end; that is the only useful goal. Again, downgrading is easy: just run a tool on the data and the job is done. You'll end up with two libraries instead of one, but at least you're able to provide a good UX in environments supporting any UTF.

Best,

Marcel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 16:46:54 2019
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Fri, 18 Jan 2019 22:46:54 +0000
Subject: NNBSP
In-Reply-To: 
References: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr>
Message-ID: 

>> Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP.

I believe you'll find that there are some French banks and other institutions that depend on such obsolete applications (unfortunately).

Additionally, I believe you'll find that there are many scenarios where older applications and newer applications need to exchange data. Either across the network, the web, or even on the same machine. One app expecting NNBSP and another expecting NBSP on the same machine will likely lead to confusion. This could be something a "new" app running with the latest & greatest locale data and trying to import the legacy data users had saved on that app. Or exchanging data with an application using the system settings which are perhaps older.

>> Also when you need those apps, just tailor your French accordingly.

Having the user attempt to "correct" their settings may not be sufficient to resolve these discrepancies because not all applications or frameworks properly consider the user overrides on all platforms.

>> That should not impact all other users out there interested in a civilized layout.

I'm not sure that the choice of the word "civilized" adds value to the conversation. We have pretty much zero feedback that the OS's French formatting is "uncivilized" or that the NNBSP is required for correct support.

>> As long as Segoe UI has NNBSP support, no worries, that's what CLDR data is for.

For compatibility, I'd actually much prefer that CLDR have an alt "best practice" field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the "best practice" alternative data. As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a "legacy" alt form in a few years.

Normally I'm all for having the "best" data in CLDR, and there are many locales that have data with limited support for whatever reasons. U+00A0 is pretty exceptional in my view though; developers have been hard-coding dependencies on that value for ½ a century without even realizing there might be other types of non-breaking spaces. Sure, that's not really the best practice, particularly in modern computing, but I suspect you'll still find it taught in CS classes with little regard to things like NNBSP.

-Shawn
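One way to blunt the NBSP-versus-NNBSP confusion on the reading side, whatever the stored data happens to contain, is to accept any plausible group separator when parsing numbers back in. A rough sketch, not anything CLDR or ICU prescribes:

    import re

    # Accept any of the space-like characters seen in practice as group
    # separators: SPACE, NO-BREAK SPACE, THIN SPACE, NARROW NO-BREAK SPACE
    # and FIGURE SPACE. Purely a coping strategy for mixed legacy/modern data.
    GROUP_SEPARATORS = "\u0020\u00A0\u2009\u202F\u2007"

    def parse_grouped_int(text):
        cleaned = re.sub("[" + GROUP_SEPARATORS + "]", "", text.strip())
        if not cleaned.isdigit():
            raise ValueError("not a plain grouped integer: %r" % text)
        return int(cleaned)

    assert parse_grouped_int("1\u202F190\u2009912") == 1190912   # NNBSP + THIN SPACE
    assert parse_grouped_int("1\u00A0190 912") == 1190912        # NBSP + SPACE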
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 18:05:31 2019
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 19 Jan 2019 01:05:31 +0100
Subject: NNBSP
In-Reply-To: 
References: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr>
Message-ID: <9babf7ec-af68-f0f5-c04d-1cfaafdad8ac@orange.fr>

On 18/01/2019 23:46, Shawn Steele wrote:
>
> >> Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP.
>
> I believe you'll find that there are some French banks and other institutions that depend on such obsolete applications (unfortunately).
>
If they are obsolete apps, they don't use CLDR / ICU, as these are designed for up-to-date and fully localized apps. So one hassle is off the table.

> Additionally, I believe you'll find that there are many scenarios where older applications and newer applications need to exchange data. Either across the network, the web, or even on the same machine. One app expecting NNBSP and another expecting NBSP on the same machine will likely lead to confusion.
>
I didn't look into these data interchanges, but I suspect they won't use any thousands separator at all to interchange data. The group separator is only for display and print, and there you may wish to use a compat library for obsolete apps, and a newest library for apps with Unicode support. If an app is that obsolete, it will keep working without new data from ICU anyway.

> This could be something a "new" app running with the latest & greatest locale data and trying to import the legacy data users had saved on that app. Or exchanging data with an application using the system settings which are perhaps older.
>
Again, I don't believe that apps are storing numbers with thousands separators in them. Not even spreadsheet software does that. I say "not even" because these are high-end apps where the latest locale data is expected.

Sorry, you did skip this one:

>> What are all these expected to do while localized with scripts outside Windows code pages?

Indeed that is the paradox: Tirhuta users are entitled to correct display with the newest data, while Latin users are bothered indefinitely with old data and legacy display.

> >> Also when you need those apps, just tailor your French accordingly.
>
> Having the user attempt to "correct" their settings may not be sufficient to resolve these discrepancies because not all applications or frameworks properly consider the user overrides on all platforms.
>
Not the user. I'm addressing your concerns as coming from the developer side. I meant you should use the data as appropriate, and if a character is beyond support, just replace it for convenience.

> >> That should not impact all other users out there interested in a civilized layout.
>
> I'm not sure that the choice of the word "civilized" adds value to the conversation.
>
That is to express in a mouthful of English what user feedback is, or can be, even if not all the time. Users are complaining about quotation marks spaced off too far when typeset with NBSP, like Word does. It's really ugly, they say. NBSP is a character with a precise usage; it's not a one-size-fits-all. BTW, as you are in the job, why does Word not provide an option with a checkbox letting the user set the space as desired? NBSP or NNBSP.

>
> We have pretty much zero feedback that the OS's French formatting is "uncivilized" or that the NNBSP is required for correct support.
>
That is, at some point users stop submitting feedback when they see how little use it is to spend time posting it. From the pretty much zero you may wish to pick the one or two you get, guessing that for each one you get there are one thousand other users out there having the same feedback but not submitting it. One thousand or one million, it's hard to be precise...

> >> As long as Segoe UI has NNBSP support, no worries, that's what CLDR data is for.
>
> For compatibility, I'd actually much prefer that CLDR have an alt "best practice" field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the "best practice" alternative data. As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a "legacy" alt form in a few years.
>
You don't need that field in CLDR. Here's how it works: take the locale data, search-and-replace all NNBSP with NBSP, and there's the library you'll use. Because NNBSP is not only in the group separator. I'd suggest downloading common/main/fr.xml and checking all instances of NNBSP. The legacy apps you're referring to don't use that data for sure. That data is for fine high-end apps and for the user interfaces of Windows and any other OS. If you want your employer to be well served, you'd rather prefer the correct data, not legacy fallbacks.

> Normally I'm all for having the "best" data in CLDR, and there are many locales that have data with limited support for whatever reasons. U+00A0 is pretty exceptional in my view though; developers have been hard-coding dependencies on that value for ½ a century without even realizing there might be other types of non-breaking spaces. Sure, that's not really the best practice, particularly in modern computing, but I suspect you'll still find it taught in CS classes with little regard to things like NNBSP.
>
There have been threads about Unicode in CS curricula. I don't believe that teachers would be doing any good to their students by training them to ignore Unicode. Such teachers would be irresponsible, failing to prepare their students for real life. But I won't base any utterances on mere suspicions. BTW, Latin-1 did not exist 50 years ago. As a rough guess it came up in the early eighties, and NBSP with it, but I may be wrong. The point in sticking with old charsets is, again, to deny Unicode support to one third of mankind. I don't think that this is doing any good.

Marcel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 18:21:12 2019
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Sat, 19 Jan 2019 00:21:12 +0000
Subject: NNBSP
In-Reply-To: <9babf7ec-af68-f0f5-c04d-1cfaafdad8ac@orange.fr>
References: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> <9babf7ec-af68-f0f5-c04d-1cfaafdad8ac@orange.fr>
Message-ID: 

>> If they are obsolete apps, they don't use CLDR / ICU, as these are designed for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU. Obsolete apps run on Windows.
That statement is a little narrow-minded.

>> I didn't look into these data interchanges but I suspect they won't use any thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/

>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.

>> What are all these expected to do while localized with scripts outside Windows code pages?

(We call those "unicode-only" locales FWIW)

The users that are not supported by legacy apps can't use those apps (obviously). And folks are strongly encouraged to write apps (and protocols) that Use Unicode (I've blogged about that too). However, the fact that an app may run very poorly in Cherokee or whatever doesn't mean that there aren't a bunch of French enterprises that depend on that app for their day-to-day business. In order for the "unicode-only" locale users to use those apps, the app would need to be updated, or another app with the appropriate functionality would need to be selected. However, that still doesn't impact the current French users that are "ok" with their current non-Unicode app. Yes, I would encourage them to move to Unicode; however, they tend to not want to invest in migration when they don't see an urgent need.

Since Windows depends on CLDR and ICU data, updates to that data mean that those customers can experience pain when trying to upgrade to newer versions of Windows. We get those support calls; they don't tend to pester CLDR. Which is why I suggested an "opt-in" alt form that apps wanting "civilized" behavior could opt into (at least for long enough that enough badly behaved apps would be updated to warrant moving that to the default).

The data for locales like French tends to have been very stable for decades. Changes to data for major locales like that are more disruptive than to newer emerging markets where the data is undergoing more churn.

-Shawn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 18:53:16 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 19 Jan 2019 00:53:16 +0000
Subject: Loose character-name matching
In-Reply-To: <60797095-B703-4770-8F85-F045DDED4431@icloud.com>
References: <60797095-B703-4770-8F85-F045DDED4431@icloud.com>
Message-ID: <20190119005316.7fbb0469@JRWUBU2>

On Thu, 17 Jan 2019 18:44:50 -0500
"J. S. Choi" via Unicode wrote:

> I'm implementing a Unicode names library. I'm confused about loose character-name matching, even after rereading The Unicode Standard § 4.8, UAX #34 § 4, #44 § 5.9.2, as well as [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt), [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and the [meeting in which those two items were resolved](https://www.unicode.org/L2/L2014/14026.htm).
>
> In particular, I'm confused by the claim in The Unicode Standard § 4.8 saying, "Because Unicode character names do not contain any underscore ('_') characters, a common strategy is to replace any hyphen-minus or space in a character name by a single '_' when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages.
> Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching."

Unfortunately, the loose matching rules don't distinguish '__' and '_'. Note that '__' is sometimes forbidden in identifiers.

> I'm also confused by the relationship between UAX34-R3 and UAX44-LM2.
>
> To make these issues concrete, let's say that my library provides a function called getCharacter that takes a name argument, tries to find a loosely matching character, and then returns it (or a null value if there is no currently loosely matching character). So then what should the following expressions return?

Loose matching of names may be looser than prescribed; it shall not be stricter.

> getCharacter("HANGUL-JUNGSEONG-O-E")
U+1180 HANGUL JUNGSEONG O-E, or just possibly null.

> getCharacter("HANGUL_JUNGSEONG_O_E")
U+116C HANGUL JUNGSEONG OE*

> getCharacter("HANGUL_JUNGSEONG_O_E_")
U+116C

> getCharacter("HANGUL_JUNGSEONG_O__E")
U+116C

> getCharacter("HANGUL_JUNGSEONG_O_-E")
U+1180

> getCharacter("HANGUL JUNGSEONGCHARACTERO E")
null or U+116C - up to you. The sequence 'CHARACTER' shall not distinguish names, but loose matching is not required to know this fact.

> getCharacter("HANGUL JUNGSEONG CHARACTER OE")
null or U+116C - up to you.

> getCharacter("TIBETAN_LETTER_A")
U+0F68 TIBETAN LETTER A

> getCharacter("TIBETAN_LETTER__A")
U+0F68 TIBETAN LETTER A**

> getCharacter("TIBETAN_LETTER _A")
U+0F68

> getCharacter("TIBETAN_LETTER_-A")
U+0F60 TIBETAN LETTER -A

*This is unfortunate, as the usual symbolic name for U+1180 would be HANGUL_JUNGSEONG_O_E.

**This is also unfortunate, as the usual symbolic name for U+0F60 would be TIBETAN_LETTER__A.

The key problem here is that the hyphen after a space is required in names as understood by the name property. The hyphen is also required in "HANGUL JUNGSEONG O-E".

The simple tactic is:

1) Canonicalise, by stripping out spaces, underscores and medial hyphens and lowercasing. (It's probably better to fold the character U+0131 LATIN SMALL LETTER DOTLESS I to 'i'.)
2) Look the result up.
3) If you get the result U+116C but the input matches ".*[oO]-[eE][_- ]*$", convert to U+1180.
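In code, that tactic might look something like the rough sketch below; the INDEX dict is a toy stand-in for a real name index, and the details of fold() are one reading of the steps above, not anything normative:

    import re

    def fold(name):
        """Step 1: lowercase, drop spaces, underscores and *medial* hyphens
        (a hyphen directly between two letters/digits); also fold U+0131
        dotless i to plain i."""
        s = name.replace("\u0131", "i")
        s = re.sub(r"(?<=[0-9A-Za-z])-(?=[0-9A-Za-z])", "", s)   # medial hyphens
        return re.sub(r"[ _]", "", s).lower()

    # Toy stand-in for a full name index, keyed by folded name. U+1180
    # HANGUL JUNGSEONG O-E keeps its hyphen (the one UAX44-LM2 exception),
    # while the non-medial hyphen of TIBETAN LETTER -A survives folding.
    INDEX = {
        "hanguljungseongoe":  "\u116C",   # HANGUL JUNGSEONG OE
        "hanguljungseongo-e": "\u1180",   # HANGUL JUNGSEONG O-E (special case)
        "tibetanlettera":     "\u0F68",   # TIBETAN LETTER A
        "tibetanletter-a":    "\u0F60",   # TIBETAN LETTER -A
    }

    def get_character(name):
        cp = INDEX.get(fold(name))                        # steps 1 and 2
        if cp == "\u116C" and re.search(r"[oO]-[eE][_\- ]*$", name):
            cp = "\u1180"                                 # step 3
        return cp

    assert get_character("HANGUL_JUNGSEONG_O_E") == "\u116C"
    assert get_character("HANGUL_JUNGSEONG_O_-E") == "\u1180"
    assert get_character("HANGUL-JUNGSEONG-O-E") == "\u1180"
    assert get_character("TIBETAN_LETTER_-A") == "\u0F60"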
Symbolic identifiers in programs need not match the name; one may choose to depend on the compiler or interpreter to catch duplicates; some will, some won't. Replacing '-' by '_' to convert a name to an identifier loses the distinction between a hyphen and an arbitrarily inserted space.

Richard.

From unicode at unicode.org Fri Jan 18 18:55:07 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 18 Jan 2019 16:55:07 -0800
Subject: NNBSP
In-Reply-To: 
References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com>
Message-ID: <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 18:58:05 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 18 Jan 2019 16:58:05 -0800
Subject: NNBSP
In-Reply-To: 
References: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 19:49:09 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 19 Jan 2019 01:49:09 +0000
Subject: NNBSP
In-Reply-To: <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com>
References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com>
Message-ID: <20190119014909.4b093988@JRWUBU2>

On Fri, 18 Jan 2019 10:20:22 -0800
Asmus Freytag via Unicode wrote:

> However, if there's a consensus interpretation of a given character then you can't just go in and change it, even if it would make that character work "better" for a given circumstance: you simply don't know (unless you research widely) how people have used that character in documents that work for them. Breaking those documents retroactively is not acceptable.

Unless the UCD contains a contrary definition only usable where the character wouldn't normally be used, in which case it is fine to try to kick the character's users in the teeth. I am referring to the belief that ZWSP separated words, whereas the UCD only defined it as a lay-out control. That outlawed belief has recently been very helpful to me in using (as opposed to testing) a nod-Lana spell-checker on Firefox.

Richard.

From unicode at unicode.org Sat Jan 19 01:34:13 2019
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 19 Jan 2019 08:34:13 +0100
Subject: NNBSP
In-Reply-To: <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com>
References: <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com>
Message-ID: 

On 19/01/2019 01:55, Asmus Freytag via Unicode wrote:
> On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:
>> On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:
>>>
>>> Marcel,
>>>
>>> about your many detailed *technical* questions about the history of character properties, I am afraid I have no specific recollection.
>>>
>> Other List Members are welcome to join in, many of whom are aware of how things happened. My questions are meant to be rather simple. Summing up the main ones:
>>
>> 1. Why does UTC ignore the need of a non-breakable thin space?
>> 2. Why did UTC not declare PUNCTUATION SPACE non-breakable?
>>
>> A less important piece of information would be how extensively typewriters with proportional advance width were used to write books ready for print.
>> >> Another question you do answer below: >> >>> French is not the only language that uses a space to group figures. In fact, I grew up with thousands separators being spaces, but in much of the existing publications or documents there was certainly a full (ordinary) space being used. Not surprisingly, because in those years documents were typewritten and even many books were simply reproduced from typescript. >>> >>> When it comes to figures, there are two different types of spaces. >>> >>> One is a space that has the same width a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up. >>> >> That is exactly how I understood hot-metal typesetting of tables. What surprises me is why computerized layout does work the same way instead of using tabulations and appropriate tab stops (left, right, centered, decimal [with all decimal separators lining up vertically). > > ==> At the time Unicode was first created (and definitely before that, during the time of non-universal character sets) many applications existed that used a "typewriter model" and worked by space fill rather than decimal-point tabulation. > If you are talking about applications, as opposed to typesetting tables for book printing, then I?d suggest that the fixed-width display of tables could be done much like still today?s source code layout, where normal space is used for that purpose. In this use case, line wrap is typically turned off. That could make non-breakable spaces sort of pointless (but I?m aware of your point below), except if people are expected to re-use the data in other environments. In that case, best practice is to use NNBSP as thousands separator while displaying it like other monospace characters. That?s at least how today?s monospace fonts work (provided they?re used in environments actually supporting Unicode, which may not happen with applications running in terminal). > > From today's perspective that older model is inflexible and not the best approach, but it is impossible to say how long this legacy approach hung on in some places and how much data might exist that relied on certain long-standing behaviors of these space characters. > My position since some time is that legacy apps should use legacy libraries. But I?ll come back on this when responding to Shawn Steele. > > For a good solution, you always need to understand > > (1) the requirement of your "index" case (French, in this case) > That?s okay. > > (2) how it relates to similar requirements in (all!) other languages / scripts > That?s rather up to CLDR as I suggested, given it has the means to submit a point to all vetters. See again below (in the part that you?ve cut off without consideration). > > (3) how it relates to actual legacy practice > That?s Shawn Steele?s point (see next reply). > > (3a) what will suddenly no longer work if you change the properties on some character > > (3b) what older data will no longer work if the effective behavior of newer applications changes > I?ll already note that this needs to be aware of actual use cases and/or to delve into the OSes, that is far beyond what I can currently do, both wrt time and wrt resources. The vetter?s role is to inform CLDR with correct data from their locale. 
CLDR is then welcome to sort things out and to get in touch with the industry, which the CLDR TC is actually doing. But that has no impact on the data submitted at survey time. Changing votes to say "OK, let the group separator be NBSP as long as..." would be a lie.

>>> In lists like that, you can get away with not using a narrow thousands separator, because the overall context of the list indicates which digits belong together and form a number. Having a narrow space may still look nicer, but complicates the space fill between the symbol and the digits.
>>>
>> It does not, provided that all numbers have thousands separators, even if filling with spaces. It looks nicer because it's more legible.
>>>
>>> Now for numbers in running text using an ordinary space has multiple drawbacks. It's definitely less readable and, in digital representation, if you use 0020 you don't communicate that this is part of a single number that's best not broken across lines.
>>>
>> Right.
>>>
>>> The problem Unicode had is that it did not properly understand which of the two types of "numeric" spaces was represented by "figure space". (I remember that we had discussions on that during the early years, but that they were not really resolved and that we moved on to other issues, of which many were demanding attention).
>>>
>> You were discussing whether the thousands separator should have the width of a digit or the width of a period? Consistently with many other choices, the solution would have been to encode them both as non-breakable, all the more as both were at hand, leaving the choice to the end-user.
>
> ==> Right, but remember, we started off encoding a set of spaces that existed before Unicode (in some other character sets) and implicitly made the assumption that those were the correct set (just like we took punctuation from ASCII and similar sources and only added to it later, when we understood that they were missing things --- generally always added, generally did not redefine behavior or shape of existing code points).
>
Now I understand that what UAX #14 calls "the preferred space for use in numbers" is actually preferred in the table layout you are referring to, because it is easier to code when only the empty decimal separator position uses PUNCTUATION SPACE, while grouping is performed with FIGURE SPACE. That raises two questions, one of which has often been asked in this thread:

1. How is FIGURE SPACE supposed to be supported in legacy environments? (UAX #14 mentions both its line breaking behavior and its width, but makes no concessions for legacy apps?)
2. Why was PUNCTUATION SPACE not declared non-breakable? (If it had been, it could have been re-purposed to space off French punctuation since the beginning of Unicode, and French users would never have had a reason to be upset by the lack of a narrow non-breaking space.)

>> Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow?
>
> ==> probably not in the early days. Y
>
Perhaps it was ignored from the beginning on, like Philippe Verdy reports that UTC ignored later demands, getting users upset. That leaves us with the question why it did so, downstream of your statement that it was not what I ended up suspecting. Does "Y" stand for the peace symbol?

>> ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The alleged reason is security.
Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure out, not in book print > > ==> much book printing was also done by photomechanically reproducing typescript at that time. Not everybody wanted to pay typesetters and digital typesetting wasn't as advanced. I actually did use a digital phototypesetter of the period a few years before I joined Unicode, so I know. It was more powerful than a typewriter, but not as powerful as TeX or later the Adobe products. > > For one, you didn't typeset a page, only a column of text, and it required manual paste-up etc. > Did you also see typewriters with proportional advance width (and interchangeable type wheels)? That was the high end on the typewriter market. (Already mentioned these typewriters in a previous e?mail.) Books typeset this way could use bold and (less easy) italic spans. > >>> If you want to do the right thing you need: >>> >>> (1) have a solution that works as intended for ALL language using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time. >>> >> That is how CLDR works. > > CLDR data is by definition per-language. Except for inheritance, languages are independent. > > There are no "French" characters. When you encode characters, at best, some code points may be script-specific. For punctuation and spaces not even that may be the case. Therefore, as long as you try to solve this as if it *only* was a French problem, you are not doing proper character encoding. > Again, I did not do that (and BTW CLDR is not doing ?character encoding?). Actually, to be able to post that blame you needed to cut off all the URLs I provided you with. These links are documenting that i did not ?try to solve this as if it only was a French problem[.]? Here they are again, this time with copy-pasted snippets below. I wrote: ?But as soon as that was set up, I started lobbying for support of all relevant locales at once:? https://unicode.org/cldr/trac/ticket/11423 https://unicode.org/pipermail/cldr-users/2018-September/000842.html * ?To be cost-effective, locales using space as numbers group separator should migrate at once from the wrong U+00A0 to the correct U+202F. I didn?t aim at making French stand out, but at correcting an error in CLDR. Having even the Canadian French sublocale stick with the wrong value makes no sense and is mainly due to opaque inheritance relationships and to severe constraints on vetters applying for fr-FR and subsequently reduced to look on helpless from the sidelines when sublocales are not getting fixed.? * ?After having painstakingly catched up support of some narrow fixed-width no-break space (U+202F). the industry is now ready to migrate from U+00A0 to U+202F. Doing it in a single rush is way more cost-effective than migrating one locale this time, another locale next time, a handful locales the time after, possibly splitting them up in sublocales with different migration schedules. I really believed that now Unicode proves ready to adopt the real group separator in French, all relevant locales would be consistently pushed for correcting that value in release 34. The v34 alpha overview makes clear they are not. ? http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration I aimed at correcting an error in CLDR, not at making French stand out. 
Having many locales and sublocales stick with the wrong value makes no sense any more. ?https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d The only effect is implementers skipping migration for fr-FR while waiting for the others to catch up, then doing it for all at once. There seems to be a misunderstanding: The*locale setting *is whether to use period, comma, space, apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic. Whether "space" is NO-BREAK SPACE or NARROW NO-BREAK SPACE is *not a locale setting,* but it?s all about Unicode *design* and Unicode *implementation.* I really thought that that was clear and that there?s no need to heavily insist on the ST "French" forum. When referring to the "French thousands separator" I only meant that unlike comma- or period-using locales, the French locale uses space and that the group separator space should be the correct one. That did *not* mean that French should use *another* space than the other locales using space.? https://unicode.org/pipermail/cldr-users/2018-September/000843.html and https://unicode.org/cldr/trac/ticket/11423#comment:2 * ?I've to confess that I did focus on French and only applied for fr-FR, but there was a lot of work, see ? http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth waiting for very few vetters. Nevertheless I also cared for English (see various tickets), and also posted on CLDR-users in a belated P.S. that fr-CA hadn?t caught up the group separator correction yet: ?https://unicode.org/pipermail/cldr-users/2018-August/000825.html Also I?m sorry for failing to provide appropriate feedback after beta release and to post upstream messages urging to make sure all locales using space for group separator be kept in synchrony. I think the point about not splitting up all the data into locales is a very good one. There should be a common pool so that all locales using Arabic script have automatically group separator set to ARABIC THOUSANDS SEPARATOR (provided it actually fits all), and those locales using space should only need to specify "space" to automatically get the correct one, ie NARROW NO-BREAK SPACE as soon as Unicode is ready to give it currency in that role.? Do these recommendations meet your requirements and sound okay to you? >>> >>> Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differently acceptability of fallback renderings...). >>> >> No I don?t but people may wish to read German Wikipedia: >> >> https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen >> >> Shared in ticket #11423: >> https://unicode.org/cldr/trac/ticket/11423#comment:15 > > > ==> for your proposal to be effective, you need to reach out. > Basically we vetters are just reporting the locale date. Beyond that, I?ve already conceded a huge effort in reporting bugs in English data and in communicating on lists and fora, including German (since the current survey that has a very limited scope). I have limited time and resources. Normally reaching out to all relevant locales is what CLDR can do best, by posting guidelines. by e-mailing (on behalf of CLDR administrator and/or on the public CLDR-users Mail List), and by prioritizing the items on the vetters? dashboards. If I can do something else, I?m ready but people should not abuse since I?ve many other tasks I won?t be going to deprioritize any longer. 
At some point I?ll just start reporting to end-users that we?ve strived to get locale data in synch, but that CLDR ended up rolling back our efforts, alleging other priorities. If that is what you wish, I?d say that there?s no problem for me except that I strongly dislike documenting an ugly mess. > >> >>> (2) have a solution that works for lining figures as well as separators. >>> >>> (3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly that it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts, you can hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography. >>> >> There is no such problem except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves the NNBSP for what it is consistently used outside Mongolian: a non-breakable thin space, kind of a belated avatar >> of what PUNCTUATION SPACE should have been since the beginning. > > ==> I mentioned before that if something is universally "broken" it can sometimes be resurrected, because even if you change its behavior retroactively, it will not change something that ever worked correctly. (But you need to be sure that nobody repurposed the NNBSP for something useful that is different from what you intend to use it for, otherwise you can't change anything about it). > You may wish to look up Unicode?s own PRI#308 background page, where they already hinted they?ve made sure it isn?t. http://www.unicode.org/review/pri308/pri308-background.html https://www.unicode.org/review/pri308/ https://www.unicode.org/review/pri308/feedback.html > If, however, you are merely adding a use for some existing character that does not affect its properties, that is usually not as much of a problem - as long as we can have some confidence that both usages will continue to be possible. > Actually, again, there is a problem with NNBSP in Mongolian. Richard Wordingham reported at thread launch that Unicode have started tweaking that space in a way that makes it unfit for French. Now since you are aware that this operating mode is wrong, I?d suggest that you reach back to them providing feedback about inappropriateness of last changes. Other people (including me) may do that as well, but I see better chances for your recommendations to get implemented. I say that because lastly I strongly recommended in several pieces of feedback that the math symbols should not be bidi-mirrored on a tilde?reversed-tilde basis, because mirroring these compromises legibility of the tilde symbol in low-end environments relying on glyph-exchange-bidi-mirroring for best-fit display, but UTC took no action, and off-list I was taught that UTC is not interested. Nothing else than that, in private mail. UTC are just not interested, without providing any technical reasons. 
Perhaps you better understand now why I posted what I suspected to be the reason why UTC is not interested, or was not interested, in supporting a narrow non-breaking space unless Mongolian was encoded and needed the same for the purpose of appending suffixes (as opposed to separating vowels, which is performed by a similar space with another shaping behavior, and proper to Mongolian). A hypothesis that you firmly dissipated in the wake, but without answering my question about */why UTC was ignoring the demand for a narrow non-breaking space, delaying support for French and heavily impacting French implementations still today/* due to less font support than if that space were in Unicode from version?1.1 on. > >>> Perhaps you see why this issue has languished for so long: getting it right is not a simple matter. >>> >> Still it is as simple as not skipping PUNCTUATION SPACE when FIGURE SPACE was made non-breakable. Now we ended up with a mutated Mongolian Space that does not work properly for Mongolian, but does for French and other Latin script using languages. It would even more if TUS was blunter, urging all foundries to update their whole catalogue soon. > > ==> You realize that I'm giving you general advice here, not something utterly specific to NNBSP - I don't have the inputs and background to know whether your approach is feasible or perhaps the best possible? > It is not ?my approach?. Other List Members may wish to help you answer my questions. > > As for PUNCTUATION SPACE - some of the spaces have acquired usage in math (as part of the added math support in Unicode 3.2). We need to be sure that the assumptions about these that may have been made in math typesetting? are not invalidated. > That adds to the reasons why I?m asking why PUNCTUATION SPACE was not made non-breakable when FIGURE SPACE was. The math usage has probably originated in repurposing that space on the basis of it?s line breaking behavior. I don?t suggest to make it non-breakable now. That deal was broken and will remain broken. Now we must live with NNBSP and get more font support, while trying to stop Unicode from making a mess of it that neither helps Mongolian nor French nor all (other) locales grouping digits with a narrow space. > > Not sure offhand whether UTR#25 captures all of that, but if you ever feel like proposing a property change you MUST research that first (with the current maintainers of that UTR or other experts). > I have NOT proposed any property change, and PUNCTUATION SPACE or "2008" are NOT found in UTR #25 (Unicode Support for Mathematics). > > This is the way Unicode is different from CLDR. > Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 02:42:55 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 19 Jan 2019 00:42:55 -0800 Subject: NNBSP In-Reply-To: References: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com> Message-ID: <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Jan 19 02:58:27 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 19 Jan 2019 09:58:27 +0100 Subject: NNBSP In-Reply-To: References: <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> <9babf7ec-af68-f0f5-c04d-1cfaafdad8ac@orange.fr> Message-ID: <25b6b9c4-d994-599c-e798-a3798e04b1f1@orange.fr> On 19/01/2019 01:21, Shawn Steele wrote: > > *>> *If they are obsolete apps, they don?t use CLDR / ICU, as these are designed for up-to-date and fully localized apps. So one hassle is off the table. > > Windows uses CLDR/ICU.? Obsolete apps run on Windows.? That statement is a little narrowminded. > > >> I didn?t look into these date interchanges but I suspect they won?t use any thousands separator at all to interchange data. > > Nope > > >> The group separator is only for display and print > > Yup, and people do the wrong thing so often that I even blogged about it. https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/ > Thanks for sharing. As it happens, I like most the first reason you provide: * ?The most obvious reason is that there is a bug in the data and we had to make a change. (Believe it or not we make mistakes ;-))? In this case our users (and yours too) want culturally correct data, so we have to fix the bug even if it breaks existing applications.? No comment :) > >> Sorry you did skip this one: > > Oops, I did mean to respond to that one and accidentally skipped it. > No problem. > > >> What are all these expected to do while localized with scripts outside Windows code pages? > > (We call those ?unicode-only? locales FWIW) > Noted. > > The users that are not supported by legacy apps can?t use those apps (obviously).? And folks are strongly encouraged to write apps (and protocols) that Use Unicode (I?ve blogged about that too). > Like here: https://blogs.msdn.microsoft.com/shawnste/2009/06/01/writing-fields-of-data-to-an-encoded-file/ You?re showcasing that despite ?The moral here is ?Use Unicode??? some people are still not using it. The stuff gets even weirder as you state that code pages and Unicode are not 1:1, contradicting the Unicode design principle of roundtrip compatibility. The point in not using Unicode, and likewise in not using verbose formats, is limited hardware resources. Often new implementations are built on top of old machines and programs, for example in the energy and shipping industies. This poses a security threat, ending up in power outages and logistic breakdowns. That is making our democracies vulnerable. Hence maintaining obsolete systems does not pay back. We?re all better off when recycling all the old hardware and investing in latest technologies, implementing Unicode by the way. What you are advocating in this thread seems like a non-starter. > However, the fact that an app may run very poorly in Cherokee or whatever doesn?t mean that there aren?t a bunch of French enterprises that depend on that app for their day-to-day business. > They?re ill-advised in doing so (see above). > > In order for the ?unicode-only? locale users to use those apps, the app would need to be updated, or another app with the appropriate functionality would need to be selected. > To be ?selected?, not developed and built. The job is already done. 
What are people waiting for? > > However, that still doesn?t impact the current French users that are ?ok? with their current non-Unicode app.? Yes, I would encourage them to move to Unicode, however they tend to not want to invest in migration when they don?t see an urgent need. > They may not see it because they?re lacking appropriate training in cyber security. You seem to be backing that unresponsive behavior. I can?t see that you may be doing any good by doing so, and I?d strongly advise you to reach out to your customers, or check the issue with your managers. We?re in a time where companies are still making huge benefits, and it is unclear where all that money goes once paid out to shareholders. The money is there, you only need to market the security. That job would better use your time than tampering with legacy apps. > > Since Windows depends on CLDR and ICU data, updates to that data means that those customers can experience pain when trying to upgrade to newer versions of Windows.? We get those support calls, they don?t tend to pester CLDR. > Am I pestering CLDR? Keeping CLDR in synch is just the right way to go. Since we?re on it: Do you have any hints about why some powerful UTC members seem to hate NNBSP in French? I?m mainly talking about French punctuation spacing here. > > Which is why I suggested an ?opt-in? alt form that apps wanting ?civilized? behavior could opt-into (at least for long enough that enough badly behaved apps would be updated to warrant moving that to the default.) > Asmus Freytag?s proposal seems better: ?having information on "common fallbacks" would be useful. If formatting numbers, I may be free to pick the "best", but when parsing for numbers I may want to know what deviations from "best" practice I can expect.? Because if you let your customers ?opt in? instead of urging them to update, some will never opt in, given they?re not even ready to care about cyber security. > > The data for locales like French tends to have been very stable for decades.? Changes to data for major locales like that are more disruptive than to newer emerging markets where the data is undergoing more churn. > Happy for them. Ironically the old wealthy markets are digging the trap they?ll be caught in, instead of investing in cybersecurity. Best wishes, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 03:51:46 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 19 Jan 2019 10:51:46 +0100 Subject: NNBSP In-Reply-To: <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> References: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com> <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> Message-ID: <2ff452c5-a7c1-f6d0-74ea-c08252470cad@orange.fr> On 19/01/2019 09:42, Asmus Freytag via Unicode wrote: > [?] > > For one, many worthwhile additions / changes to Unicode depend on getting written up in proposal form and then championed by dedicated people willing to see through the process. Usually, Unicode has so many proposals to pick from that at each point there are more than can be immediately accommodated. There's no automatic response to even issues that are "known" to many people. 
> > "Demands" don't mean a thing, formal proposals, presented and then refined based on feedback from the committee is what puts issues on the track of being resolved. > That is also what I suspected, that the French were not eager enough to get French supported, as opposed to the Vietnamese who lobbied long before the era of proposals and UTC meetings. Please,/where can we find the proposals for FIGURE SPACE to become non-breakable, and for PUNCTUATION SPACE to stay or become breakable?/ (That is not a rhetoric question. The ideal answer is a URL. Also, that is not about pre-Unicode documentation, but about the action that Unicode took in that era.) > > [?] > > Yes, I definitely used an IBM Selectric for many years with interchangeable type wheels, but I don't remember using proportional spacing, although I've seen it in the kinds of "typescript" books I mentioned. Some had that crude approximation of typesetting. > Thanks for reporting. > > When Unicode came out, that was no longer the state of the art as TeX and laser printers weren't limited that way. > > However, the character sets from which Unicode was assembled (or which it had to match, effectively) were designed earlier - during those times. And we inherited some things (that needed to be supported so round-trip mapping of data was possible) but that weren't as well documented in their particulars. > > I'm sure we'll eventually deprecate some and clean up others, like the Mongolian encoding (which also included some stuff that was encoded with an understanding that turned out less solid in retrospect than we had thought at the time). > > Something the UTC tries very hard to avoid, but nobody is perfect. It's best therefore to try not to ascribe non-technical motives to any action or inaction of the UTC. What outsiders see is rarely what actually went down, > That is because the meeting minutes would gain in being more explicit. > > and the real reasons for things tend to be much less interesting from an interpersonal? or intercultural perspective. > I don?t care about ?interesting? reasons. I?d just appreciate to know the truth. > > So best avoid that kind of topic altogether and never use it as basis for unfounded recriminations. > When you ask for knowing the foundations and that knowledge is persistently refused, you end up believing that those foundations just can?t be told. Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere, where it actually belongs to. I?d kindly request not to be considered a hypocrite that in reality keeps blaming the UTC. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 05:53:01 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 19 Jan 2019 11:53:01 +0000 Subject: NNBSP In-Reply-To: <2ff452c5-a7c1-f6d0-74ea-c08252470cad@orange.fr> References: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com> <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> <2ff452c5-a7c1-f6d0-74ea-c08252470cad@orange.fr> Message-ID: Marcel Schneider wrote, > When you ask for knowing the foundations and that knowledge is persistently refused, > you end up believing that those foundations just can?t be told. 
> > Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere, where it > actually belongs to. Why not think of it as a learning curve?? Early concepts and priorities were made from a lower position on that curve.? We can learn from the past and apply those lessons to the future, but a post-mortem seldom benefits the cadaver. Minutiae about decisions made long ago probably exist, but may be presently poorly indexed/organized and difficult to search/access. As the collection of encoding history becomes more sophisticated and the searching technology becomes more civilized, it may become easier to glean information from the archives. (OT - A little humor, perhaps... On the topic of Francophobia, it is true that some of us do not like dead generalissimos.? But most of us adore the French for reasons beyond Brigitte Bardot and bon-bons.? Cuisine, fries, dip, toast, curls, culture, kissing, and tarts, for instance.? Not to mention cognac and champagne!) From unicode at unicode.org Sat Jan 19 12:19:35 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Sat, 19 Jan 2019 18:19:35 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: Asmus Freytag wote: > This is an effort that's out of scope for Unicode to implement, or, I > should say, if the Consortium were to take it on, it would be a > separate technical standard from The Unicode Standard. I note what you say, but what concerns me is that there seem to be an increasing number of matters where things are being done and neither The Unicode Standard nor ISO/IEC 10646 include them but they are in side-documents just at the Unicode website. My understanding is that in some countries they will only use ISO/IEC 19646 and not relate (is that the word?) to Unicode. There are already issues over emoji ZWJ sequences that produce new meanings such as man ZWJ rocket producing the new meaning of astronaut and the 'base character plus tag characters' sequences to indicate a Welsh flag and a Scottish flag and if something is now done for italics (depending upon what it is that is done) the divergence between the two 'groups of documents' widens even if at a precise 'definition of scope' meaning ISO/IEC and The Unicode Standard do not diverge. > PS: I really hate the creeping expansion of pseudo-encoding via VS > characters. Well, a variation sequence character is being used for requesting emoji display (is that a control code?), so it seems there is no lack of precedent to use one for italics. It seems that someone only has to say 'out of scope' and then that is the veto for any consideration of a new idea for ISO/IEC 10646 or The Unicode Standard. There seems to be no way for a request to the committee to consider a widening of the scope to even be put before the committee if such a request is from someone outside the inner circle. > The only worse thing is adding novel control functions. For example? Would you be including things like changing the colour of the jacket that an emojiperson is wearing? 
It seems to me that it would be useful to have some codes that are ordinary characters in some contexts yet are control codes in others, for example for drawing simple line graphic diagrams within a document, such that they are just ordinary characters in a text document but, say, draw an image when included within a PDF (Portable Document Format) document. Their use would be optional, so that people who did not want to use them could just ignore them, and applications that did not use them as control codes could just display a glyph for each character. Yet there could be great possibilities for them if the chance to get them into ISO/IEC 10646 and The Unicode Standard were possible. William Overington Saturday 19 January 2019 From unicode at unicode.org Sat Jan 19 14:34:48 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 19 Jan 2019 20:34:48 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> On 2019-01-19 6:19 PM, wjgo_10009 at btinternet.com wrote: > It seems to me that it would be useful to have some codes that are > ordinary characters in some contexts yet are control codes in others, ... Italics aren't a novel concept. The approach for encoding new characters is that conventions for them exist and that people *are* exchanging them, people have exchanged them in the past, or that people demonstrably *need* to exchange them. Excluding emoji, any suggestion or proposal whose premise is "It seems to me that it would be useful if characters supporting ..." is doomed to be deemed out of scope for the standard. From unicode at unicode.org Sat Jan 19 15:17:34 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 19 Jan 2019 13:17:34 -0800 Subject: Encoding italic In-Reply-To: <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> Message-ID: <2dfeeebe-86b5-aeed-1b6e-3588ef2d654b@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 15:24:26 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 19 Jan 2019 13:24:26 -0800 Subject: NNBSP In-Reply-To: References: <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com> <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> <2ff452c5-a7c1-f6d0-74ea-c08252470cad@orange.fr> Message-ID: An HTML attachment was scrubbed...
URL: From unicode at unicode.org Sat Jan 19 19:18:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 20 Jan 2019 01:18:19 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> Message-ID: <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> Victor Gaultney wrote, > If however, we say that this "does not adequately consider the harm done > to the text-processing model that underlies Unicode", then that exposes a > weakness in that model. That may be a weakness that we have to accept for > a variety of reasons (technical difficulty, burden on developers, UI impact, > cost, maturity). Unicode's character encoding principles and underlying text-processing model remain robust.? They are the foundation of modern computer text processing.? The goal of ???? ???????? ??????????? needs to accommodate the best expectations of the end users and the fact that the consistent approach of the model eases the software people's burdens by ensuring that effective programming solutions to support one subset or range of characters can be applied to the other subsets of the Unicode repertoire.? And that those solutions can be shared with other developers in a standard fashion. Assigning properties to characters gives any conformant application clear instructions as to what exactly is expected as the app encounters each character in a string.? In simpler times, the only expectation was that the application would splat a glyph onto a screen (and/or sheet of paper) and store a binary string for later retrieval.? We've moved forward. 'Unicode encodes characters, not glyphs' is a core principle. There's a legitimate concern whenever anyone is perceived as heading into the general direction of turning the character encoding into a glyph registry, as it suggests a possible step backwards and might lead to a slippery slope.? For example, if italics are encoded, why not fraktur and Gaelic?? The notion that any given system can't be improved is static.? ("System" refers to Unicode's repertoire and coverage rather than its core principles.? Core principles are rock solid by nature.) ? /ne plus ultra/ ? "Conversely, significant differences in writing style for the same script may be reflected in the bibliographical classification?for example, Fraktur or Gaelic styles for the Latin script. Such stylistic distinctions are ignored in the Unicode Standard, which treats them as presentation styles of the Latin script."? Ken Whistler, http://unicode.org/reports/tr24/ ? "Static" can be interpreted as either virtually catatonic or radio noise.? Either is applicable here. From unicode at unicode.org Sat Jan 19 19:30:37 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Sun, 20 Jan 2019 02:30:37 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> Message-ID: (I have skipped some messages in this thread, so maybe the following has been pointed out already. Apologies for this message if so.) You will not like this... But... 
There is already a standardised, "character level" (well, it is from a character standard, though a more modern view would be that it is a higher level protocol) way of specifying italics (and bold, and underline, and more): \u001b[3mbla bla bla\u001b[0m Terminal emulators implement some such escape sequences. The terminal emulators I use support bold (1 after the [) but not italic (3). Every time you use the "man" command in a Linux/Unix/similar terminal you "use" the escape sequences for bold and underline... Other terminal-based programs often use bold as well as colour esc-sequences for emphasis as well as for warning/error messages, and other "hints" of various kinds. For xterm, see: https://www.xfree86.org/4.8.0/ctlseqs.html. So I don't see these esc-sequences becoming obsolete any time soon. But I don't foresee them being supported outside of terminal emulators either... (Though for style esc-sequences it would certainly be possible. And a "smart" cut-and-paste operation could auto-insert an esc-sequence that sets the style after the paste to the one before the paste...) Had HTML (somehow, magically) been invented before terminals, maybe terminals (terminal emulators) would have used some kind of "mini-HTML" instead. But things are like they are on that point. /Kent Karlsson PS The cut-and-paste I used here converts (imperfectly: bold is lost and a spurious ! inserted) to HTML (surely going through some internal attribute-based representation, the HTML being generated when I press send):

man(1)                                                          man(1)
NAME
       man - format and display the on-line manual pages
SYNOPSIS
       man [-acdfFhkKtwW] [--path] [-m system] [-p string] [-C config_file]
       [-M pathlist] [-P pager] [-B browser] [-H htmlpager] [-S section_list]
       [section] name ...

Den 2019-01-18 20:18, skrev "Asmus Freytag via Unicode" : > > > I would fully agree and I think Mark puts it really well in the message below > why some of the proposals brandished here are no longer plain text but > "not-so-plain" text. > > > I think we are better served with a solution that provides some form of > "light" rich text, for basic emphasis in short messages. The proper way for > this would be some form of MarkDown standard shared across vendors, and > perhaps implemented in a way that users don't necessarily need to type > anything special, but that, if exported to "true" plain text, it turns into > the source format for the "light" rich text. > > > This is an effort that's out of scope for Unicode to implement, or, I should > say, if the Consortium were to take it on, it would be a separate technical > standard from The Unicode Standard. > > > > A./ > > > PS: I really hate the creeping expansion of pseudo-encoding via VS characters. > The only worse thing is adding novel control functions. > > > > > > On 1/18/2019 7:51 AM, Mark E. Shoulson via Unicode wrote: > > >> On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote: >> >>> >>> Encoding 'begin italic' and 'end italic' would introduce difficulties when >>> partial strings are moved, etc. But that's no different than with current >>> punctuation. If you select the second half of a string that includes an end >>> quote character you end up with a mismatched pair, with the same problems of >>> interpretation as selecting the second half of a string including an 'end >>> italic' character. Apps have to deal with it, and do, as in code editors. >>> >>> >> It kinda IS different. If you paste in half a string, you get a mismatched >> or unmatched paren or quote or something. A typo, but a transient one.
It >> looks bad where it is, but everything else is unaffected. It's no worse than >> hitting an extra key by mistake. If you paste in a "begin italic" and miss >> the "end italic", though, then *all* your text from that point on is >> affected! (Or maybe "all until a newline" or some other stopgap ending, but >> that's just damage-control, not damage-prevention.) Suddenly, letters and >> symbols five words/lines/paragraphs/pages away look different, the pagination is >> all altered (by far more than merely a single extra punctuation mark, since >> italic fonts generally are narrower than roman). It's a disaster. >> >> No. This kind of statefulness really is beyond what Unicode is designed to >> cope with. Bidi controls are (almost?) the sole exception, and even they >> cause their share of headaches. Encoding separate _text_ italics/bold is IMO >> also a disastrous idea, but I'm not putting out reasons for that now. The >> only really feasible suggestion I've heard is using a VS in some fashion. >> (Maybe let it affect whole words instead of individual characters? Makes for >> fewer noisy VSs, but introduces a whole other host of limitations (how to >> italicize part of a word, how to italicize non-letters...) and is also just >> damage-control, though stronger.) >> >> >>> Apps (and font makers) can also choose how to deal with presenting strings >>> of text that are marked as italic. They can choose to present visual symbols >>> to indicate begin/end, such as /this/. Or they can present it using the >>> italic variant of the font, if available. >>> >>> >> At which point, you have invented markdown. Instead of making Unicode >> declare it, just push for vendors everywhere to recognize /such notation/ as >> italics (OK, I know, you want dedicated characters for it which can't be >> confused for anything else.) >> >> >> >>> - Those who develop plain text apps (social media in particular) don't have >>> to build in a whole markup/markdown layer into their apps >>> >>> >> With the complexity of writing a social media app, a markup layer is really >> the least of the concerns when it comes to simplifying. >> >>> >>> - Misuse of math chars for pseudo-italic would likely disappear >>> >>> - The text runs between markers remain intact, so they need no special >>> treatment in searching, selecting, etc. >>> >>> - It finally, and conclusively, would end the decades of the mess in HTML >>> that surrounds <i> and <em>. >>> >>> >> Adding _another_ solution to something will *never* "conclusively end" >> anything. On a good day, you can hope it will swamp the others, but they'll >> remain at least in legacy. More likely, it will just add one more way to be >> confused and another side to the mess. (People have pointed out here about >> the difficulties of distinguishing or not-distinguishing between HTML-level >> and putative plain-text italics. And yes, that is an issue, and one that >> already exists with styling that can change case and such. As with anything, >> the question is not whether there are going to be problems, but how those >> problems weigh against potential benefits. That's an open question.) >> >> >>> My main point in suggesting that Unicode needs these characters is that >>> italic has been used to indicate specific meaning - this text is somehow >>> special - for over 400 years, and that content should be preserved in plain >>> text. >>> >>> >> There is something to this: people have been *emphasizing* text in some >> fashion or another for ages.
There is room to call this plain text. >> >> ~mark >> >> >> > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 21:14:21 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 20 Jan 2019 03:14:21 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> Message-ID: <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> (In the event that a persuasive proposal presentation prompts the possibility of italics encoding...) Possible approaches include: 1 - Liberating the italics from the Members Only Math Club ...which has been an ongoing practice since they were encoded.? It already works, but the set is incomplete and the (mal)practice is frowned upon.? Many of the older "shortcomings" of the set can now be overcome with combining diacritics.? These italics decompose to ASCII. 2 - Character level Variation selectors work with today's tech.? Default ignorable property suggests that apps that don't want to deal with them won't.? Many see VS as pseudo-encoding.? Stripping VS leaves ASCII behind. 3 - Open/Close punctuation treatment Stateful.? Works on ranges.? Not currently supported in plain-text. Could be supported in applications which can take a text string URL and make it a clickable link.? Default appearance in nonsupporting apps may resemble existing plain-text italic kludges such as slashes.? The ASCII is already in the character string. 4 - Leave it alone This approach requires no new characters and represents the default condition.? ASCII. - Number 1 would require that anything not already covered would have to be eventually proposed and accepted, 2 would require no new characters at all, and 3 would require two control characters for starters. As "food for thought" questions, if a persuasive case is presented for encoding italics, and excluding 4, which approach would have the least impact on the rich-text world?? Which would have the least impact on existing plain-text technology?? Which would be least likely to conflict with Unicode principles/encoding model? From unicode at unicode.org Sat Jan 19 23:30:39 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 20 Jan 2019 05:30:39 +0000 Subject: Encoding italic In-Reply-To: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> Message-ID: <20190120053039.5d98f9a7@JRWUBU2> On Fri, 18 Jan 2019 10:51:18 -0500 "Mark E. Shoulson via Unicode" wrote: > On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote: > > > > Encoding 'begin italic' and 'end italic' would introduce > > difficulties when partial strings are moved, etc. But that's no > > different than with current punctuation. If you select the second > > half of a string that includes an end quote character you end up > > with a mismatched pair, with the same problems of interpretation as > > selecting the second half of a string including an 'end italic' > > character. Apps have to deal with it, and do, as in code editors. > > > It kinda IS different.? 
If you paste in half a string, you get a > mismatched or unmatched paren or quote or something.? A typo, but a > transient one.? It looks bad where it is, but everything else is > unaffected.? It's no worse than hitting an extra key by mistake. If > you paste in a "begin italic" and miss the "end italic", though, then > *all* your text from that point on is affected!? (Or maybe "all until > a newline" or some other stopgap ending, but that's just > damage-control, not damage-prevention.)? Suddenly, letters and > symbols five words/lines/paragraphs/pages look different, the > pagination is all altered (by far more than merely a single extra > punctuation mark, since italic fonts generally are narrower than > roman).? It's a disaster. The problem is worst when you have a small amount of italicisable text scattered within unitalicisable text. Unlike the case with bidi controls, the text usually remains intelligible with some work, and one can generally see where the missing italic should go. However, damage-limitation is desirable - I would suggest cancelling effects at the end of paragraph, as with bidi controls. On the other hand, the corresponding stateful ISCII character settings (for font effects and script) are ended at the end of line, which might be a finer concept. There are several stateful control characters for Arabic, mostly affecting numbers. However, as far as I can see, their effect is limited to one word (typically a string of digits). That seems too limited for italics, though it would be reasonable for switching between Antiqua and black letter. One minor problem with the stateful encoding, which seems to be in the original spirit of ISO 10646, is that redundant instances of the italic controls would build up in heavily edited text. I see that effect with ZWSP when I don't have a display mode that shows it. One solution would be for tricks such as "start italic" having a visible glyph in italic mode when the contrast between italic and non-italic mode is displayed. I don't believe italicity should be nested. However, such a build-up is a very minor problem. Richard. From unicode at unicode.org Sat Jan 19 23:49:04 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 20 Jan 2019 05:49:04 +0000 Subject: Encoding italic In-Reply-To: <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: <20190120054904.0e587666@JRWUBU2> On Sun, 20 Jan 2019 03:14:21 +0000 James Kass via Unicode wrote: > (In the event that a persuasive proposal presentation prompts the > possibility of italics encoding...) The use of italic script isn't just restricted to the Latin script, which includes base characters not supported by the mathematical sets for variables. It isn't hard to find their sober use in Thai - I found it in the first Thai magazine I flipped, where it was being used for quotations and names of publication, both Thai and English-language titles. > Possible approaches include: > > 1 - Liberating the italics from the Members Only Math Club Doesn't help with Thai. > 2 - Character level Works with Thai. > 3 - Open/Close punctuation treatment Works with Thai. > 4 - Leave it alone No change. Richard. 
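As a concrete illustration of why option 1 in the list Richard is replying to (the mathematical alphanumerics) only helps Latin and Greek, here is a minimal Python sketch. It is an illustration added for clarity, not anything proposed in the thread: it maps Basic Latin letters onto the Mathematical Italic block, and anything outside A-Z/a-z, such as Thai, simply passes through unchanged, which is exactly the limitation noted above.

    def to_math_italic(text):
        # Map A-Z and a-z onto the Mathematical Italic letters starting at
        # U+1D434 (capitals) and U+1D44E (small letters).
        out = []
        for ch in text:
            if 'A' <= ch <= 'Z':
                out.append(chr(0x1D434 + ord(ch) - ord('A')))
            elif 'a' <= ch <= 'z':
                # U+1D455 is a reserved "hole"; italic small h is U+210E PLANCK CONSTANT.
                out.append('\u210E' if ch == 'h' else chr(0x1D44E + ord(ch) - ord('a')))
            else:
                out.append(ch)  # Thai, Cyrillic, digits, punctuation: left as-is
        return ''.join(out)

    print(to_math_italic("emphasis here"))   # 𝑒𝑚𝑝ℎ𝑎𝑠𝑖𝑠 ℎ𝑒𝑟𝑒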
From unicode at unicode.org Sun Jan 20 04:35:19 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Sun, 20 Jan 2019 10:35:19 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: On Sun, 20 Jan 2019 at 03:16, James Kass via Unicode wrote: > > Possible approaches include: > > 3 - Open/Close punctuation treatment > Stateful. Works on ranges. Not currently supported in plain-text. > Could be supported in applications which can take a text string URL and > make it a clickable link. Default appearance in nonsupporting apps may > resemble existing plain-text italic kludges such as slashes. The ASCII > is already in the character string. A possibility that I don't think has been mentioned so far would be to use the existing tag characters (E0020..E007F). These are no longer deprecated, and as they are used in emoji flag tag sequences, software already needs to support them, and they should just be ignored by software that does not support them. The advantages are that no new characters need to be encoded, and they are flexible so that tag sequences for start/end of italic, bold, fraktur, double-struck, script, sans-serif styles could be defined. For example, start and end of italic styling could be defined as the tag sequences <i> and </i> (E003C E0069 E003E and E003C E002F E0069 E003E). Andrew From unicode at unicode.org Sun Jan 20 16:13:08 2019 From: unicode at unicode.org (=?utf-8?B?IkouwqBTLiBDaG9pIg==?= via Unicode) Date: Sun, 20 Jan 2019 17:13:08 -0500 Subject: Loose character-name matching In-Reply-To: <20190119005316.7fbb0469@JRWUBU2> References: <60797095-B703-4770-8F85-F045DDED4431@icloud.com> <20190119005316.7fbb0469@JRWUBU2> Message-ID: Thanks for the reply. These answers make sense. However, I am still confused by that passage from the Standard in § 4.8. To review, it says: "Because Unicode character names do not contain any underscore ("_") characters, a common strategy is to replace any hyphen-minus or space in a character name by a single "_" when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages. Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching." How is this system supposed to encode names with non-medial hyphens (or U+116C's name)? Many (most?) programming languages disallow both spaces and hyphens in identifiers. For instance, among the most-popular programming languages as ranked by TIOBE, *none* of them allow hyphens in identifiers as far as I can tell, and many of them (e.g., C, Python, MATLAB) do not allow *any* other ASCII identifier characters, including the dollar sign $. Does this mean that it would be impossible to create valid identifiers in these popular programming languages for characters with non-medial hyphens (or U+1180 HANGUL JUNGSEONG O-E), contrary to the Standard's claim in § 4.8? One system of making valid identifiers in those languages is to make the underscore equivalent to hyphen-minus and then use camel case on space-separated words.
For instance: hangulJungseongOE for U+116C HANGUL JUNGSEONG OE, hangulJungseongO_E for U+1180 HANGUL JUNGSEONG O-E, tibetanLetterA for U+0F68 TIBETAN LETTER A, tibetanLetter_A for U+0F60 TIBETAN LETTER -A. A second albeit clunky method is to make the double underscore equivalent to a space then hyphen-minus (or vice versa) and then use single underscores on space-separated words. For instance: Hangul_Jungseong_OE for U+116C HANGUL JUNGSEONG OE, Hangul_Jungseong_O__E for U+1180 HANGUL JUNGSEONG O-E, Tibetan_Letter_A for U+0F68 TIBETAN LETTER A, Tibetan_Letter__A for U+0F60 TIBETAN LETTER -A. Lastly, if the programming language allows the dollar sign $ to be in identifiers, as several such as Java and JavaScript do, then the dollar sign could be used instead of the underscore: hangulJungseongOE for U+116C HANGUL JUNGSEONG OE, hangulJungseongO$E for U+1180 HANGUL JUNGSEONG O-E, tibetanLetterA for U+0F68 TIBETAN LETTER A, tibetanLetter$A for U+0F60 TIBETAN LETTER -A. Or: Hangul_Jungseong_OE for U+116C HANGUL JUNGSEONG OE, Hangul_Jungseong_O$E for U+1180 HANGUL JUNGSEONG O-E, Tibetan_Letter_A for U+0F68 TIBETAN LETTER A, Tibetan_Letter_$A for U+0F60 TIBETAN LETTER -A. Unfortunately, the first and second systems are not compatible with loose matching as prescribed by UAX44-LM2, so I daresay that they are not what the Standard's claim in § 4.8 has in mind. (The second system also assumes that there are no two characters whose names differ only by switching the positions of a space and an adjacent hyphen, which cannot be guaranteed forever without a stability policy.) But the third system is not possible in numerous popular programming languages (C, Python, etc.). How is the Standard's system in § 4.8 supposed to encode names with non-medial hyphens (or U+116C's name)? Oh, wait, I get it. This system is not supposed to necessarily be compatible with standard loose matching. I had the impression that they were supposed to be compatible, but rereading the original paragraph shows that it doesn't actually mention loose matching, which is explained elsewhere in the chapter. That's unfortunate. Thanks again for your help. > On Jan 18, 2019, at 7:53 PM, Richard Wordingham via Unicode wrote: > > On Thu, 17 Jan 2019 18:44:50 -0500 > "J. S. Choi" via Unicode wrote: > >> I'm implementing a Unicode names library. I'm confused about loose >> character-name matching, even after rereading The Unicode Standard § >> 4.8, UAX #34 § 4, #44 § 5.9.2 -- as well as >> [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt >> ), >> [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035 >> ), and >> the [meeting in which those two items were >> resolved](https://www.unicode.org/L2/L2014/14026.htm >> ). >> >> In particular, I'm confused by the claim in The Unicode Standard § >> 4.8 saying, "Because Unicode character names do not contain any >> underscore ("_") characters, a common strategy is to replace any >> hyphen-minus or space in a character name by a single "_" when >> constructing a formal identifier from a character name. This strategy >> automatically results in a syntactically correct identifier in most >> formal languages. Furthermore, such identifiers are guaranteed to be >> unique, because of the special rules for character name matching." > > Unfortunately, the loose matching rules don't distinguish '__' and > '_'. Note that '__' is sometimes forbidden in identifiers. > >> I'm also confused by the relationship between UAX34-R3 and UAX44-LM2. >> >> To make these issues concrete, let's say that my library provides a >> function called getCharacter that takes a name argument, tries to >> find a loosely matching character, and then returns it (or a null >> value if there is no currently loosely matching character). So then >> what should the following expressions return? >> > Loose matching of names may be looser than prescribed; it shall not be > stricter. > >> getCharacter("HANGUL-JUNGSEONG-O-E") > U+1180 HANGUL JUNGSEONG O-E, or just possibly null. > >> getCharacter("HANGUL_JUNGSEONG_O_E") > U+116C HANGUL JUNGSEONG OE*
>> >> To make these issues concrete, let?s say that my library provides a >> function called getCharacter that takes a name argument, tries to >> find a loosely matching character, and then returns it (or a null >> value if there is no currently loosely matching character). So then >> what should the following expressions return? >> > Loose matching of names may be looser than prescribed; it shall not be > stricter. > >> getCharacter(?HANGUL-JUNGSEONG-O-E?) > U+1180 HANGUL JUNGSEONG O-E, or just possibly null. > >> getCharacter(?HANGUL_JUNGSEONG_O_E?) > U+116C HANGUL JUNGSEONG OE* > >> getCharacter(?HANGUL_JUNGSEONG_O_E_?) > U+116C > >> getCharacter(?HANGUL_JUNGSEONG_O__E?) > U+116C > >> getCharacter(?HANGUL_JUNGSEONG_O_-E?) > U+1180 > >> getCharacter(?HANGUL JUNGSEONGCHARACTERO E?) > null or U+116C - up to you. The sequence 'CHARACTER' shall not > distinguish names, but loose matching is not required to know this fact. > >> getCharacter(?HANGUL JUNGSEONG CHARACTER OE?) > null or U+116C - up to you. > >> getCharacter(?TIBETAN_LETTER_A?) > U+0F68 TIBETAN LETTER A > >> getCharacter(?TIBETAN_LETTER__A?) > U+0F68 TIBETAN LETTER A** > >> getCharacter(?TIBETAN_LETTER _A?) > U+0F68 > >> getCharacter(?TIBETAN_LETTER_-A?) > U+0F60 TIBETAN LETTER -A > > *This is unfortunate, as the usual symbolic name for U+1180 would be > HANGUL_JUNGSEONG_O_E. > > **This is also unfortunate, as the usual symbolic > name for U+0F60 would be TIBETAN_LETTER__A. > > The key problem here is that the hyphen after a space is required in > names as understood by the name property. The hyphen is also required > in "HANGUL JUNGSEONG O-E". The simple tactic is: > > 1) Canonicalise, by stripping out spaces, underscores and medial > hyphens and lowercasing. (It's probably better to fold the character > U+0131 LATIN SMALL LETTER I' to 'i'.) > > 2) Look the result up. > > 3) If you get the result U+116C but the input matches > ".*[oO]-[eE][_- ]*$", convert to U+1180. > > Symbolic identifiers in programs need not match the name; one may > choose to depend on the compiler or interpreter to catch duplicates; > some will, some won't. Replacing '-' by '_' to convert a name to an > identifier looses the distinction between a hyphen and an arbitrarily > inserted space, > > Richard. > From unicode at unicode.org Sun Jan 20 16:49:23 2019 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Sun, 20 Jan 2019 14:49:23 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: I think the real solution is for Twitter to just implement basic styling and make this a moot point. On Sun, Jan 20, 2019 at 2:37 AM Andrew West via Unicode wrote: > On Sun, 20 Jan 2019 at 03:16, James Kass via Unicode > wrote: > > > > Possible approaches include: > > > > 3 - Open/Close punctuation treatment > > Stateful. Works on ranges. Not currently supported in plain-text. > > Could be supported in applications which can take a text string URL and > > make it a clickable link. Default appearance in nonsupporting apps may > > resemble existing plain-text italic kludges such as slashes. The ASCII > > is already in the character string. 
> > A possibility that I don't think has been mentioned so far would be to > use the existing tag characters (E0020..E007F). These are no longer > deprecated, and as they are used in emoji flag tag sequences, software > already needs to support them, and they should just be ignored by > software that does not support them. The advantages are that no new > characters need to be encoded, and they are flexible so that tag > sequences for start/end of italic, bold, fraktur, double-struck, > script, sans-serif styles could be defined. For example start and end > of italic styling could be defined as the tag sequences and > (E003C E0069 E003E and E003C E002F E0069 E003E). > > Andrew > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 20 16:55:34 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 20 Jan 2019 22:55:34 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> On 2019-01-20 10:49 PM, Garth Wallace wrote: > I think the real solution is for Twitter to just implement basic > styling and make this a moot point. At which time it would only become a moot point for Twitter users.? There's also Facebook and other on-line groups.? Plus scholars and linguists.? And interoperability. From unicode at unicode.org Sun Jan 20 18:52:28 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sun, 20 Jan 2019 19:52:28 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: On 1/19/19 10:14 PM, James Kass via Unicode wrote: > > (In the event that a persuasive proposal presentation prompts the > possibility of italics encoding...) > Possible approaches include: > > 1 - Liberating the italics from the Members Only Math Club > ...which has been an ongoing practice since they were encoded.? It > already works, but the set is incomplete and the (mal)practice is > frowned upon.? Many of the older "shortcomings" of the set can now be > overcome with combining diacritics.? These italics decompose to ASCII. Provides italics the same way that ASCII provides letters.? You can use them with any alphabet you want, as long as it's Latin.? (Or Greek, true).? Essentially requires doubling of huge chunks of the Unicode repetoire. > 2 - Character level > Variation selectors work with today's tech.? Default ignorable > property suggests that apps that don't want to deal with them won't.? > Many see VS as pseudo-encoding.? Stripping VS leaves ASCII behind. This, or something like this, is IMO the only possibility that has any chance at all. > > As "food for thought" questions, if a persuasive case is presented for > encoding italics, and excluding 4, which approach would have the least > impact on the rich-text world?? 
Which would have the least impact on > existing plain-text technology?? Which would be least likely to > conflict with Unicode principles/encoding model? #2. ~mark From unicode at unicode.org Sun Jan 20 20:38:09 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 20 Jan 2019 18:38:09 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> Message-ID: On Sun, Jan 20, 2019 at 2:57 PM James Kass via Unicode wrote: > At which time it would only become a moot point for Twitter users. > There's also Facebook and other on-line groups. Plus scholars and > linguists. And interoperability. > How do you envision this working? In practice, English is still often limited to ASCII, because smart quotes and dashes aren't on the top-level of the keyboard, nor are accented characters. Adding italics to Unicode isn't going to change much if input tools don't support it, and keyboards aren't likely to change. Twitter and Facebook aren't going to change much if the apps and webpages don't provide a tool to mark italics. I don't see scholars and linguists demanding this. Scholars use markup languages that can annotate the details they need annotated, far more than just italics. Various dialects of SGML, XML and TeX do the job, not plain text. You've yet to demonstrate that interoperability is an actual problem. Modern operating systems have ways of copying rich text including italics around. Maybe it would have been better to have standardized rich text, either in Unicode or in a standard layer above Unicode, back in 1991. But that train has left; you're just going to complicate systems that currently handle and exchange rich text including italics. To expand on what Mark E. Shoulson said, to add new italics characters, you're going to need to not only copy all of Latin, but also Cyrillic (and reopen the whole Macedonian italics argument, where ?, ?, ?, ?, and ? are all different in italics from in Russian). But also, Chinese is sometimes put in italics (cf. http://multilingualtypesetting.co.uk/blog/chinese-italics-oblique-fonts/ ) even if that horrifies many people. That page argues for, among other solutions, using what's effectively bold instead of italics. So we're talking about reencoding all of Chinese at least once (for emphasis) or twice (for italics and bold). That's a clear no-go. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 20 23:42:31 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 20 Jan 2019 21:42:31 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sun Jan 20 23:49:13 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 20 Jan 2019 21:49:13 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> Message-ID: <67e80c6e-c1e9-66c2-ef44-290a05e1bffd@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 21 01:51:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 21 Jan 2019 07:51:19 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> Message-ID: <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Responding to David Starner, It?s true that most users can?t be troubled to take the extra time needed to insert any kind of special characters which aren?t covered by the keyboard.? Even the enthusiasts among us seldom take the trouble to include ?proper? quotes and apostrophes in e-mails ? even for posting to specialized lists such as this one where other members might notice and appreciate the extra effort involved.? Even though /we/ know how to do it and have software installed to help us do it. It?s also true that standard U.S. keyboards and drivers aren?t very helpful with diacritics.? Yet when we reply to list colleagues with surnames such as ?D?rst? or ?Bie??, we usually manage to get it right.? Sure, the ?reply? feature puts the surname into the response for us and the e-mail software adds the properly spelled names into our address books automatically.? But when we cite those colleagues in a post replying to some other list member, we typically take the time and trouble to write their names correctly.? Not only because we /can/, but because we /should/. > How do you envision this working? Splendidly!? (smile)? Social platforms, plain-text editors, and other applications do enhance their interfaces based on user demand from time to time.? User demand, at least on Twitter, seems established.? As pointed out previously in this discussion, that demand doesn?t seem to result in much ?Chicago style? text (although I have personally observed some) and may only be a passing fad /for Twitter users/.? When corporate interests aren't interested, third-party developers develop tools. > You've yet to demonstrate that interoperability is an actual problem. Copy/pasting from a web page into a plain-text editor removes any italics content, which is currently expected behavior.? Opinions differ as to whether that represents mere format removal or a loss of meaning.? Those who consider it as a loss of meaning would perceive a problem with interoperability. Consider superscript/subscript digits as a similar styling issue. The Wikipedia page for Romanization of Chinese includes information about the Wade-Giles system?s tone marks, which are superscripted digits. 
https://en.wikipedia.org/wiki/Romanization_of_Chinese Copy/pasting an example from the page into plain-text results in ?ma1, ma2, ma3, ma4?, although the web page displays the letters as italic and the digits as (italic) superscripts.? IMO, that?s simply wrong with respect to the superscript digits and suboptimal with respect to the italic letters. > To expand on what Mark E. Shoulson said, to add new italics characters, > you're going to need to not only copy all of Latin, but also Cyrillic ... I quite agree that expanding atomic italic encoding is off the table at this point.? (And that italicized CJK ideographs are daft.) From unicode at unicode.org Mon Jan 21 02:29:24 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 21 Jan 2019 08:29:24 +0000 (GMT) Subject: Encoding italic References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: On 2019-01-21, James Kass via Unicode wrote: > Consider superscript/subscript digits as a similar styling issue. The > Wikipedia page for Romanization of Chinese includes information about > the Wade-Giles system?s tone marks, which are superscripted digits. > > https://en.wikipedia.org/wiki/Romanization_of_Chinese > > Copy/pasting an example from the page into plain-text results in ?ma1, > ma2, ma3, ma4?, although the web page displays the letters as italic and > the digits as (italic) superscripts.? IMO, that?s simply wrong with > respect to the superscript digits and suboptimal with respect to the > italic letters. Wade-Giles (which should be written with an en-dash, not a hyphen, if we're going to be fussy - as indeed Wikipedia is) is obsolete, but one could say the same about pinyin. However, printed pinyin with tones almost invariably uses the combining diacritics; in email where most people can't be bothered to write diacritics, tone numbers are written just as you have written above, with a following ascii digit. (With the proviso that Chinese speakers don't usually write tones at all when they write in pinyin.) They're often written like that even in web pages, where superscripts would be easy - see Victor Mair's frequent Language Log posts about Chinese writing and printing. This seems significantly less wrong to me that writing H2SO4 for H2SO4 which is also common in plain text... -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
From unicode at unicode.org Mon Jan 21 02:29:42 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 21 Jan 2019 00:29:42 -0800 Subject: Encoding italic In-Reply-To: <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: On Sun, Jan 20, 2019 at 11:53 PM James Kass via Unicode wrote: > Even though /we/ know how to do > it and have software installed to help us do it. You're emailing from Gmail, which has support for italics in email. The world has, in general, solved this problem. > > How do you envision this working? > > Splendidly! (smile) Social platforms, plain-text editors, and other > applications do enhance their interfaces based on user demand from time > to time. User demand, at least on Twitter, seems established. Then it would take six months, tops, for Twitter to produce and release a rich-text interface for Twitter. Far less time than waiting for Unicode to get around to it. > When corporate > interests aren't interested, third-party developers develop tools. Where are these tools? As I said, third-party developers could develop tools to convert a _underscore_ or /slash/ style italics to real italics and back without waiting on Twitter or Unicode. > Copy/pasting from a web page into a plain-text editor removes any > italics content, which is currently expected behavior. Opinions differ > as to whether that represents mere format removal or a loss of meaning. > Those who consider it as a loss of meaning would perceive a problem with > interoperability. Copy/pasting from a web page into a plain-text editor removes any pictures and destuctures tables, which definitely loses meaning. It also removes strike-out markup, which can have an even more dramatic effect on meaning than removing italics. As you pointed out below, it removes superscripts and subscripts; unless you wish to press for automatic conversion of those to Unicode, that's going to continue happening. It drops bold and font changes, and any number of other things that can carry meaning. > Copy/pasting an example from the page into plain-text results in ?ma1, > ma2, ma3, ma4?, although the web page displays the letters as italic and > the digits as (italic) superscripts. IMO, that?s simply wrong with > respect to the superscript digits and suboptimal with respect to the > italic letters. The superscripts show a problem with multiple encoding; even if you think they should be Unicode superscripts, and they look like Unicode superscripts, they might be HTML superscripts. Same thing would happen with italics if they were encoded in Unicode. -- Kie ekzistas vivo, ekzistas espero. 
From unicode at unicode.org Mon Jan 21 04:29:11 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 21 Jan 2019 10:29:11 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: <4922cf75-c0f2-7458-81ad-1e421a155c90@gmail.com> David Starner wrote, > You're emailing from Gmail, which has support for italics in email. But I compose e-mails in BabelPad, which has support for far more than italics in HTML mail.? And I'm using Mozilla Thunderbird to send and receive text e-mail via the Gmail account. And if I wanted to /display/ italics in a web page, I would create the source file in a plain-text editor.? (HTML mark-up is fairly easy to type with the ASCII keyboard.) If I compose a text file in BabelPad, it can be opened in many rich-text applications and the information survives intact.? Unless I am foolish enough to edit the file in the rich-text application and file-save it.? Because that mungs the plain-text file, and it can no longer be retrieved by the plain-text editor which created it. >> ...third-party... > > Where are these tools? BabelPad is an outstanding example.? Earlier in this discussion a web search found at least a handful of third-party tools devoted to liberating the math-alphas for Twitter users. > The superscripts show a problem with multiple encoding; even if you > think they should be Unicode superscripts, and they look like Unicode > superscripts, they might be HTML superscripts. Same thing would happen > with italics if they were encoded in Unicode. Hmmm.? Rich-text styled italics might be copied into other rich-text applications, but they cannot be copied into plain-text apps.? If Unicode-enabled italics existed, plain-text italics could be copy/pasted into either rich-text or plain-text applications and survive intact.? So Unicode-enabled italics would be interoperable. Anyone concerned about interoperability would be well advised to go with plain-text.? I am, so I do.? When I can. Kie eksistas fumo, tie eksistas fajro. From unicode at unicode.org Mon Jan 21 14:31:46 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 21 Jan 2019 13:31:46 -0700 Subject: Encoding italic Message-ID: <20190121133146.665a7a7059d7ee80bb4d670165c8327d.bc773d11ee.wbe@email03.godaddy.com> James Kass wrote: > Even the enthusiasts among us seldom take the trouble to include > ?proper? quotes and apostrophes in e-mails ? even for posting to > specialized lists such as this one where other members might notice > and appreciate the extra effort involved. Well, definitely not to this list, since the digest will clobber such characters (quod vide). 
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Jan 21 14:46:56 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 21 Jan 2019 13:46:56 -0700 Subject: Encoding italic (was: A last missing link) Message-ID: <20190121134656.665a7a7059d7ee80bb4d670165c8327d.7d41c065d6.wbe@email03.godaddy.com> Kent Karlsson wrote: > There is already a standardised, "character level" (well, it is from > a character standard, though a more modern view would be that it is > a higher level protocol) way of specifying italics (and bold, and > underline, and more): > > \u001b[3mbla bla bla\u001b[0m > > Terminal emulators implement some such escape sequences. And indeed, the forthcoming Unicode Technical Note we are going to be writing to supplement the introduction of the characters in L2/19-025, whether next year or later, will recommend ISO 6429 sequences like this to implement features like background and foreground colors, inverse video, and more, which are not available as plain-text characters. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Jan 22 00:40:52 2019 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Tue, 22 Jan 2019 07:40:52 +0100 Subject: Encoding italic In-Reply-To: References: <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: <20190122064052.dh2ofinavzflrx2f@angband.pl> On Mon, Jan 21, 2019 at 12:29:42AM -0800, David Starner via Unicode wrote: > On Sun, Jan 20, 2019 at 11:53 PM James Kass via Unicode > wrote: > > Even though /we/ know how to do > > it and have software installed to help us do it. > > You're emailing from Gmail, which has support for italics in email. ... and how exactly can they send italics in an e-mail? All they can do is to bundle a web page as an attachment, which some clients display instead of the main text. The e-mail's body text supports anything Unicode does, including ???????????? and even ?????? ????????????, but, remarkably, not italic umlauted characters, thai nor han. > > Splendidly! (smile) Social platforms, plain-text editors, and other > > applications do enhance their interfaces based on user demand from time > > to time. User demand, at least on Twitter, seems established. > > Then it would take six months, tops, for Twitter to produce and > release a rich-text interface for Twitter. Far less time than waiting > for Unicode to get around to it. Similar to many mail clients, Twitter does have a rich-text interface. It will present that rich-text as a link -- it will even has specific support to reduce the full URL to conserve the character count. But the primary interface is plain text, which unlike anything "rich" is interoperable with pretty much anything. > > Copy/pasting from a web page into a plain-text editor removes any > > italics content, which is currently expected behavior. Opinions differ > > as to whether that represents mere format removal or a loss of meaning. > > Those who consider it as a loss of meaning would perceive a problem with > > interoperability. > > Copy/pasting from a web page into a plain-text editor removes any > pictures and destuctures tables, which definitely loses meaning. 
> > It also removes strike-out markup, which can have an even more > dramatic effect on meaning than removing italics. As you pointed out > below, it removes superscripts and subscripts; unless you wish to > press for automatic conversion of those to Unicode, that's going to > continue happening. It drops bold and font changes, and any number of > other things that can carry meaning. Ie, any non-standard additions. There's a common base that's supposed to be interoperable, developed by a certain consortium -- and that base is pretty much guaranteed to work everywhere. Even if a specific display engine can't display some fancier elements, at least the underlying transport will transfer the text unmolested. There still are some issues here and there (like eg. people rejecting UCS2/UTF-16 on Windows which Microsoft insisted on, thus UTF-8 as system encoding is a new thing there and AFAIK even not the default yet AFAIK) -- but pretty much we're there. Last holdouts of ancient encodings are dying fast. There's a need to agree on a boundary between "this is what all means of interchange are supposed to support" and "fancy client-specific markup", and Unicode served at defining the former admirably. Meow! -- ??????? ??????? Remember, the S in "IoT" stands for Security, while P stands ??????? for Privacy. ??????? From unicode at unicode.org Tue Jan 22 11:52:36 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 22 Jan 2019 17:52:36 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <65dfde71.ee61.16876ac0481.Webtop.70@btinternet.com> References: <20190121134656.665a7a7059d7ee80bb4d670165c8327d.7d41c065d6.wbe@email03.godaddy.com> <2072693815.420691.1548178041006.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <65dfde71.ee61.16876ac0481.Webtop.70@btinternet.com> Message-ID: <4b230aec.ee71.16876b147ff.Webtop.70@btinternet.com> Doug Ewell wrote: > And indeed, the forthcoming Unicode Technical Note we are going to be writing to supplement the introduction of the characters in L2/19-025, whether next year or later, will recommend ISO 6429 sequences like this to implement features like background and foreground colors, inverse video, and more, which are not available as plain-text characters. Back in the late 1980s I had the opportunity for some time, from time to time, to use a colour terminal that was attached to a mainframe computer as if it were just another basic terminal attached to a mainframe. So it could be used just as a basic terminal attached to a mainframe, and it was often used in that manner. Yet it also responded to Escape sequences which enabled it to do colour graphics, with, as best I remember now, commands to choose a colour and draw lines and so on. I note with interest Doug's suggestion to use Escape routines. However, these days systems tend to be more complicated at the underlying platform level and there is often communication between systems and so on and I wonder whether using Escape codes as such might be prone to strange problems in some circumstances before getting to the emulator software. With various platforms in common use I am wondering whether there might be problems in some cases. Maybe there is no issue and everything would be fine, yet I opine that that possibility of problems need to be looked at. 
I wonder if a new character, say U+FFF6, in the Specials section, could be defined that could be regarded as just an ordinary printing character in many circumstances yet as having exactly the same meaning as the Escape character in some circumstances, such as in an emulator. If that were done then the desired result could be achieved in a carefully structured manner rather than risk clashes over effectively sometimes trying to use the Escape character in two ways at the same time, perhaps with one of the ways being deep in the operating system and one in the terminal emulator with the way deep in the operating system usually winning. William Overington Tuesday 22 January 2019 From unicode at unicode.org Tue Jan 22 17:26:09 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Wed, 23 Jan 2019 00:26:09 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: <20190121134656.665a7a7059d7ee80bb4d670165c8327d.7d41c065d6.wbe@email03.godaddy.com> Message-ID: Ok. One thing to note is that escape sequences (including control sequences, for those who care to distinguish those) probably should be "default ignorable" for display. Requiring, or even recommending, them to be default ignorable for other processing (like sorting, searching, and other things) may be a tall order. So, for display, (maximal) substrings that match: \u001B[\u0020-\002F]*[\u0030-\007E]| (\u001B'['|\009B)[\u0030-\003F]*[\u0020-\002F]*[\u0040-\007E] should be default ignorable (i.e. invisible, but a "show invisibles" mode would show them; not interpreted ones should be kept, even if interpreted ones need not, just (re)generated on save). That is as far as Unicode should go. Some may be interpreted, this thread focuses on italic, but also bold and underlined. There is a whole bunch of "style" control sequences (those that have "m" at the end of the sequence) specified, and terminal emulators implement several of them, but not all. As for editing, if "style" control sequences ? la ISO 6429 were to be supported in text editors, I would NOT expect users to type in those escape/control sequences in any way, but use "ctrl/command-i" (etc.) or menu commands as editors do now, and the representation as esc-sequences be kept under wraps (and maybe only present in files, not in the internal representation during editing), and not seen unless one starts to analyse the byte sequences in files. So, even if you don't like this esc-sequence business: 1) It would not be seen by most users, mostly by programmers (the same goes for other ways of representing this, be it HTML, .doc, or whatever. 2) It is already standardised, and one can make (a slightly inaccurate) argument that it is "plain text". What one would need to do is: 1) Prioritise which "style" control sequences should be interpreted (rather than be ignored). 2) Lobby to "plain" text editor makers to support those styles, representing them (in files) as standard control sequences. A selection of already standardised style codes (i.e., for control sequences that end in ?m?): 0 default rendition (implementation-defined) 1 bold (2 lean) 22 normal intensity (neither bold nor lean) 3 italicized 23 not italicized (i.e. 
upright) 4 singly underlined (21 doubly underlined) 24 not underlined (neither singly nor doubly) (9 crossed-out (strikethrough)) (29 not crossed out) If you really want to go for colour as well (RGB values in 0?255) (colour is popular in terminal emulators...): (30-37 foreground: black, red, green, yellow, blue, magenta, cyan, white) 38 foreground colour as RGB. Next arguments 2;r;g;b 39 default foreground colour (implementation-defined) (40-47 background: black, red, green, yellow, blue, magenta, cyan, white) 48 background colour as RGB. Next arguments 2;r;g;b 49 default background colour (implementation-defined) There are some more (including some that assume a small font palette, for changing font). But far enough for now. Maybe too far already. But do not allow interpreting multiple style attribute codes in one control sequence; quite unnecessary. /Kent K Den 2019-01-21 21:46, skrev "Doug Ewell via Unicode" : > Kent Karlsson wrote: > >> There is already a standardised, "character level" (well, it is from >> a character standard, though a more modern view would be that it is >> a higher level protocol) way of specifying italics (and bold, and >> underline, and more): >> >> \u001b[3mbla bla bla\u001b[0m >> >> Terminal emulators implement some such escape sequences. > > And indeed, the forthcoming Unicode Technical Note we are going to be > writing to supplement the introduction of the characters in L2/19-025, > whether next year or later, will recommend ISO 6429 sequences like this > to implement features like background and foreground colors, inverse > video, and more, which are not available as plain-text characters. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > From unicode at unicode.org Tue Jan 22 18:16:40 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 23 Jan 2019 00:16:40 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: <20190123001640.39964074@JRWUBU2> On Mon, 21 Jan 2019 00:29:42 -0800 David Starner via Unicode wrote: > The superscripts show a problem with multiple encoding; even if you > think they should be Unicode superscripts, and they look like Unicode > superscripts, they might be HTML superscripts. Same thing would happen > with italics if they were encoded in Unicode. But if one strips the mark-up out, and searching is then based on the collation elements of the text, then this is not a problem. Mathematical and ASCII capitals differ only at the identity level. Searching on the basis of codepoint sequences would come unstuck with scriptio continua scripts - WJ and ZWSP can be optionally inserted to improve line-breaking, and even to overcome spell-checkers. Richard. 
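To make Kent Karlsson's suggestion concrete, here is a minimal Python sketch, not taken from any message in this thread: it emits a few of the standardised SGR parameters he lists (3/23 for italic, 1/22 for bold, 4/24 for underline, 0 for reset) as ISO 6429 control sequences, and strips such sequences using a pattern along the lines of the one in his message, so that a display target that does not interpret them can treat them as default ignorable. The helper names are invented for this sketch, and the control-sequence alternative is placed first so that the longer match wins under Python's first-match alternation.

import re

ESC = "\x1b"
CSI = ESC + "["      # 7-bit Control Sequence Introducer; U+009B is the 8-bit form

# A few of the standardised SGR (Select Graphic Rendition) parameters:
SGR = {"reset": 0, "bold": 1, "italic": 3, "underline": 4,
       "not_bold": 22, "not_italic": 23, "not_underlined": 24}

def sgr(*params):
    # Build a control sequence such as ESC [ 3 m ("\x1b[3m") for italic.
    return CSI + ";".join(str(p) for p in params) + "m"

def italic(text):
    # "\x1b[3mbla bla bla\x1b[0m", as in the example quoted by Doug Ewell above.
    return sgr(SGR["italic"]) + text + sgr(SGR["reset"])

# Reconstruction of the pattern from Kent Karlsson's message, assuming the
# intended \u00XX escapes: a control sequence (ESC '[' or U+009B, parameter
# bytes, intermediate bytes, final byte), or a plain escape sequence
# (ESC, intermediate bytes, final byte).
IGNORABLE = re.compile(
    "(?:\u001b\\[|\u009b)[\u0030-\u003f]*[\u0020-\u002f]*[\u0040-\u007e]"
    "|\u001b[\u0020-\u002f]*[\u0030-\u007e]"
)

def strip_controls(text):
    # What a renderer that does not interpret the sequences would display.
    return IGNORABLE.sub("", text)

styled = italic("bla bla bla")
print(repr(styled))            # '\x1b[3mbla bla bla\x1b[0m'
print(strip_controls(styled))  # bla bla bla
# RGB colour is also standardised: sgr(38, 2, 255, 0, 0) gives a red foreground.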
From unicode at unicode.org Tue Jan 22 21:43:29 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 23 Jan 2019 03:43:29 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: Nobody has really addressed Andrew West's suggestion about using the tag characters. It seems conformant, unobtrusive, requiring no official sanction, and could be supported by third-partiers in the absence of corporate interest if deemed desirable. One argument against it might be:? Whoa, that's just HTML.? Why not just use HTML?? SMH One argument for it might be:? Whoa, that's just HTML!? Most everybody already knows about HTML, so a simple subset of HTML would be recognizable. After revisiting the concept, it does seem elegant and workable. It would provide support for elements of writing in plain-text for anyone desiring it, enabling essential (or frivolous) preservation of editorial/authorial intentions in plain-text. Am I missing something?? (Please be kind if replying.) On 2019-01-20 10:35 AM, Andrew West wrote: > A possibility that I don't think has been mentioned so far would be to > use the existing tag characters (E0020..E007F). These are no longer > deprecated, and as they are used in emoji flag tag sequences, software > already needs to support them, and they should just be ignored by > software that does not support them. The advantages are that no new > characters need to be encoded, and they are flexible so that tag > sequences for start/end of italic, bold, fraktur, double-struck, > script, sans-serif styles could be defined. For example start and end > of italic styling could be defined as the tag sequences and > (E003C E0069 E003E and E003C E002F E0069 E003E). > > Andrew From unicode at unicode.org Tue Jan 22 20:24:59 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 22 Jan 2019 18:24:59 -0800 Subject: Encoding italic In-Reply-To: <20190123001640.39964074@JRWUBU2> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> <20190123001640.39964074@JRWUBU2> Message-ID: On Tue, Jan 22, 2019 at 4:18 PM Richard Wordingham via Unicode wrote: > On Mon, 21 Jan 2019 00:29:42 -0800 > David Starner via Unicode wrote: > > > The superscripts show a problem with multiple encoding; even if you > > think they should be Unicode superscripts, and they look like Unicode > > superscripts, they might be HTML superscripts. Same thing would happen > > with italics if they were encoded in Unicode. > > But if one strips the mark-up out, and searching is then based on > the collation elements of the text, then this is not a problem. > Mathematical and ASCII capitals differ only at the identity level. Searching is not the only problem. Copying the data will reveal the same problem. 
Not only that, there was a previous argument that searching with Unicode italics would let you find titles of books and such separately from other usage of the phrase. That's not going to work if they're based on the collation elements and ignore the italics. Which also brings up the question of, if this is so important, why can't we search for italicized data in web pages right now? For anyone interacting with a web-browser that folds searching, this will change nothing, until if and when italics-sensitive searching is made available by the web-browser, which is not depending on Unicode supporting italics. There are programs that extract titles from text files; I suspect the programmers are most happy working with text formats that mark up titles as titles, not italics. In systems that just mark up italics, translating whatever form of italics marking is used is much easier than separating italicized titles from other forms of italics. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Wed Jan 23 20:07:48 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 23 Jan 2019 21:07:48 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> On 1/19/19 1:19 PM, wjgo_10009 at btinternet.com via Unicode wrote: > > Well, a variation sequence character is being used for requesting > emoji display (is that a control code?), so it seems there is no lack > of precedent to use one for italics. It seems that someone only has to > say 'out of scope' and then that is the veto for any consideration of > a new idea for ISO/IEC 10646 or The Unicode Standard. There seems to > be no way for a request to the committee to consider a widening of the > scope to even be put before the committee if such a request is from > someone outside the inner circle. You make it sound like there's been invented some magical incantation that *anyone* can use to quash all discussion on a particular (your) topic.? It doesn't just take someone saying "out of scope."? It also has to *be* out of scope!? If someone chants the incantation, but I can persuasively argue that no, it IS in scope, then the spell fails.? Requesting the scope of Unicode be widened is not like other discussions being had here, so it makes sense that it should be treated differently, if treated at all. There were discussions and agreements made as to the scope of Unicode, long ago.? And just like you can't petition to change a character name, no matter how wrong it is, asking the Unicode consortium to redefine itself on your say-so is not going to be taken seriously either.? Out of scope means just that: it isn't something we're discussing.? Discussing how to change the scope so that whatever-it-is IS in scope is a very large undertaking, and would need a tremendous groundswell of support from all the major stakeholders in Unicode, so you should probably start there.? Get Microsoft and Google and various national bodies on your side, not just to say "um, ok, maybe," but to actively argue with you that the scope needs to be changed.? Or that there needs to be, as Asmus says, another, supplemental standard.? 
Raise popular support, write petitions, get signatures, all that fun stuff. "But so many of the people I would want to talk to about this are right here on this list!" you say?? Be that as it may, it doesn't mean the list has to grant you a platform.? Change the world on your own dime. > > It seems to me that it would be useful to have some codes that .... See, once you start a proposal like that, you're already looking down the wrong end of the Unicode scope.? This is exactly what Asmus (I think) said in a quote I can't seem to find, repeating it for the n+1st time: Unicode isn't here to encode cool new ideas that would be cool and new.? It's here for writing what people already do.? You want a standard that does something else?? That's another thing.? It's as appropriate to demand that Unicode support these things as it would be to go to OSHA or the Bureau of Weights and Measures or the Acad?mie Fran?aise and tell them you want some new letters... ~mark From unicode at unicode.org Wed Jan 23 20:08:31 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 23 Jan 2019 21:08:31 -0500 Subject: Encoding italic In-Reply-To: <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> Message-ID: On 1/19/19 3:34 PM, James Kass via Unicode wrote: > > On 2019-01-19 6:19 PM, wjgo_10009 at btinternet.com wrote: > > > It seems to me that it would be useful to have some codes that are > > ordinary characters in some contexts yet are control codes in > others, ... > > Italics aren't a novel concept.? The approach for encoding new > characters is that? conventions for them exist and that people *are* > exchanging them, people have exchanged them in the past, or that > people demonstrably *need* to exchange them. > > Excluding emoji, any suggestion or proposal whose premise is "It seems > to me that? it would be useful if characters supporting that>..." is doomed to be deemed out of scope for the standard. This was the quote I had been looking for, sorry James and Asmus.? It isn't the first time it's been pointed out here. ~mark From unicode at unicode.org Wed Jan 23 20:21:39 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 23 Jan 2019 21:21:39 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: Message-ID: <42afafc1-a0ab-e1f7-5954-371f174603d1@kli.org> On 1/22/19 6:26 PM, Kent Karlsson via Unicode wrote: > Ok. One thing to note is that escape sequences (including control sequences, > for those who care to distinguish those) probably should be "default > ignorable" for display. Requiring, or even recommending, them to be default > ignorable for other processing (like sorting, searching, and other things) > may be a tall order. So, for display, (maximal) substrings that match: > > \u001B[\u0020-\002F]*[\u0030-\007E]| > (\u001B'['|\009B)[\u0030-\003F]*[\u0020-\002F]*[\u0040-\007E] > > should be default ignorable (i.e. invisible, but a "show invisibles" mode > would show them; not interpreted ones should be kept, even if interpreted > ones need not, just (re)generated on save). That is as far as Unicode > should go. 
So it isn't just "these characters should be default ignorable", but "this regular expression is default ignorable."? This gets back to "things that span more than a character" again, only this time the "span" isn't the text being styled, it's the annotation to style it.? The "bash" shell has special escape-sequences (\[ and \]) to use in defining its prompt that tell the system that the text enclosed by them is not rendered and should not be counted when it comes to doing cursor-control and line-editing stuff (so you put them around, yep, the escape sequences for coloring or boldfacing or whatever that you want in your prompt). That would seem to be at least simpler than a big ol' regexp, but really not that much of an improvement.? It also goes to show how things like this require all kinds of special handling, even/especially in a "simple" shell prompt (which could make a strong case for being "plain text", though, yes, terminal escape codes are a thing.) ~mark From unicode at unicode.org Wed Jan 23 20:32:55 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 23 Jan 2019 21:32:55 -0500 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: There is something deliciously simple, elegant... and kinda... rebellious? about doing this.? And it wouldn't even be in purview of Unicode.? "Yep, my HTML-renderer treats characters E0020..E007F just exactly the same 0020..007F, 'cept that it won't render 'em."? And you can send HTML text that looks for all the world like plain text to any normal Unicode-conformant viewer.? Now, the security issues of being able to write "invisible" JavaScript, or rather, Yet Another way you need to look at and reveal possible code, are a headache for someone else.? Viewed like this, you might do better taking this suggestion to W3C and having them amend the HTML/XML specs so that E0020..E007F are non-rendering synonyms for 0020..007F.? It wouldn't be a Unicode thing anymore, just changing the definition of HTML.? (I'm not saying it would be a GOOD idea, mind you.) ~mark On 1/22/19 10:43 PM, James Kass via Unicode wrote: > > Nobody has really addressed Andrew West's suggestion about using the > tag characters. > > It seems conformant, unobtrusive, requiring no official sanction, and > could be supported by third-partiers in the absence of corporate > interest if deemed desirable. > > One argument against it might be:? Whoa, that's just HTML.? Why not > just use HTML?? SMH > > One argument for it might be:? Whoa, that's just HTML!? Most everybody > already knows about HTML, so a simple subset of HTML would be > recognizable. > > After revisiting the concept, it does seem elegant and workable. It > would provide support for elements of writing in plain-text for anyone > desiring it, enabling essential (or frivolous) preservation of > editorial/authorial intentions in plain-text. > > Am I missing something?? (Please be kind if replying.) > > On 2019-01-20 10:35 AM, Andrew West wrote: > >> A possibility that I don't think has been mentioned so far would be to >> use the existing tag characters (E0020..E007F). 
These are no longer >> deprecated, and as they are used in emoji flag tag sequences, software >> already needs to support them, and they should just be ignored by >> software that does not support them. The advantages are that no new >> characters need to be encoded, and they are flexible so that tag >> sequences for start/end of italic, bold, fraktur, double-struck, >> script, sans-serif styles could be defined. For example start and end >> of italic styling could be defined as the tag sequences and >> (E003C E0069 E003E and E003C E002F E0069 E003E). >> >> Andrew From unicode at unicode.org Thu Jan 24 05:50:49 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Thu, 24 Jan 2019 11:50:49 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: On Thu, 24 Jan 2019 at 02:10, Mark E. Shoulson via Unicode wrote: > > Unicode isn't here to encode cool new ideas that would be cool and > new. It's here for writing what people already do. http://www.unicode.org/L2/L2018/18141r2-emoji-colors.pdf "Add 14 colored emoji characters for decorative and/or descriptive uses. These may be used to indicate that an emoji has a different color." No evidence has been provided that anybody is currently using colored blobs for this purpose (in fact emoji users have explicitly rejected this method for indicating emoji color: http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an assertion that it would be a good idea if emoji users could add a colored swatch to an existing emoji to indicate what color they want it to represent (note that the colored characters do not change the color of the emoji they are attached to [before or after, depending upon whether you are speaking French or English dialect of emoji], they are just intended as a visual indication of what colour you wish the emoji was). This proposal to add 14 additional colored circles, squares and hearts is a perfect example of a cool new idea for something that the authors think would be really useful, but for which there is no evidence of existing use. The UTC should have rejected it as out of scope, but we all know that rules and procedures do not apply to the Emoji Subcommittee, so in fact this cool new idea will be included in Unicode 12 in March. Andrew From unicode at unicode.org Thu Jan 24 07:56:53 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 24 Jan 2019 13:56:53 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: Andrew West wrote, > ... 
> http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an > assertion that it would be a good idea if emoji users could add a > colored swatch to an existing emoji to indicate what color they want > it to represent (note that the colored characters do not change the > color of the emoji they are attached to [before or after, depending > upon whether you are speaking French or English dialect of emoji], > they are just intended as a visual indication of what colour you wish > the emoji was). In order to simplify emoji processing, these should be stored in the data stream in logical order.? Whether these cool new characters become reordrant color blobs or not would depend upon language.? So, what we'd need is some way of indicating language in plain-text. Some kind of tagging mechanism. FAICT, the emoji repertoire is vendor-driven, just as the pre-Unicode emoji sets were vendor driven.? Pre-Unicode, if a vendor came up with cool ideas for new emoji they added new characters to the PUA.? Now that emoji are standardized, when vendors come up with new ideas they put them in the emoji ranges in order to preserve the standardization factor and ensure interoperability.? (That's probably over-simplified and there are bound to be other factors involved.) We should no more expect the conventional Unicode character encoding model to apply to emoji than we should expect the old-fashioned text ranges to become vendor-driven. From unicode at unicode.org Thu Jan 24 08:49:59 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Thu, 24 Jan 2019 14:49:59 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: On Thu, 24 Jan 2019 at 13:59, James Kass via Unicode wrote: > > FAICT, the emoji repertoire is vendor-driven, just as the pre-Unicode > emoji sets were vendor driven. Pre-Unicode, if a vendor came up with > cool ideas for new emoji they added new characters to the PUA. Now that > emoji are standardized, when vendors come up with new ideas they put > them in the emoji ranges in order to preserve the standardization factor > and ensure interoperability. (That's probably over-simplified and there > are bound to be other factors involved.) I do not believe that recent (post-6.0) emoji additions are vendor-driven. There is no formal vendor representation on the ESC, and most ESC members do not work for vendors. Current emoji additions are driven by ordinary users, who are actively encouraged by the UTC to propose novel characters for encoding: http://blog.unicode.org/2018/04/submissions-open-for-2020-emoji.html http://blog.unicode.org/2016/09/emoji-deadline.html The vendors happily lap up whatever emojis the UTC throws at them, but they seem to have little interest in taking control of the emoji process. > We should no more expect the conventional Unicode character encoding > model to apply to emoji than we should expect the old-fashioned text > ranges to become vendor-driven. Why should we not expect the conventional Unicode character encoding mode to apply to emoji? We were told time and time again when emoji were first proposed that they were required for encoding for interoperability with Japanese telecoms whose usage had spilled over to the internet. 
At that time there was no suggestion that encoding emoji was anything other than a one-off solution to a specific problem with PUA usage by different vendors, and I at least had no idea that emoji encoding would become a constant stream with an annual quota of 60+ fast-tracked user-suggested novelties. Maybe that was the hidden agenda, and I was just na?ve. The ESC and UTC do an appallingly bad job at regulating emoji, and I would like to see the Emoji Subcommittee disbanded, and decisions on new emoji taken away from the UTC, and handed over to a consortium or committee of vendors who would be given a dedicated vendor-use emoji plane to play with (kinda like a PUA plane with pre-assigned characters with algorithmic names [VENDOR-ASSIGNED EMOJI XXXXX] which the vendors can then associate with glyphs as they see fit; and as emoji seem to evolve over time they would be free to modify and reassign glyphs as they like because the Unicode Standard would not define the meaning or glyph for any characters in this plane). Andrew From unicode at unicode.org Thu Jan 24 09:42:37 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 24 Jan 2019 15:42:37 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> Andrew West wrote, > Why should we not expect the conventional Unicode character encoding > mode to apply to emoji? Remember when William Overington used to post about encoding colours, sometimes accompanied by novel suggestions about how they could be encoded or referenced in plain-text? Here's a very polite reply from John Hudson from 2000, http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html ...and, over time, many of the replies to William Overington's colorful suggestions were less than polite.? But it was clear that colors were out-of-scope for a computer plain-text encoding standard. So I don't expect the conventional model to apply to emoji because it didn't; if it had, they'd not have been encoded.? Since they're in there, the conventional model does not apply.? Of course, the conventions have changed along with the concept of what's acceptable in plain-text. Since emoji are an open-ended evolving phenomenon, there probably has to be a provision for expansion.? Any idea about them having been a finite set overlooked the probability of open-endedness and the impracticality of having only the original subset covered in plain-text while additions would be banished to higher level protocols. Thank you for the information about current emoji additions being unrelated to vendors.? I have to confess that I haven't kept up-to-date on the emoji. Maybe I should have said that emoji are fan-driven. 
From unicode at unicode.org Thu Jan 24 04:47:36 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Thu, 24 Jan 2019 10:47:36 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <877518274.400362.1548324826544.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <877518274.400362.1548324826544.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <49f5f750.844.1687f78e899.Webtop.228@btinternet.com> Mark E. Shoulson wrote: > It doesn't just take someone saying "out of scope." It depends who it is. The theory is that people post in the mailing list as individuals, yet some people have very great influence. > It also has to *be* out of scope! Maybe, it depends who says what. > If someone chants the incantation, but I can persuasively argue that > no, it IS in scope, then the spell fails. Well, that may work for you, it does not work for me. Decision is by an unnamed gatekeeper and the Unicode Technical Committee does not get to discuss it, and discussing whether it is in scope or not is not allowed on the mailing list, because discussion of the topic is permanently banned. > Requesting the scope of Unicode be widened is not like other > discussions being had here, so it makes sense that it should be > treated differently, if treated at all. Well, it does not make sense to me. If benefit could be produced by widening the scope of Unicode in some way, then it seems that it should be allowed to be discussed in the mailing list. And even if rejected at some time then still be allowed to be discussed at some future time as things may have changed. > There were discussions and agreements made as to the scope of Unicode, > long ago. Yes. Yet surely decisions made long ago should not lock out all progress as new ideas come along. > And just like you can't petition to change a character name, no matter > how wrong it is, asking the Unicode consortium to redefine itself on > your say-so is -not going to be taken seriously either. Well, to me it is not like that. Yes, "a character name, no matter how wrong it is," is part of the stability guarantee and cannot be changed. Adding U+FFF7 as a base character for a tag digit sequence to uniquely and interoperably and stably define a code for a specific meaning for a localizable sentence would not, as far as I am aware, break any stability guarantees for Unicode. That might widen the scope of Unicode or it might be within the present scope, yet either way if it would be of benefit to end users then it would be reasonable to consider the idea and not block its discussion: and it is not a matter of my say-so at all, putting forward an idea for fair consideration is not at all the same as dictating that something should be done on someone's say-so. Was the scope of Unicode widened for emoji? First of all emoji were encoded for compatibility, but the Unicorn Face changed all that and now it an annual "could be useful" exercise of generating new characters based on people's ideas. For the avoidance of doubt I am not against that at all, it is fun and hopefully will continue. 
I appreciate that the particular tag sequences to follow U+FFF7 might not be encoded by Unicode Inc., they might be encoded by an ISO committee, such as ISO/TC 37. Yet encoding U+FFF7 as the base character would allow a link as interoperable plain text rather than needing to use what amounts to a markup system. Yet please remember that Unicode Inc. has defined and published base character plus tag sequences for the some flags, including the Welsh flag and the Scottish flag. Recently I was informed that they are not part of The Unicode Standard nor part of ISO/IEC 10646. It appears that a Unicode Technical Note is being prepared with recommendations of how to express teletext control characters using Unicode characters, possibly using Escape sequences. So a Unicode Inc. publication listing numbers and meanings together with a context guide for each to help translation of meanings for a localization file of code numbers and sentences into a target language seems not unreasonable. As an example, the vertical line used as a separator, as a comma might be used within the sentence itself, so not using a comma as a separator of fields. 812|Would you like to go to the day room? Not all codes would be three digits, some would be longer. Codes where the first three digits are all different from the other two digits are three digits long. Codes where the first and third digit are the same have a length of 3 plus the value of the third digit. So, for example, codes starting 313 are six digits long and are a set of localizable sentences intended primarily for seeking information through the language barrier about relatives and friends after a disaster. The third digit being zero allows for even longer code numbers. > Discussing how to change the scope so that whatever-it-is IS in scope > is a very large undertaking, ? Not necessarily. If the Unicode Technical Committee were to consider a proposal and, after consideration and discussion were to agree to proceed, it could all be done within a short discussion at a Unicode Technical Committee meeting and then the recommendation sent to the ISO committee. I am not saying that it should be or that it will be, I am just trying to say that it is not necessarily a very large undertaking. The Unicode Technical Committee discusses many things. > ? and would need a tremendous groundswell of support from all the > major stakeholders in Unicode, ? Quite possibly. And if there were discussion in the Unicode mailing list and the topic came up at a Unicode Technical Committee meeting that might happen. > ?, so you should probably start there. Well, they meet at the Unicode Technical Committee meetings, so that is where I consider that the matter should be discussed. The problem is, it is not possible for me at present to get such a suggestion before the committee because it gets blocked and it cannot be discussed in the Unicode mailing list because the topic is permanently banned. > "But so many of the people I would want to talk to about this are > right here on this list!" you say? Be that as it may, it doesn't mean > the list has to grant you a platform. That is very true. Unicode Inc. has no obligation whatsoever to allow me to post my ideas in the Unicode mailing list and no obligation whatsoever to consider my ideas for progress at the Unicode Technical Committee. 
I find it quite ironic that if this idea were implemented then demonstrations of what the system could do would be a marvellous example of what is possible in displaying the languages of the world using Unicode. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_025.pdf > Change the world on your own dime. Well, I had not met that expression before but I have had a search and I think that I understand your meaning. I am doing what I can. I am retired, at home, with a laptop computer with some budget software (yet very good software with which I can make fonts and publish PDF documents), an internet connection, and a small personal webspace hosted by a United Kingdom Public Limited Company for a small annual fee, so it is safe to access, it is not a server based on my home computer, I upload over the internet (it is a legacy webspace from a free-with-dial-up-internet-access webspace dating from 1997 after a takeover then another takeover, after the dial-up facility was closed yet I was allowed to keep the webspace with same original address.) For example, as well as producing some scientific publications, I am writing a novel, chapters 1 ..72, 75, 80, 81 all written, published on the web for free reading and legal-deposited with the British Library. http://www.users.globalnet.co.uk/~ngo/novel.htm If just browsing through, Chapters 34, 42 and 51 are good places to start browsing. > "Unicode isn't here to encode cool new ideas that would be cool and > new. It's here for writing what people already do. " That may have been true once, and maybe that is still the theory, but the continual encoding of new emoji just does not fit that! I did at one time, a few years ago, consider trying to formulate localizable sentences as emoji, each with a square glyph, but I changed from that when I realized that emoji do not have precise meanings yet a very important aspect of localizable sentences is that each one has a very precise meaning and is grammatical independent. > It's as appropriate to demand that Unicode support these things ? One of the problems I get is the Aunt Sally suggestion, not only here but in posts from others, that I am demanding anything. I am a researcher and I would like to put my ideas forward for sensible discussion. I am asking for consideration of my ideas please, I have not, and am not, demanding anything at all. When people start making out that I am making demands it is very prejudicial and, I consider, very unfair. By the way, I have been put on moderated post so please do not reply to the list unless you get a copy of this as from me via Unicode. I write this because I am not seeking to bypass the moderator's decision as if Unicode Inc. does not want any discussion of localizable sentences in its mailing list that is its right so to choose. William Overington Thursday 24 January 2019 From unicode at unicode.org Thu Jan 24 09:06:49 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Thu, 24 Jan 2019 15:06:49 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: <7918bdc9.d847.16880663a2b.Webtop.73@btinternet.com> Andrew West wrote as follows: > ? 
(note that the colored characters do not change the color of the > emoji they are attached to [before or after, depending upon whether > you are speaking French or English dialect of emoji], they are just > intended as a visual indication of what colour you wish the emoji > was). I thought that the idea was that they could possibly be used for glyph substitution with an appropriate font, so that there could be, for example, a glyph of a polar bear. I produced a proposal for some characters specifically intended each as a colour modifier character. http://www.unicode.org/L2/L2018/18198-colour-mod-chars.pdf I know that the document was once on the agenda for a UTC meeting but was not mentioned in the minutes, so I do not know whether consideration of the best plain text way to express a request for a particular colour for an emoji is still taking place and my document is just one of several possibilities being considered. William Overington Thursday 24 January 2019 ------ Original Message ------ From: "Andrew West via Unicode" To: "Mark E. Shoulson" Cc: "Unicode Discussion" Sent: Thursday, 2019 Jan 24 At 11:50 Subject: Re: Encoding italic (was: A last missing link) On Thu, 24 Jan 2019 at 02:10, Mark E. Shoulson via Unicode wrote: > > Unicode isn't here to encode cool new ideas that would be cool and > new. It's here for writing what people already do. http://www.unicode.org/L2/L2018/18141r2-emoji-colors.pdf "Add 14 colored emoji characters for decorative and/or descriptive uses. These may be used to indicate that an emoji has a different color." No evidence has been provided that anybody is currently using colored blobs for this purpose (in fact emoji users have explicitly rejected this method for indicating emoji color: http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an assertion that it would be a good idea if emoji users could add a colored swatch to an existing emoji to indicate what color they want it to represent (note that the colored characters do not change the color of the emoji they are attached to [before or after, depending upon whether you are speaking French or English dialect of emoji], they are just intended as a visual indication of what colour you wish the emoji was). This proposal to add 14 additional colored circles, squares and hearts is a perfect example of a cool new idea for something that the authors think would be really useful, but for which there is no evidence of existing use. The UTC should have rejected it as out of scope, but we all know that rules and procedures do not apply to the Emoji Subcommittee, so in fact this cool new idea will be included in Unicode 12 in March. Andrew From unicode at unicode.org Thu Jan 24 09:54:29 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Thu, 24 Jan 2019 15:54:29 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> Message-ID: On Thu, 24 Jan 2019 at 15:42, James Kass wrote: > > Here's a very polite reply from John Hudson from 2000, > http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html > ...and, over time, many of the replies to William Overington's colorful > suggestions were less than polite. 
But it was clear that colors were > out-of-scope for a computer plain-text encoding standard. Going off topic a little, I saw this tweet from Marijn van Putten today which shows examples of Arabic script from early Quranic manuscripts with phonetic information indicated by the use of red and green dots: https://twitter.com/PhDniX/status/1088171783461703682 I would be interested to know how those should be represented in Unicode. Andrew From unicode at unicode.org Thu Jan 24 10:24:07 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Thu, 24 Jan 2019 18:24:07 +0200 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> Message-ID: <20190124162407.GA2703@macbook.localdomain> On Thu, Jan 24, 2019 at 03:54:29PM +0000, Andrew West via Unicode wrote: > On Thu, 24 Jan 2019 at 15:42, James Kass wrote: > > > > Here's a very polite reply from John Hudson from 2000, > > http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html > > ...and, over time, many of the replies to William Overington's colorful > > suggestions were less than polite. But it was clear that colors were > > out-of-scope for a computer plain-text encoding standard. > > Going off topic a little, I saw this tweet from Marijn van Putten > today which shows examples of Arabic script from early Quranic > manuscripts with phonetic information indicated by the use of red and > green dots: > > https://twitter.com/PhDniX/status/1088171783461703682 > > I would be interested to know how those should be represented in Unicode. It is possible to represent this by use of color fonts. The green (sometimes golden) dots are the hamza, the red ones are various vowel marks. A color font would use colored glyphs for these instead of the modern shapes. I did a color fonts that does a similar thing (but still use the modern forms) and it is on my to do list to do a font using archaic Kufi forms. Regards, Khaled From unicode at unicode.org Thu Jan 24 10:33:39 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 24 Jan 2019 16:33:39 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> Message-ID: <59ecdeef-5b8f-19be-2649-fe9e38d1a020@gmail.com> > Maybe I should have said emoji are fan-driven. That works.? Here's the previous assertion rephrased: ? We should no more expect the conventional Unicode character encoding ? model to apply to emoji than we should expect the old-fashioned text ? ranges to become fan-driven. And if we don't want the text ranges to become fan driven, as pointed out by Martin D?rst and others, we take a cautious and conservative approach to moving forward with the standard. Veering back on-topic, the anti fan driven aversion doesn't apply to encoding italics, although /fans/ would benefit.? There's pre-existing conventions for italics, and a scholar with the credentials of Victor Gaultney should be able to make a credible proposal for encoding them.? I hope we haven't overwhelmed him with a surplus of rhetoric. 
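Andrew West's tag-character suggestion, quoted by James Kass and Mark Shoulson earlier in this thread, is also easy to prototype. The sketch below (Python, with invented helper names) maps ASCII markup such as "<i>" onto the corresponding tag characters at U+E0020..U+E007E, giving exactly the E003C E0069 E003E and E003C E002F E0069 E003E sequences from his message, and shows what a renderer that simply ignores unrecognised tag runs would display. The last lines build the flag-of-Wales tag sequence that William Overington mentions as an existing use of the same mechanism.

TAG_BASE = 0xE0000   # tag characters mirror ASCII at U+E0000 plus the code point

def to_tags(markup):
    # '<i>' -> U+E003C U+E0069 U+E003E, as in Andrew West's example.
    return "".join(chr(TAG_BASE + ord(c)) for c in markup)

def from_tags(text):
    # Map tag characters back to visible ASCII, for inspecting a string.
    return "".join(
        chr(ord(c) - TAG_BASE) if 0xE0020 <= ord(c) <= 0xE007E else c
        for c in text
    )

def display(text):
    # What a renderer that ignores unsupported tag sequences would show.
    return "".join(c for c in text if not (0xE0000 <= ord(c) <= 0xE007F))

ITALIC_ON, ITALIC_OFF = to_tags("<i>"), to_tags("</i>")
sample = "an " + ITALIC_ON + "italic" + ITALIC_OFF + " word"

print(display(sample))     # an italic word   (the markup is invisible)
print(from_tags(sample))   # an <i>italic</i> word

# The flag of Wales already works this way: U+1F3F4 BLACK FLAG followed by
# the tags for 'gbwls' and U+E007F CANCEL TAG.
wales = "\U0001F3F4" + to_tags("gbwls") + "\U000E007F"

Whether tag characters ought to carry styling at all is exactly what the thread disputes; the point of the sketch is only that no new characters would be needed to try it.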
From unicode at unicode.org Thu Jan 24 16:42:59 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 24 Jan 2019 22:42:59 +0000 Subject: Encoding italic In-Reply-To: <20190124162407.GA2703@macbook.localdomain> References: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> <20190124162407.GA2703@macbook.localdomain> Message-ID: <20190124224259.54ec3e28@JRWUBU2> On Thu, 24 Jan 2019 18:24:07 +0200 Khaled Hosny via Unicode wrote: > On Thu, Jan 24, 2019 at 03:54:29PM +0000, Andrew West via Unicode > wrote: >> On Thu, 24 Jan 2019 at 15:42, James Kass >> wrote: >>> Going off topic a little, I saw this tweet from Marijn van Putten >>> today which shows examples of Arabic script from early Quranic >>> manuscripts with phonetic information indicated by the use of red >>> and green dots: >>> >>> https://twitter.com/PhDniX/status/1088171783461703682 >> I would be interested to know how those should be represented in >> Unicode. > It is possible to represent this by use of color fonts. The limitations of rendering technology should not be an argument against an encoding. We have characters that differ only in their properties, such as word-breaking and line-breaking. In this case, it may be argued that their colours apply only to their 'plain' colouring. Who determines what their colour should be in blue text? (Font technology seems to dictate that their colour is unaffected by the choice of foreground colour.) Richard. From unicode at unicode.org Thu Jan 24 17:00:10 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Fri, 25 Jan 2019 01:00:10 +0200 Subject: Encoding italic In-Reply-To: <20190124224259.54ec3e28@JRWUBU2> References: <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> <20190124162407.GA2703@macbook.localdomain> <20190124224259.54ec3e28@JRWUBU2> Message-ID: <20190124230010.GB2703@macbook.localdomain> On Thu, Jan 24, 2019 at 10:42:59PM +0000, Richard Wordingham via Unicode wrote: > On Thu, 24 Jan 2019 18:24:07 +0200 > Khaled Hosny via Unicode wrote: > > > On Thu, Jan 24, 2019 at 03:54:29PM +0000, Andrew West via Unicode > > wrote: > >> On Thu, 24 Jan 2019 at 15:42, James Kass > >> wrote: > > >>> Going off topic a little, I saw this tweet from Marijn van Putten > >>> today which shows examples of Arabic script from early Quranic > >>> manuscripts with phonetic information indicated by the use of red > >>> and green dots: > >>> > >>> https://twitter.com/PhDniX/status/1088171783461703682 > > >> I would be interested to know how those should be represented in > >> Unicode. > > > It is possible to represent this by use of color fonts. > > The limitations of rendering technology should not be an argument > against an encoding. We have characters that differ only in their > properties, such as word-breaking and line-breaking. They are already encoded, in their modern uncolored form. Some of the modern forms like U+06E5 ARABIC SMALL WAW, U+06E5 ARABIC SMALL WAW, etc. were even specifically ?invented? in the previous century to overcome the impracticality of printing in multiple colors, so the colored and uncolored forms are different representations of the same underlying characters. > In this case, it may be argued that their colours apply only to their > 'plain' colouring. 
Who determines what their colour should be in blue > text? (Font technology seems to dictate that their colour is > unaffected by the choice of foreground colour.) The colors don?t change, the vowel marks are always red, the hamza is always green/yellow. From unicode at unicode.org Thu Jan 24 17:46:35 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Fri, 25 Jan 2019 00:46:35 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: <42afafc1-a0ab-e1f7-5954-371f174603d1@kli.org> Message-ID: Den 2019-01-24 03:21, skrev "Mark E. Shoulson via Unicode" : > On 1/22/19 6:26 PM, Kent Karlsson via Unicode wrote: >> Ok. One thing to note is that escape sequences (including control sequences, >> for those who care to distinguish those) probably should be "default >> ignorable" for display. Requiring, or even recommending, them to be default >> ignorable for other processing (like sorting, searching, and other things) >> may be a tall order. So, for display, (maximal) substrings that match: >> >> \u001B[\u0020-\002F]*[\u0030-\007E]| >> (\u001B'['|\009B)[\u0030-\003F]*[\u0020-\002F]*[\u0040-\007E] >> >> should be default ignorable (i.e. invisible, but a "show invisibles" mode >> would show them; not interpreted ones should be kept, even if interpreted >> ones need not, just (re)generated on save). That is as far as Unicode >> should go. > > So it isn't just "these characters should be default ignorable", but > "this regular expression is default ignorable."? This gets back to > "things that span more than a character" again, only this time the > "span" isn't the text being styled, it's the annotation to style it.? True. That is how ECMA/ISO/ANSI escape/control-sequences are designed. Had they not already been designed, and implemented, but we were to do a design today, it would surely be done differently; e.g. having "controls" that consisted only of (individually) "default-ignorable" characters. But, and this is the important thing here: a) The current esc/control-sequences is an accepted standard, since long. b) This standard is still in very much active use, albeit mostly by terminal emulators. But the styling stuff need not at all be limited to terminal emulators. Since it is an actively and widely used standard, I don't see the point of trying to design another way of specifying "default ignorable"-controls for text styling. (HTML, for instance, does not have "default ignorable" controls, since ALL characters in the "controls" are printable characters, so one needs a "second level" for parsing the controls.) True, ignoring or interpreting an esc/control-sequence requires some processing of substrings, since some (all but the first) are printable characters. But not that hard. It has been implemented over and over... Had this standard been defunct, then there would be an opportunity to design something different. > The "bash" shell has special escape-sequences (\[ and \]) to use in > defining its prompt that tell the system that the text enclosed by them > is not rendered and should not be counted when it comes to doing Never heard of. Cannot find any reference mentioning them. Reference? > cursor-control and line-editing stuff (so you put them around, yep, the > escape sequences for coloring or boldfacing or whatever that you want in > your prompt). Line editing stuff in bash is done on an internal buffer (there is a library for doing this, and that library can be used by various other command line programs; bash does not use the system input line editing). 
Then that library tries to show what is in the buffer on the terminal. So, I'm not sure what you are talking about; bash does NOT (somehow) scrape the screen (terminal emulator window). Furthermore, colouring and bold/underline is quite common not only in prompts, but also in output directed at a terminal from various programs. (And it works just fine.) Unfortunately cut-and-paste tends to loose much (or all) of that. (Would be nicer if it got converted to HTML, RTF, .doc, or whatever is the target format; or just nicely kept if "plain text" is the target.) > That would seem to be at least simpler than a big ol' > regexp, but really not that much of an improvement.? It also goes to > show how things like this require all kinds of special handling, > even/especially in a "simple" shell prompt (which could make a strong > case for being "plain text", though, yes, terminal escape codes are a > thing.) They are NOT "terminal escape codes". It is just that, for now, it is just about only terminal emulator that implement esc/control-sequences. >From https://www.ecma-international.org/publications/standards/Ecma-048.htm: "The control functions are intended to be used embedded in character-coded data for interchange, in particular with character-imaging devices." A (plain) text editor is an example of a 'character-imaging device'. (Yes, the terminology is a bit dated.) /Kent K > > ~mark From unicode at unicode.org Thu Jan 24 23:44:21 2019 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Thu, 24 Jan 2019 21:44:21 -0800 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: On Wed, Jan 23, 2019 at 1:27 AM James Kass via Unicode wrote: > > Nobody has really addressed Andrew West's suggestion about using the tag > characters. > > It seems conformant, unobtrusive, requiring no official sanction, and > could be supported by third-partiers in the absence of corporate > interest if deemed desirable. > > One argument against it might be: Whoa, that's just HTML. Why not just > use HTML? SMH > > One argument for it might be: Whoa, that's just HTML! Most everybody > already knows about HTML, so a simple subset of HTML would be recognizable. > > After revisiting the concept, it does seem elegant and workable. It > would provide support for elements of writing in plain-text for anyone > desiring it, enabling essential (or frivolous) preservation of > editorial/authorial intentions in plain-text. > > Am I missing something? (Please be kind if replying.) > There is also RFC 1896 "enriched text", which is an attempt at a lightweight HTML substitute for styling in email. But these, and the ANSI escape code suggestion, seem like they're trying to solve the wrong problem here. 
Here's how I understand the situation: * Some people using forms of text or mostly-text communication that do not provide styling features want to use styling, for emphasis or personal flair * Some of these people caught on to the existence of the "styled" mathematical alphanumerics and, not caring that this is "wrong", started using them as a workaround * The use of these symbols, which are not technically equivalent to basic Latin, make posts inaccessible to screen readers, among other problems These are suggestions for Unicode to provide a different, more "acceptable" workaround for a lack of functionality in these social media systems (this mostly seems to be an issue with Twitter; IME this shows up much less on Facebook). But the root problem isn't the kludge, it's the lack of functionality in these systems: if Twitter etc. simply implemented some styling on their own, the whole thing would be a moot point. Essentially, this is trying to add features to Twitter without waiting for their development team. Interoperability is not an issue, since in modern computers copying and pasting styled text between apps works just fine. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 00:34:10 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 24 Jan 2019 22:34:10 -0800 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 01:14:40 2019 From: unicode at unicode.org (Tex via Unicode) Date: Thu, 24 Jan 2019 23:14:40 -0800 Subject: Encoding italic In-Reply-To: <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> Message-ID: <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> I am surprised at the length of this debate, especially since the arguments are repetitive? That said: Twitter was offered as an example, not the only example just one of the most ubiquitous. Many messaging apps and other apps would benefit from italics. The argument is not based on adding italics to twitter. Most apps today have security protections that filter or translate problematic characters. If the proposal would cause ?normalization? problems, adding the proposed characters to the filter lists or substitution lists would not be a big burden. The biggest burden would be to the apps that would benefit, to add italicizing and editing capabilities. tex From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag via Unicode Sent: Thursday, January 24, 2019 10:34 PM To: unicode at unicode.org Subject: Re: Encoding italic On 1/24/2019 9:44 PM, Garth Wallace via Unicode wrote: But the root problem isn't the kludge, it's the lack of functionality in these systems: if Twitter etc. 
simply implemented some styling on their own, the whole thing would be a moot point. Essentially, this is trying to add features to Twitter without waiting for their development team. Interoperability is not an issue, since in modern computers copying and pasting styled text between apps works just fine. Yep, that's what this is: trying to add features to some platforms that could very simply be added by the respective developers while in the process causing a normalization issue (of sorts) everywhere else. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 01:25:12 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Thu, 24 Jan 2019 23:25:12 -0800 Subject: Encoding italic In-Reply-To: <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> Message-ID: On 1/24/2019 11:14 PM, Tex wrote: > > I am surprised at the length of this debate, especially since the > arguments are repetitive? > > That said: > > Twitter was offered as an example, not the only example just one of > the most ubiquitous. Many messaging apps and other apps would benefit > from italics. The argument is not based on adding italics to twitter. > > Most apps today have security protections that filter or translate > problematic characters. If the proposal would cause ?normalization? > problems, adding the proposed characters to the filter lists or > substitution lists would not be a big burden. > > The biggest burden would be to the apps that would benefit, to add > italicizing and editing capabilities. > The "normalization" is when you import to rich text, you don't want competing formatting instructions. Getting styled character codes normalized to styling of character runs is the most difficult, that's why the abuse of math italics really is abuse in terms of interoperability. Other schemes, like a VS per code point, also suffer from being different in philosophy from "standard" rich text approaches. Best would be as standard extension to all the messaging systems (e.g. a common markdown language, supported by UI). A./ > tex > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag via Unicode > *Sent:* Thursday, January 24, 2019 10:34 PM > *To:* unicode at unicode.org > *Subject:* Re: Encoding italic > > On 1/24/2019 9:44 PM, Garth Wallace via Unicode wrote: > > But the root problem isn't the kludge, it's the lack of > functionality in these systems: if Twitter etc. simply implemented > some styling on their own, the whole thing would be a moot point. > Essentially, this is trying to add features to Twitter without > waiting for their development team. > > Interoperability is not an issue, since in modern computers > copying and pasting styled text between apps works just fine. > > Yep, that's what this is: trying to add features to some platforms > that could very simply be added by the? respective developers while in > the process causing a normalization issue (of sorts) everywhere else. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... 
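To make the normalization point above concrete: the math-alphanumeric workaround does not survive compatibility normalization. A small Python 3 sketch (the string and the helper name are made up for illustration):

    import unicodedata

    # Spell "Italic" with MATHEMATICAL ITALIC letters (capitals start at
    # U+1D434, small letters at U+1D44E).  Naive: a small "h" would need
    # U+210E PLANCK CONSTANT instead, but that gap is not hit here.
    def math_italic(word):
        return "".join(
            chr(0x1D434 + ord(c) - ord("A")) if c.isupper()
            else chr(0x1D44E + ord(c) - ord("a"))
            for c in word)

    styled = math_italic("Italic")
    plain = unicodedata.normalize("NFKC", styled)

    print(plain)             # -> "Italic": the pseudo-italic styling is gone
    print(styled == plain)   # -> False

Any process that applies NFKC (identifier checks, search, some storage layers) silently folds the "styled" letters back to ASCII, which is one face of the interoperability problem described above.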
URL: From unicode at unicode.org Fri Jan 25 05:59:09 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Fri, 25 Jan 2019 03:59:09 -0800 Subject: Encoding italic In-Reply-To: <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> Message-ID: On Thu, Jan 24, 2019 at 11:16 PM Tex via Unicode wrote: > Twitter was offered as an example, not the only example just one of the most ubiquitous. Many messaging apps and other apps would benefit from italics. The argument is not based on adding italics to twitter. And again, color me skeptical. If italics are just added to Unicode and not to the relevant app or interface, they will not see much use, in the same way that most non-ASCII characters for proper English--the quotes, the dashes, the accents--are often ignored because they're too hard to enter. But if you're going to add italics, having it in Unicode doesn't make it significantly easier, particularly when they need to support systems that predate Unicode adding italics. > The biggest burden would be to the apps that would benefit, to add italicizing and editing capabilities. If they would benefit or if they'd accept the burden, they'd have already added italics, via HTML or Markdown or escape sequences or whatever. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Fri Jan 25 06:07:21 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Fri, 25 Jan 2019 07:07:21 -0500 Subject: Ancient Greek apostrophe marking elision Message-ID: There seems some debate amongst digital classicists in whether to use U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking elision. (e.g. ?? for ?? preceding a word starting with a vowel). It seems to me that U+2019 is the technically correct choice per the Unicode Standard but it is not without at least one problem: default word breaking rules. I'm trying to provide guidelines for digital classicists in this regard. Is it correct to say the following: 1) U+2019 is the correct character to use for the apostrophe in Ancient Greek when marking elision. 2) U+02BC is a misuse of a modifier for this purpose 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word token 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules in UAX#29 will (incorrectly) include the apostrophe as part of a glyph cluster with the previous letter 5) The correct solution is to tailor the Word Boundary Rules in the case of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't have the same ambiguity problems with the single quotation mark as in English as it should not be used as a quotation mark in Ancient Greek) Many thanks in advance. James -------------- next part -------------- An HTML attachment was scrubbed... 
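A rough way to see the property difference behind points 3) and 4) above, using only the Python 3 standard library. This is not a UAX #29 implementation; the \w+ pattern is only a crude stand-in for a word segmenter, and the sample string is invented:

    import re
    import unicodedata

    for apostrophe in ("\u2019", "\u02BC"):
        word = "d" + apostrophe          # stand-in for an elided form
        print("U+%04X" % ord(apostrophe),
              unicodedata.category(apostrophe),   # Pf for U+2019, Lm for U+02BC
              re.findall(r"\w+", word))
    # For U+2019 (category Pf) the apostrophe is split off: ['d'].
    # For U+02BC (category Lm) the whole two-character string is one token.

A real UAX #29 word segmenter is more forgiving than \w+ (as Mark Davis's reply below notes, U+2019 between letters does not break a word), but at the end of a word, which is where elision before a vowel puts it, the Pf-versus-Lm difference is exactly what separates the two choices.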
URL: From unicode at unicode.org Fri Jan 25 03:06:35 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Fri, 25 Jan 2019 09:06:35 +0000 (GMT) Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> Message-ID: <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> Asmus Freytag wrote; > Other schemes, like a VS per code point, also suffer from being > different in philosophy from "standard" rich text approaches. Best > would be as standard extension to all the messaging systems (e.g. a > common markdown language, supported by UI). A./ Yet that claim of what would be best would be stateful and statefulness is the very thing that Unicode seeks to avoid. Plain text is the basic system and a Variation Selector mechanism after each character that is to become italicized is not stateful and can be implemented using existing OpenType technology. If an organization chooses to develop and use a rich text format then that is a matter for that organization and any changing of formatting of how italics are done when converting between plain text and rich text is the responsibility of the organization that introduces its rich text format. Twitter was just an example that someone introduced along the way, it was not the original request. Also this is not only about messaging. Of primary importance is the conservation of texts in plain text format, for example, where a printed book has one word italicized in a sentence and the text is being transcribed into a computer. William Overington Friday 25 January 2019 From unicode at unicode.org Fri Jan 25 11:34:40 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 25 Jan 2019 18:34:40 +0100 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: U+2019 is normally the character used, except where the ? is considered a letter. When it is between letters it doesn't cause a word break, but because it is also a right single quote, at the end of words there is a break. Thus in a phrase like ?tryin? to go? there is a word break after the n, because one can't tell. So something like "?? ??????" (picking a phrase at random) would have a word break after the delta. Word break: ?? ?????? However, there is no *line break* between them (which is the more important operation in normal usage). Probably not worth tailoring the word break. Line break: ?? ?????? Mark On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode < unicode at unicode.org> wrote: > There seems some debate amongst digital classicists in whether to use > U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking > elision. (e.g. ?? for ?? preceding a word starting with a vowel). > > It seems to me that U+2019 is the technically correct choice per the > Unicode Standard but it is not without at least one problem: default word > breaking rules. > > I'm trying to provide guidelines for digital classicists in this regard. > > Is it correct to say the following: > > 1) U+2019 is the correct character to use for the apostrophe in Ancient > Greek when marking elision. 
> 2) U+02BC is a misuse of a modifier for this purpose > 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary > Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word > token > 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules in > UAX#29 will (incorrectly) include the apostrophe as part of a glyph cluster > with the previous letter > 5) The correct solution is to tailor the Word Boundary Rules in the case > of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't > have the same ambiguity problems with the single quotation mark as in > English as it should not be used as a quotation mark in Ancient Greek) > > Many thanks in advance. > > James > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 11:39:47 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Fri, 25 Jan 2019 12:39:47 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: Thank you, although the word break does still affect things like double-clicking to select. And people do seem to want to use U+02BC for this reason (and I'm trying to articulate why that isn't what U+02BC is meant for). James On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ?? wrote: > U+2019 is normally the character used, except where the ? is considered a > letter. When it is between letters it doesn't cause a word break, but > because it is also a right single quote, at the end of words there is a > break. Thus in a phrase like ?tryin? to go? there is a word break after the > n, because one can't tell. > > So something like "?? ??????" (picking a phrase at random) would have a > word break after the delta. > > Word break: > ?? ?????? > > However, there is no *line break* between them (which is the more > important operation in normal usage). Probably not worth tailoring the word > break. > > Line break: > ?? ?????? > > Mark > > > On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode < > unicode at unicode.org> wrote: > >> There seems some debate amongst digital classicists in whether to use >> U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking >> elision. (e.g. ?? for ?? preceding a word starting with a vowel). >> >> It seems to me that U+2019 is the technically correct choice per the >> Unicode Standard but it is not without at least one problem: default word >> breaking rules. >> >> I'm trying to provide guidelines for digital classicists in this regard. >> >> Is it correct to say the following: >> >> 1) U+2019 is the correct character to use for the apostrophe in Ancient >> Greek when marking elision. >> 2) U+02BC is a misuse of a modifier for this purpose >> 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary >> Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word >> token >> 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules >> in UAX#29 will (incorrectly) include the apostrophe as part of a glyph >> cluster with the previous letter >> 5) The correct solution is to tailor the Word Boundary Rules in the case >> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't >> have the same ambiguity problems with the single quotation mark as in >> English as it should not be used as a quotation mark in Ancient Greek) >> >> Many thanks in advance. 
>> >> James >> > -- *James Tauber* Greek Linguistics: https://jktauber.com/ Music Theory: https://modelling-music.com/ Digital Tolkien: https://digitaltolkien.com/ Twitter: @jtauber -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 12:05:40 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 25 Jan 2019 18:05:40 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: For U+2019, there's a note saying 'this is the preferred character to use for apostrophe'. Mark Davis wrote, > When it is between letters it doesn't cause a word break, ... Some applications don't seem to get that.? For instance, the spellchecker for Mozilla Thunderbird flags the string "aren" for correction in the word "aren?t", which suggests that users trying to use preferred characters may face uphill battles. From unicode at unicode.org Fri Jan 25 15:26:33 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 25 Jan 2019 21:26:33 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: <20190125212633.147193ac@JRWUBU2> On Fri, 25 Jan 2019 12:39:47 -0500 James Tauber via Unicode wrote: > Thank you, although the word break does still affect things like > double-clicking to select. > > And people do seem to want to use U+02BC for this reason (and I'm > trying to articulate why that isn't what U+02BC is meant for). It's a bit tricky when the reason is that it was too hard to get users of English to make a distinction between U+02BC and U+2019. And for Larry Niven's elephant-like aliens in _Footfall__, is _fi'_, the singular of _fithp_, better written with U+02BC or U+2019? And does the phonetically faithful spelling of Estuarine English _fi'_ for _fit_ depend on whether the glottal stop is dropped? The science-fiction ethnonym _Vl'harg_ is also tricky. Does its elegant encoding depend on whether the apostrophe is a vowel symbol (so U+02BC) or the indication of an omitted vowel (so U+2019)? Richard. From unicode at unicode.org Fri Jan 25 15:59:58 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Fri, 25 Jan 2019 13:59:58 -0800 Subject: Encoding italic In-Reply-To: <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> Message-ID: <6d3e4948-648a-2883-c2f1-aa60559677f2@ix.netcom.com> On 1/25/2019 1:06 AM, wjgo_10009 at btinternet.com wrote: > Asmus Freytag wrote; > >> Other schemes, like a VS per code point, also suffer from being >> different in philosophy from "standard" rich text approaches. Best >> would be as standard extension to all the messaging systems (e.g. a >> common markdown language, supported by UI).???? A./ > > Yet that claim of what would be best would be stateful and > statefulness is the very thing that Unicode seeks to avoid. All rich text is stateful, and rich text is very widely used and cut&paste tends to work rather well among applications that support it, as do conversions of entire documents. 
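What a stateful, run-based representation of italics looks like in its simplest form can be sketched in a few lines of Python 3. The function name and the (start, end) run representation are invented for illustration; real rich-text models carry far more than this.

    from html import escape

    def runs_to_html(text, italic_runs):
        """Serialize plain text plus italic runs to HTML <i> markup."""
        out, pos = [], 0
        for start, end in sorted(italic_runs):
            out.append(escape(text[pos:start]))
            out.append("<i>" + escape(text[start:end]) + "</i>")
            pos = end
        out.append(escape(text[pos:]))
        return "".join(out)

    print(runs_to_html("one word italicized", [(4, 8)]))
    # -> one <i>word</i> italicized

Going the other way, from per-character "italic code points" back to runs like these, is the normalization work referred to earlier in the thread; and the run model also shows why a converter cannot know by itself whether a given run should become <i>, <em>, <cite>, or something else.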
Trying to duplicate it with "yet another mechanism" is a doubtful achievement, even if it could be made "stateless". A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 16:02:25 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Fri, 25 Jan 2019 17:02:25 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190125212633.147193ac@JRWUBU2> References: <20190125212633.147193ac@JRWUBU2> Message-ID: I guess U+02BC is category Lm not Mn, but doesn't that still mean it modifies the previous character (i.e. is really part of the same grapheme cluster) and so isn't appropriate as either a vowel or an indication of an omitted vowel? On Fri, Jan 25, 2019 at 4:30 PM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Fri, 25 Jan 2019 12:39:47 -0500 > James Tauber via Unicode wrote: > > > Thank you, although the word break does still affect things like > > double-clicking to select. > > > > And people do seem to want to use U+02BC for this reason (and I'm > > trying to articulate why that isn't what U+02BC is meant for). > > It's a bit tricky when the reason is that it was too hard to get users > of English to make a distinction between U+02BC and U+2019. And for > Larry Niven's elephant-like aliens in _Footfall__, is _fi'_, the > singular of _fithp_, better written with U+02BC or U+2019? And does > the phonetically faithful spelling of Estuarine English _fi'_ for > _fit_ depend on whether the glottal stop is dropped? > > The science-fiction ethnonym _Vl'harg_ is also tricky. Does its elegant > encoding depend on whether the apostrophe is a vowel symbol (so > U+02BC) or the indication of an omitted vowel (so U+2019)? > > Richard. > -- *James Tauber* Greek Linguistics: https://jktauber.com/ Music Theory: https://modelling-music.com/ Digital Tolkien: https://digitaltolkien.com/ Twitter: @jtauber -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 16:03:52 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 25 Jan 2019 14:03:52 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 16:06:57 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 25 Jan 2019 14:06:57 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: <8d8f81ea-88d9-8382-f8b1-5b648eda9923@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 17:49:15 2019 From: unicode at unicode.org (Andrew Cunningham via Unicode) Date: Sat, 26 Jan 2019 10:49:15 +1100 Subject: Encoding italic In-Reply-To: <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> Message-ID: Assuming some mechanism for italics is added to Unicode, when converting between the new plain text and HTML there is insufficient information to correctly convert to HTML. 
many elements may have italic stying and there would be no meta information in Unicode to indicate the appropriate HTML element. On Friday, 25 January 2019, wjgo_10009 at btinternet.com via Unicode < unicode at unicode.org> wrote: > Asmus Freytag wrote; > > Other schemes, like a VS per code point, also suffer from being different >> in philosophy from "standard" rich text approaches. Best would be as >> standard extension to all the messaging systems (e.g. a common markdown >> language, supported by UI). A./ >> > > Yet that claim of what would be best would be stateful and statefulness is > the very thing that Unicode seeks to avoid. > > Plain text is the basic system and a Variation Selector mechanism after > each character that is to become italicized is not stateful and can be > implemented using existing OpenType technology. > > If an organization chooses to develop and use a rich text format then that > is a matter for that organization and any changing of formatting of how > italics are done when converting between plain text and rich text is the > responsibility of the organization that introduces its rich text format. > > Twitter was just an example that someone introduced along the way, it was > not the original request. > > Also this is not only about messaging. Of primary importance is the > conservation of texts in plain text format, for example, where a printed > book has one word italicized in a sentence and the text is being > transcribed into a computer. > > William Overington > Friday 25 January 2019 > > -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 18:18:32 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Fri, 25 Jan 2019 16:18:32 -0800 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> Message-ID: <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> On 1/25/2019 3:49 PM, Andrew Cunningham wrote: > Assuming some mechanism for italics is added to Unicode,? when > converting between the new plain text and HTML there is insufficient > information to correctly convert to HTML. many elements may have > italic stying and there would be no meta information in Unicode to > indicate the appropriate HTML element. > > So, we would be creating an interoperability issue. A./ > > > On Friday, 25 January 2019, wjgo_10009 at btinternet.com > via Unicode > wrote: > > Asmus Freytag wrote; > > Other schemes, like a VS per code point, also suffer from > being different in philosophy from "standard" rich text > approaches. Best would be as standard extension to all the > messaging systems (e.g. a common markdown language, supported > by UI).? ? ?A./ > > > Yet that claim of what would be best would be stateful and > statefulness is the very thing that Unicode seeks to avoid. > > Plain text is the basic system and a Variation Selector mechanism > after each character that is to become italicized is not stateful > and can be implemented using existing OpenType technology. 
> > If an organization chooses to develop and use a rich text format > then that is a matter for that organization and any changing of > formatting of how italics are done when converting between plain > text and rich text is the responsibility of the organization that > introduces its rich text format. > > Twitter was just an example that someone introduced along the way, > it was not the original request. > > Also this is not only about messaging. Of primary importance is > the conservation of texts in plain text format, for example, where > a printed book has one word italicized in a sentence and the text > is being transcribed into a computer. > > William Overington > Friday 25 January 2019 > > > > -- > Andrew Cunningham > lang.support at gmail.com > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 20:36:27 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 26 Jan 2019 02:36:27 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <20190125212633.147193ac@JRWUBU2> Message-ID: <20190126023627.4962951e@JRWUBU2> On Fri, 25 Jan 2019 17:02:25 -0500 James Tauber via Unicode wrote: > I guess U+02BC is category Lm not Mn, but doesn't that still mean it > modifies the previous character (i.e. is really part of the same > grapheme cluster) and so isn't appropriate as either a vowel or an > indication of an omitted vowel? To quote TUS: "A few may modify the following letter, and some may serve as a independent letters". Bear in mind that one of the uses of U+02BC is the scholarly representation of a glottal stop, especially in Arabic names. Richard. From unicode at unicode.org Sat Jan 26 00:12:25 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Sat, 26 Jan 2019 01:12:25 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190126023627.4962951e@JRWUBU2> References: <20190125212633.147193ac@JRWUBU2> <20190126023627.4962951e@JRWUBU2> Message-ID: On Fri, Jan 25, 2019 at 9:41 PM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > To quote TUS: > > "A few may modify the following letter, and some may serve as a > independent letters". > > Bear in mind that one of the uses of U+02BC is the scholarly > representation of a glottal stop, especially in Arabic names. > Okay, so this legitimises the use of U+02BC (with its better word-breaking properties) for the apostrophe marking elision in Ancient Greek even though U+2019 is stated as the preferred character _in general_ for the apostrophe. On balance, this would seem to suggest U+02BC can (and perhaps should) be used for the specific purpose in Ancient Greek. (Of course, the other character that comes up is U+1FBD, but there the consensus seems strong that this is just plain wrong.) Thank you all. James -------------- next part -------------- An HTML attachment was scrubbed... 
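The Lm-versus-Mn point in the exchange above is easy to check mechanically. A small Python 3 comparison (U+0313 is used here only as a convenient example of a true combining mark):

    import unicodedata

    for cp in (0x02BC,   # MODIFIER LETTER APOSTROPHE
               0x0313):  # COMBINING COMMA ABOVE (a genuine Mn mark)
        ch = chr(cp)
        print("U+%04X" % cp,
              unicodedata.category(ch),    # Lm vs Mn
              unicodedata.combining(ch))   # canonical combining class: 0 vs 230

U+02BC is a spacing letter with combining class 0: under the default rules it does not attach to the preceding base character but stands as (part of) a word in its own right, which is the behaviour the quoted exchange converges on.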
URL: From unicode at unicode.org Sat Jan 26 00:39:42 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 26 Jan 2019 06:39:42 +0000 Subject: Encoding italic In-Reply-To: <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> Message-ID: <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> On 2019-01-26 12:18 AM, Asmus Freytag (c) responded: > On 1/25/2019 3:49 PM, Andrew Cunningham wrote: >> Assuming some mechanism for italics is added to Unicode,? when >> converting between the new plain text and HTML there is insufficient >> information to correctly convert to HTML. many elements may have >> italic stying and there would be no meta information in Unicode to >> indicate the appropriate HTML element. >> >> > So, we would be creating an interoperability issue. > > What happens now when we convert plain-text to HTML? From unicode at unicode.org Sat Jan 26 00:42:36 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 26 Jan 2019 06:42:36 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <8d8f81ea-88d9-8382-f8b1-5b648eda9923@ix.netcom.com> References: <8d8f81ea-88d9-8382-f8b1-5b648eda9923@ix.netcom.com> Message-ID: <83391552-c908-ce3e-d9b1-9126a427dfe8@gmail.com> On 2019-01-25 10:06 PM, Asmus Freytag via Unicode wrote: > James, by now it's unclear whether your ' is 2019 or 02BC. The example word "aren't" in previous message used U+2019.? Sorry if I was unclear. From unicode at unicode.org Sat Jan 26 05:02:58 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sat, 26 Jan 2019 12:02:58 +0100 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: > breaking selection for "d'Artagnan" or "can't" into two is overly fussy. True, and that is not what U+2019 does; it does not break medially. Mark On Fri, Jan 25, 2019 at 11:07 PM Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 1/25/2019 9:39 AM, James Tauber via Unicode wrote: > > Thank you, although the word break does still affect things like > double-clicking to select. > > And people do seem to want to use U+02BC for this reason (and I'm trying > to articulate why that isn't what U+02BC is meant for). > > For normal edition operations, breaking selection for "d'Artagnan" or > "can't" into two is overly fussy. > > No wonder people get frustrated. > > A./ > > James > > On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ?? wrote: > >> U+2019 is normally the character used, except where the ? is considered a >> letter. When it is between letters it doesn't cause a word break, but >> because it is also a right single quote, at the end of words there is a >> break. Thus in a phrase like ?tryin? to go? there is a word break after the >> n, because one can't tell. >> >> So something like "?? ??????" (picking a phrase at random) would have a >> word break after the delta. >> >> Word break: >> ?? ?????? >> >> However, there is no *line break* between them (which is the more >> important operation in normal usage). 
Probably not worth tailoring the word >> break. >> >> Line break: >> ?? ?????? >> >> Mark >> >> >> On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode < >> unicode at unicode.org> wrote: >> >>> There seems some debate amongst digital classicists in whether to use >>> U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking >>> elision. (e.g. ?? for ?? preceding a word starting with a vowel). >>> >>> It seems to me that U+2019 is the technically correct choice per the >>> Unicode Standard but it is not without at least one problem: default word >>> breaking rules. >>> >>> I'm trying to provide guidelines for digital classicists in this regard. >>> >>> Is it correct to say the following: >>> >>> 1) U+2019 is the correct character to use for the apostrophe in Ancient >>> Greek when marking elision. >>> 2) U+02BC is a misuse of a modifier for this purpose >>> 3) However, use of U+2019 (unlike U+02BC) means the default Word >>> Boundary Rules in UAX#29 will (incorrectly) exclude the apostrophe from the >>> word token >>> 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules >>> in UAX#29 will (incorrectly) include the apostrophe as part of a glyph >>> cluster with the previous letter >>> 5) The correct solution is to tailor the Word Boundary Rules in the case >>> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't >>> have the same ambiguity problems with the single quotation mark as in >>> English as it should not be used as a quotation mark in Ancient Greek) >>> >>> Many thanks in advance. >>> >>> James >>> >> > > -- > *James Tauber* > Greek Linguistics: https://jktauber.com/ > Music Theory: https://modelling-music.com/ > Digital Tolkien: https://digitaltolkien.com/ > > Twitter: @jtauber > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 26 05:45:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 26 Jan 2019 11:45:19 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: Mark Davis responded to Asmus Freytag, >> breaking selection for "d'Artagnan" or "can't" into two is overly fussy. > > True, and that is not what U+2019 does; it does not break medially. Mark Davis earlier posted this example, > So something like "?? ??????" (picking a phrase at random) would have > a word break after the delta. If the user wanted to use the preferred character, U+2019, would using the no break space (U+00A0) after it resolve the word or line break issues?? Or possibly NNBSP (U+202F)? It's a shame if users choose suboptimal characters over preferred characters because of what are essentially rendering/text selection issues.? IMO, it's better to use preferred characters in the long run. (Users should file bug reports on applications which improperly medially break strings which include U+2019.) From unicode at unicode.org Sat Jan 26 09:45:54 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 26 Jan 2019 15:45:54 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Perhaps I'm not understanding, but if the desired behavior is to prohibit both line and word breaks in the example string, then... In Notepad, replacing U+0020 with U+00A0 removes the line-break. U+0020 ( ?? ?????? ) U+00A0 ( ????????? ) U+202F ( ????????? 
) It also changes the advancement of the text cursor (Ctrl + arrows), suggesting that word/string selection would be as desired.? (U+202F also does this and may offer a more pleasing appearance to classisists by default.) Wouldn't it be best to handle substitution of U+00A0 for U+0020 at the input method / keyboard driver level where appropriate, so that preferred apostrophe U+2019 can be used? From unicode at unicode.org Sat Jan 26 17:52:28 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Sat, 26 Jan 2019 18:52:28 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Message-ID: Well, *my* desire it to simple know whether to tell people doing digital editions of Ancient Greek texts whether to use U+2019 or U+02BC for the apostrophe marking elision (or at least accurately describe the trade-offs of each). On Sat, Jan 26, 2019 at 10:50 AM James Kass via Unicode wrote: > > Perhaps I'm not understanding, but if the desired behavior is to > prohibit both line and word breaks in the example string, then... > > In Notepad, replacing U+0020 with U+00A0 removes the line-break. > U+0020 ( ?? ?????? ) > U+00A0 ( ?? ?????? ) > U+202F ( ????????? ) > It also changes the advancement of the text cursor (Ctrl + arrows), > suggesting that word/string selection would be as desired. (U+202F also > does this and may offer a more pleasing appearance to classisists by > default.) > > Wouldn't it be best to handle substitution of U+00A0 for U+0020 at the > input method / keyboard driver level where appropriate, so that > preferred apostrophe U+2019 can be used? > > -- *James Tauber* Eldarion | jktauber.com (Greek Linguistics) | Modelling Music | Digital Tolkien -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 26 18:32:43 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 27 Jan 2019 00:32:43 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Message-ID: I?ll be publishing a translation of Alice into Ancient Greek in due course. I will absolutely only use U+2019 for the apostrophe. It would be wrong for lots of reasons to use U+02BC for this. Moreover, implementations of U+02BC need to be revised. In the context of Polynesian languages, it is impossible to use U+02BC if it is _identical_ to U+2019. Readers cannot work out what is what. I will prepare documentation on this in due course. > On 26 Jan 2019, at 23:52, James Tauber via Unicode wrote: > > Well, my desire it to simple know whether to tell people doing digital editions of Ancient Greek texts whether to use U+2019 or U+02BC for the apostrophe marking elision (or at least accurately describe the trade-offs of each). From unicode at unicode.org Sat Jan 26 19:11:49 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 17:11:49 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Jan 26 19:15:18 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 01:15:18 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Message-ID: <20190127011518.0b7e2ace@JRWUBU2> On Sat, 26 Jan 2019 15:45:54 +0000 James Kass via Unicode wrote: > Perhaps I'm not understanding, but if the desired behavior is to > prohibit both line and word breaks in the example string, then... > > In Notepad, replacing U+0020 with U+00A0 removes the line-break. I believe the problem is that "?? ??????" should have non-blank *words*. With U+2019, one gets 3. Line-break suppressing spaces don't help with word-breaking, because they are not treated as letters. A clunky solution would be to have a sequence . However, there is no such thing as a 'control-joining-words' if one complies with the TUS injunction in Section 23.3, "The word joiner should be ignored in contexts other than line breaking". A robust, trainable spell-checker will treat this institutionally racist injunction with the contempt it deserves. It's interesting that the spellings "'bus" and "'phone" have died. They would once have hit the word-boundary problems when "bus" and "phone" were rejected. Richard. From unicode at unicode.org Sat Jan 26 19:37:39 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 01:37:39 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Message-ID: <20190127013739.3eb50597@JRWUBU2> On Sun, 27 Jan 2019 00:32:43 +0000 Michael Everson via Unicode wrote: > I?ll be publishing a translation of Alice into Ancient Greek in due > course. I will absolutely only use U+2019 for the apostrophe. It > would be wrong for lots of reasons to use U+02BC for this. Please list them. Will your coding decision be machine readable for the readership? > Moreover, implementations of U+02BC need to be revised. In the > context of Polynesian languages, it is impossible to use U+02BC if it > is _identical_ to U+2019. Readers cannot work out what is what. I > will prepare documentation on this in due course. It looks as though you've found a new character - or a revived distinction. Richard. From unicode at unicode.org Sat Jan 26 19:43:57 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 01:43:57 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> Message-ID: <20190127014357.78efc612@JRWUBU2> On Sat, 26 Jan 2019 17:11:49 -0800 Asmus Freytag via Unicode wrote: > To make matters worse, users for languages that "should" use U+02BC > aren't actually consistent; much data uses U+2019 or U+0027. Ordinary > users can't tell the difference (and spell checkers seem not > successful in enforcing the practice). That appears to contradict Michael Everson's remark about a Polynesian need to distinguish the two visually. Richard. 
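The distinction drawn above, that line-break-suppressing spaces are not letters, can be read straight off the general categories; a quick Python 3 check:

    import unicodedata

    for cp in (0x0020, 0x00A0, 0x202F, 0x2060):
        ch = chr(cp)
        print("U+%04X" % cp, unicodedata.category(ch), unicodedata.name(ch))

    # U+0020 Zs SPACE
    # U+00A0 Zs NO-BREAK SPACE
    # U+202F Zs NARROW NO-BREAK SPACE
    # U+2060 Cf WORD JOINER

None of these is a letter: a no-break space or word joiner changes line-breaking behaviour, not what counts as a word, which is the objection raised above.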
From unicode at unicode.org Sat Jan 26 19:55:29 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 27 Jan 2019 01:55:29 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127014357.78efc612@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> Message-ID: <9280d0ca-88d7-728e-03aa-70d55486274e@gmail.com> Richard Wordingham replied to Asmus Freytag, >> To make matters worse, users for languages that "should" use U+02BC >> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary >> users can't tell the difference (and spell checkers seem not >> successful in enforcing the practice). > > That appears to contradict Michael Everson's remark about a Polynesian > need to distinguish the two visually. Does it? U+02BC /should/ be used but ordinary users can't tell the difference because the glyphs in their displays are identical, resulting in much data which uses U+2019 or U+0027.? I don't see any contradiction. From unicode at unicode.org Sat Jan 26 19:59:27 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 27 Jan 2019 01:59:27 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127013739.3eb50597@JRWUBU2> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> Message-ID: <3ccba01a-cabf-8e54-1983-fb12b6bb9ef4@gmail.com> Richard Wordingham responded to Michael Everson, >> I?ll be publishing a translation of Alice into Ancient Greek in due >> course. I will absolutely only use U+2019 for the apostrophe. It >> would be wrong for lots of reasons to use U+02BC for this. > > Please list them. Let's see the list of reasons why U+02BC should be used first. From unicode at unicode.org Sat Jan 26 20:06:57 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 27 Jan 2019 02:06:57 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127014357.78efc612@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> Message-ID: <144A0313-1566-4C3E-862F-C7C313881B65@evertype.com> Polynesians are using 0027 as a fallback, and this has to do with education, keyboarding, and training. The typography of the fallback is of no consequence. It?s a fallback. > On 27 Jan 2019, at 01:43, Richard Wordingham via Unicode wrote: > > On Sat, 26 Jan 2019 17:11:49 -0800 > Asmus Freytag via Unicode wrote: > >> To make matters worse, users for languages that "should" use U+02BC >> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary >> users can't tell the difference (and spell checkers seem not >> successful in enforcing the practice). > > That appears to contradict Michael Everson's remark about a Polynesian > need to distinguish the two visually. > > Richard. From unicode at unicode.org Sat Jan 26 20:25:41 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 27 Jan 2019 02:25:41 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127013739.3eb50597@JRWUBU2> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> Message-ID: On 27 Jan 2019, at 01:37, Richard Wordingham via Unicode wrote: > >> I?ll be publishing a translation of Alice into Ancient Greek in due >> course. I will absolutely only use U+2019 for the apostrophe. It >> would be wrong for lots of reasons to use U+02BC for this. > > Please list them. The Greek use is of an apostrophe. 
Often a mark elision (as here), that?s what 2019 is for. 02BC is a letter. Usually a glottal stop. I didn?t follow the beginning of this. Evidently it has something to do with word selection of d? + a space + what follows. If that?s so, then there?s no argument at all for 02BC. It?s a question of the space, and that?s got nothing to do with the identity of the apostrophe. > Will your coding decision be machine readable for the readership? I don?t know what you mean by ?readable?. >> Moreover, implementations of U+02BC need to be revised. In the >> context of Polynesian languages, it is impossible to use U+02BC if it >> is _identical_ to U+2019. Readers cannot work out what is what. I >> will prepare documentation on this in due course. > > It looks as though you've found a new character - or a revived > distinction. It may not be ?revived?. In origin, linguists took the lead-type 2019 and used it as a consonant letter. Now, in the 21st century, where Harry Potter is translated into Hawaiian, and where Harry Potter has glottals alongside both single and double quotation marks, the 02BC?s need to be bigger or the text can?t be read easily. In our work we found that a vertical height of 140% bigger than the quotation mark improved legibility hugely. Fine typography asks for some other alterations to the glyph, but those are cosmetic. If the recommended glyph for 02BC were to be changed, it would in no case impact adversely on scientific linguistics texts. It would just make the mark a bit bigger. But for practical use in Polynesian languages where the character has to be found alongside the quotation marks, a glyph distinction must be made between this and punctuation. Michael Everson From unicode at unicode.org Sat Jan 26 20:26:23 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 27 Jan 2019 02:26:23 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <3ccba01a-cabf-8e54-1983-fb12b6bb9ef4@gmail.com> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <3ccba01a-cabf-8e54-1983-fb12b6bb9ef4@gmail.com> Message-ID: <6BD6F978-B123-4F97-9E50-1A8189A36787@evertype.com> Fair enough, but I didn?t wait. > On 27 Jan 2019, at 01:59, James Kass via Unicode wrote: > > > Richard Wordingham responded to Michael Everson, > > >> I?ll be publishing a translation of Alice into Ancient Greek in due > >> course. I will absolutely only use U+2019 for the apostrophe. It > >> would be wrong for lots of reasons to use U+02BC for this. > > > > Please list them. > > Let's see the list of reasons why U+02BC should be used first. > From unicode at unicode.org Sat Jan 26 21:53:06 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 03:53:06 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <9280d0ca-88d7-728e-03aa-70d55486274e@gmail.com> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <9280d0ca-88d7-728e-03aa-70d55486274e@gmail.com> Message-ID: <20190127035306.27d7a124@JRWUBU2> On Sun, 27 Jan 2019 01:55:29 +0000 James Kass via Unicode wrote: > Richard Wordingham replied to Asmus Freytag, > > >> To make matters worse, users for languages that "should" use > >> U+02BC aren't actually consistent; much data uses U+2019 or > >> U+0027. Ordinary users can't tell the difference (and spell > >> checkers seem not successful in enforcing the practice). 
> > > > That appears to contradict Michael Everson's remark about a > > Polynesian need to distinguish the two visually. > > Does it? > > U+02BC /should/ be used but ordinary users can't tell the difference > because the glyphs in their displays are identical, resulting in much > data which uses U+2019 or U+0027.? I don't see any contradiction. I had assumed that Polynesians would be writing with paper and ink. It depends on what 'tell the difference' means. In normal parlance it means that they are unaware of the difference in the symbols; you are assuming that it means that printed material doesn't show the difference. In general, handwritten differences can show up in various ways. For example, one can find a slight, unreliable difference in the relative positioning of characters that reflects the difference in the usage of characters. Of course, Asmus's facts have to be unreliable. It's like someone typing U+1142A NEWA LETTER MHA for Sanskrit , which we've been assured would never happen. There must be something wrong with reality. Richard. From unicode at unicode.org Sat Jan 26 23:11:36 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 21:11:36 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127014357.78efc612@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 26 23:23:04 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 21:23:04 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 26 23:28:50 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 21:28:50 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127035306.27d7a124@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <9280d0ca-88d7-728e-03aa-70d55486274e@gmail.com> <20190127035306.27d7a124@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 27 00:08:31 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 06:08:31 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> Message-ID: <20190127060831.2e96572d@JRWUBU2> On Sat, 26 Jan 2019 21:11:36 -0800 Asmus Freytag via Unicode wrote: > On 1/26/2019 5:43 PM, Richard Wordingham via Unicode wrote: >> That appears to contradict Michael Everson's remark about a >> Polynesian >> need to distinguish the two visually. > Why do you need to distinguish them? To code text correctly (so the > invisible properties are what the software expects) or because a > human reader needs the disambiguation in order to follow the text? > The latter phenomenon is so common throughout many writing systems, > that I have difficulties buying it. It may be a matter of literacy in Hawaiian. If the test readership doesn't use ?okina, it could be confusing to have to resolve the difference between a sentence(?) starting with one from a sentence in single quotes. Otherwise, one does wonder why the issue should only arise now. 
One other possibility is that single quote punctuation is being used on a readership used to double quote punctuation. Double quotes would avoid the confusion. > PS: I wasn't talking about what the Polynesians do; different part of > the world. Why should the Polynesians be different? Richard. From unicode at unicode.org Sun Jan 27 00:19:50 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 22:19:50 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127060831.2e96572d@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> Message-ID: <973744ae-797d-18da-c5b3-1fd3ccd229cf@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 27 02:02:07 2019 From: unicode at unicode.org (Andrew Cunningham via Unicode) Date: Sun, 27 Jan 2019 19:02:07 +1100 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <973744ae-797d-18da-c5b3-1fd3ccd229cf@ix.netcom.com> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> <973744ae-797d-18da-c5b3-1fd3ccd229cf@ix.netcom.com> Message-ID: On Sunday, 27 January 2019, Asmus Freytag via Unicode wrote: > > Choice of quotation marks is language-based and for novels, many times > there are > additional conventions that may differ by publisher. > > Wonder why the publisher is forcing single quotes on them > In theory quotation marks are language based but many languages have had the puntuation and typographic conventions of colonial languages imposed, even when it isn't the best choice. And publishers are following established patterns. The publishers that care about the language do try to distinguish or refine these characters typographically. Andrew -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 27 09:08:19 2019 From: unicode at unicode.org (Tom Gewecke via Unicode) Date: Sun, 27 Jan 2019 08:08:19 -0700 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127060831.2e96572d@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> Message-ID: <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org> > On Jan 26, 2019, at 11:08 PM, Richard Wordingham via Unicode wrote: > > It may be a matter of literacy in Hawaiian. If the test readership > doesn't use ?okina, I think the Unicode Hawaiian ?okina is supposed to be U+02BB (instead of U+02BC). From unicode at unicode.org Sun Jan 27 09:37:37 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 27 Jan 2019 15:37:37 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org> Message-ID: <745b8264-7e20-6a73-199a-6de4e9cab66a@gmail.com> On 2019-01-27 3:08 PM, Tom Gewecke via Unicode wrote: > I think the Unicode Hawaiian ?okina is supposed to be U+02BB (instead > of U+02BC). 
notes for U+02BB
 * typographical alternate for 02BD or 02BF
 * used in Hawai'ian orthography as 'okina (glottal stop)

From unicode at unicode.org Sun Jan 27 10:08:20 2019
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Sun, 27 Jan 2019 16:08:20 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <20190127052149.1baaf1b2@JRWUBU2>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2>
Message-ID:

On 27 Jan 2019, at 05:21, Richard Wordingham wrote:

>>>> I'll be publishing a translation of Alice into Ancient Greek in due course. I will absolutely only use U+2019 for the apostrophe. It would be wrong for lots of reasons to use U+02BC for this.
>>>
>>> Please list them.
>>
>> The Greek use is of an apostrophe. Often a mark of elision (as here), that's what 2019 is for.
>>
>> 02BC is a letter. Usually a glottal stop.
>
> So it would seem that the 'lots of reasons' is just that it goes against the *recommendation* of TUS.

I have no idea what TUS says about this. I did not look it up. I know a lot about characters, though.

> Incidentally, I believe the principal use of U+2019 RIGHT SINGLE QUOTATION MARK is as a quotation mark.

You can believe what you like, but that isn't likely true. In books which prefer "this kind" of quotation marks for primary quotations and 'this kind' for nested quotations, 2019 is primarily used for the apostrophe in words like I'm, can't, isn't, don't etc. In books which prefer 'this kind' for primary quotations, the statistics for 2019 will be different. But 2019 is still the correct character for both.

> As you have noted in the text left in below, U+02BC started out as the apostrophe.

Lead-type typesetters used that sort, yes. And that sort was used for both apostrophe and single quotation marks.

> The closing single inverted comma has a different origin to the apostrophe.

No, it doesn't, but you are welcome to try to prove your assertion.

> My argument for U+02BC is that this apostrophe is an integral part of the word.

It is a letter. In 'can't' the apostrophe isn't a letter. It's a mark of elision. I can double-click on the three words in this paragraph which have the apostrophe in them, and they are all whole-word selected.

> The main constituents of a prototypical word are letters and their attendant marks. Now, the word-breaking algorithm in TR29 allows for various generally overloaded elements to join elements of a word. However, this apostrophe does not mark the boundary of constituents. Accordingly it makes sense to treat it as a letter.

The behaviour of 2019 is not broken. I use it every day. I've typeset many, many books in English and Cornish and Irish, all of which use single quotation marks and double quotation marks and lots and lots of apostrophes, and I have no trouble with them. 2019 has for decades been treated correctly in software that I use.

> Treating the Greek apostrophe as a letter (U+02BC) gives better word-breaking.

Why do you claim this? I did not read the beginning of this thread and I am not going to try to find it. What is the problem you claim to have? In what software? On what platform?

> I don't see any downside in treating it like a Polynesian glottal stop.

I do. And to try to replace the apostrophe in English can't and don't and all is doomed to fail. Doomed. Moreover there are good practical reasons to change the glyph for the Polynesian letter.

When I typeset Greek, I will use 2019 for the apostrophe.
> Is someone going to tell me there is an advantage in treating "men's" as one word but "dogs'" as two? As I've said, the argument for encoding English apostrophes as U+2019 is that even with adequate keyboards, users cannot be relied upon to distinguish U+02BC and U+2019 - especially with no feedback. A writing system should choose one and stick with it. User unreliability forces a compromise.

Polynesian users need 02BC to be visually distinguished from 2019. European users don't need the apostrophe to be visually distinguished from 2019. The edge case of "dogs'" doesn't convince me. In all my years of typesetting I have never once noticed this, much less considered it a problem that needed fixing.

> Now, if text processors were to enable a difference, then the arguments would change. I for one find it helpful that Microsoft Word is willing to display visible symbols for spaces and tab characters so that I know what white space is composed of.

Most word-processing and typesetting programs will do this. Quark and InDesign do. Word and LibreOffice and Apple Pages do.

>> I didn't follow the beginning of this. Evidently it has something to do with word selection of d' + a space + what follows. If that's so, then there's no argument at all for 02BC. It's a question of the space, and that's got nothing to do with the identity of the apostrophe.
>
> The word selection issue is that except before a letter, the standard word-breaking algorithm says that there is a word boundary between the delta and apostrophe.

Well, that's the expected behaviour for a character which is polyvalent. If you have problems double-clicking "d' Artagnan" you should probably just write "d'Artagnan".

>>> Will your coding decision be machine readable for the readership?
>>
>> I don't know what you mean by 'readable'.
>
> Will the difference between U+02BC and U+2019 be discernible by the readers?

They should be, in Polynesian languages. Otherwise the text isn't easily legible.

> If one could copy a phrase to a general application and select a word by double-clicking, then the difference would be visible.

If you know what the behaviour is then you can take it into account when you are copying a word. You can't fix this by character encoding. Certainly not by screwing with 02BC.

> If the result of the publishing is simply a printed book, then your choice of U+2019 or U+02BC will depend only on font differences.

That non-argument can be applied to everything.

> Not that it makes much difference to the issue, but isn't the correct encoding for the ʻokina U+02BB MODIFIER LETTER TURNED COMMA?

Yes, but both 02BB and 02BC are used in linguistic transcriptions and in Polynesian languages, and the graphic identity with 2018 and 2019 is problematic and unnecessary. Using 02BC for the apostrophe is a mistake, in my view.

Michael Everson

From unicode at unicode.org Sun Jan 27 10:11:12 2019
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Sun, 27 Jan 2019 16:11:12 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org>
References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org>
Message-ID: <95DB9A94-B1E8-4CEA-929D-9AB3018CCF93@evertype.com>

Yes, yes. It doesn't matter. The discussion applies to both the two quotation marks and the two modifier letters.
> On 27 Jan 2019, at 15:08, Tom Gewecke via Unicode wrote:
>
>> On Jan 26, 2019, at 11:08 PM, Richard Wordingham via Unicode wrote:
>>
>> It may be a matter of literacy in Hawaiian. If the test readership doesn't use ʻokina,
>
> I think the Unicode Hawaiian ʻokina is supposed to be U+02BB (instead of U+02BC).

From unicode at unicode.org Sun Jan 27 11:32:42 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Sun, 27 Jan 2019 12:32:42 -0500
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2>
Message-ID: <2214f4d5-e440-69f0-18c2-be06aa064171@kli.org>

Well, sure; some languages work better with some fonts. There's nothing wrong with saying that 02BC might look the same as 2019... but it's nice, when writing Hawaiian (or Klingon for that matter) to use a bigger glyph. That's why they pay typesetters the big bucks (you wish): to make things look good on the page.

I recall in early Volapük, ʼ was a letter (presumably 02BC), with value /h/. And the "capital" ʼ was the same, except bolder: see https://archive.org/details/cu31924027111453/page/n11 (entry 4, on the left-hand page).

~mark

On 1/27/19 12:23 AM, Asmus Freytag via Unicode wrote:
> On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
>> the 02BC's need to be bigger or the text can't be read easily. In our work we found that a vertical height of 140% bigger than the quotation mark improved legibility hugely. Fine typography asks for some other alterations to the glyph, but those are cosmetic.
>> If the recommended glyph for 02BC were to be changed, it would in no case impact adversely on scientific linguistics texts. It would just make the mark a bit bigger. But for practical use in Polynesian languages where the character has to be found alongside the quotation marks, a glyph distinction must be made between this and punctuation.
>
> It somehow seems to me that an evolution of the glyph shape of 02BC in a direction of increased distinction from U+2019 is something that Unicode has indeed made possible by a separate encoding. However, that evolution is a matter of ALL the language communities that use U+02BC as part of their orthography, and definitely NOT something where Unicode can be permitted to take a lead. Unicode does not *recommend* glyphs for letters.
>
> However, as a publisher, you are of course free to experiment and to see whether your style becomes popular.
>
> There is a concern though, that your choice may appeal only to some languages that use this code point and not become universally accepted.
>
> A./

From unicode at unicode.org Sun Jan 27 11:38:39 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Sun, 27 Jan 2019 12:38:39 -0500
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2>
Message-ID: <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org>

On 1/27/19 11:08 AM, Michael Everson via Unicode wrote:
> It is a letter. In 'can't' the apostrophe isn't a letter. It's a mark of elision. I can double-click on the three words in this paragraph which have the apostrophe in them, and they are all whole-word selected.

That doesn't work when I try it: I double-click on the "a" in "can't" and get only the "can" selected.

This does not necessarily prove anything; my software (Thunderbird) is arguably doing it wrong.
~mark

From unicode at unicode.org Sun Jan 27 12:19:28 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Jan 2019 18:19:28 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org>
Message-ID: <20190127181928.2d5225a4@JRWUBU2>

On Sun, 27 Jan 2019 12:38:39 -0500 "Mark E. Shoulson via Unicode" wrote:

> On 1/27/19 11:08 AM, Michael Everson via Unicode wrote:
>> It is a letter. In 'can't' the apostrophe isn't a letter. It's a mark of elision. I can double-click on the three words in this paragraph which have the apostrophe in them, and they are all whole-word selected.
>
> That doesn't work when I try it: I double-click on the "a" in "can't" and get only the "can" selected.
>
> This does not necessarily prove anything; my software (Thunderbird) is arguably doing it wrong.

Except that Unicode-compliant processes aren't required to follow the scheme of TR29, Unicode Text Segmentation. However, it is only required to select the whole word because the U+2019 is followed by a letter.

TR29 prescribes different behaviour for "dogs'" with U+2019 (interpret as two 'words') and U+02BC (interpret as one word). The GTK-based email client I'm using has that difference, but also fails with "don't" unless one uses U+02BC.

However LibreOffice treats "don't" as a single word for U+0027, U+02BC and U+2019, but "dogs'" as a single word only for U+02BC. This complies with TR29. I'm not surprised, as LibreOffice does use or has used ICU.

Richard.

From unicode at unicode.org Sun Jan 27 12:19:52 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Jan 2019 18:19:52 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <95DB9A94-B1E8-4CEA-929D-9AB3018CCF93@evertype.com>
References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org> <95DB9A94-B1E8-4CEA-929D-9AB3018CCF93@evertype.com>
Message-ID: <20190127181952.6cd1ba46@JRWUBU2>

On Sun, 27 Jan 2019 16:11:12 +0000 Michael Everson via Unicode wrote:

> Yes, yes. It doesn't matter. The discussion applies to both the two quotation marks and the two modifier letters.

Actually, there is a difference. As the ʻokina doesn't occur at the end of a word in Hawaiian, one only strictly needs a contrast at the beginning of a word - unless Hawaiian makes significant use of the apostrophe for abbreviation. Unfortunately, U+02BB is worse than U+02BC from this perspective.

Richard.

From unicode at unicode.org Sun Jan 27 13:09:31 2019
From: unicode at unicode.org (James Tauber via Unicode)
Date: Sun, 27 Jan 2019 14:09:31 -0500
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <20190127181928.2d5225a4@JRWUBU2>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID:

On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> Except that Unicode-compliant processes aren't required to follow the scheme of TR29, Unicode Text Segmentation. However, it is only required to select the whole word because the U+2019 is followed by a letter.
> TR29 prescribes different behaviour for "dogs'" with U+2019 (interpret as two 'words') and U+02BC (interpret as one word). The GTK-based email client I'm using has that difference, but also fails with "don't" unless one uses U+02BC.
>
> However LibreOffice treats "don't" as a single word for U+0027, U+02BC and U+2019, but "dogs'" as a single word only for U+02BC. This complies with TR29. I'm not surprised, as LibreOffice does use or has used ICU.

This comes back to my original question that started this thread. Many people creating Ancient Greek digital resources use U+02BC seemingly because of incorrect word-breaking with *word-final* U+2019 (which is the only time it occurs in Ancient Greek and always marking elision, never as the end of a quotation).

I am trying to write guidelines as to why they should use U+2019. I'm convinced it's technically the right code point to use but am wanting to get my facts straight about how to address the word-breaking issue (specifically for word-final U+2019 in Ancient Greek, to be clear).

In my original post, I asked if a language-specific tailoring of the text segmentation algorithm was the solution but no one here has agreed so far.

Here's a concrete example from Smyth's Grammar:

??????? ??

Double-clicking on the first word should select the U+2019 as well. Interestingly on macOS Mojave it does in Pages[1] but not in Notes, the Terminal or here in Gmail on Chrome.

To be clear: when I say "should" I mean that that is the expectation classicists have and the failure to meet it is why some of them insist on using U+02BC.

I'm happy if the answer is "use U+2019 and go get your text segmentation implementations fixed"[2] but am looking for confirmation of that.

James

[1] To be honest, I was impressed Pages got it right.
[2] In the same spirit as "if certain combining character combinations don't work, the solution is not to add precomposed characters, it's to improve the fonts" or "tonos and oxia are the same and if they look different, it's the fault of your font".

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Jan 27 13:57:37 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 27 Jan 2019 19:57:37 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID:

On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
> In my original post, I asked if a language-specific tailoring of the text segmentation algorithm was the solution but no one here has agreed so far.

If there are likely to be many languages requiring exceptions to the segmentation algorithm wrt U+2019, then perhaps it would be better to establish conventions using ZWJ/ZWNJ and adjust the algorithm accordingly so that it would be cross-language. (Rather than requiring additional and open-ended language-specific tailorings.) (I inserted several combinations of ZWJ/ZWNJ into James Tauber's example, but couldn't improve the segmentation in LibreOffice, although it was possible to make it worse.)
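A minimal sketch of the behaviour under discussion, assuming Python with the PyICU bindings installed (none of the posters supplied this code): it prints the segments produced by ICU's default UAX #29 word-break rules, so the treatment of a word-final U+2019 versus U+02BC can be compared directly. The Greek sample is only a stand-in for any elided form and is not the garbled example from Smyth quoted above.

    # Illustrative sketch only; assumes the PyICU bindings (pip install PyICU).
    import icu

    def word_segments(text, locale="el"):
        # Default UAX #29 word boundaries as implemented by ICU.
        bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
        bi.setText(text)
        edges = [0] + list(bi)  # iterating the break iterator yields boundary offsets
        return [text[a:b] for a, b in zip(edges, edges[1:])]

    samples = [
        "\u1f00\u03bb\u03bb\u2019 \u1f10\u03b3\u03ce",  # placeholder Greek elision, final U+2019
        "\u1f00\u03bb\u03bb\u02bc \u1f10\u03b3\u03ce",  # same word with U+02BC
        "dogs\u2019",                                   # English possessive plural, U+2019
        "dogs\u02bc",                                   # same with U+02BC
    ]
    for s in samples:
        print(s, "->", word_segments(s))

Whether the trailing U+2019 ends up in its own segment is exactly the difference being argued over; keeping U+2019 and merging a word-final apostrophe back onto the preceding word would be the job of a language-specific tailoring or a post-processing step, not of the default rules.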
From unicode at unicode.org Sun Jan 27 14:03:00 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 27 Jan 2019 20:03:00 +0000
Subject: Encoding italic
In-Reply-To: <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com>
Message-ID: <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>

A new beta of BabelPad has been released which enables the input, storage, and display of italics, bold, strikethrough, and underline in plain text using the tag characters method described earlier in this thread. This enhancement is described in the release notes linked on this download page:

http://www.babelstone.co.uk/Software/index.html

From unicode at unicode.org Sun Jan 27 14:17:45 2019
From: unicode at unicode.org (Tom Gewecke via Unicode)
Date: Sun, 27 Jan 2019 13:17:45 -0700
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID: <42122A2E-0EA8-40CB-98D1-582CB4D54CFA@bluesky.org>

> On Jan 27, 2019, at 12:09 PM, James Tauber via Unicode wrote:
>
> ??????? ??
>
> Double-clicking on the first word should select the U+2019 as well. Interestingly on macOS Mojave it does in Pages[1] but not in Notes

On my iPad/iPhone, Word does it correctly but Pages and Notes do not.

From unicode at unicode.org Sun Jan 27 15:00:40 2019
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Sun, 27 Jan 2019 21:00:40 +0000 (GMT)
Subject: Ancient Greek apostrophe marking elision
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2>
Message-ID:

On 2019-01-27, Michael Everson via Unicode wrote:
> On 27 Jan 2019, at 05:21, Richard Wordingham wrote:
>> The closing single inverted comma has a different origin to the apostrophe.
> No, it doesn't, but you are welcome to try to prove your assertion.

As far as I can tell from the easily accessible literature, the apostrophe derives from an in-line manuscript mark that is a point with a tail, while the quotation marks derive from a marginal mark shaped like an arrowhead (like modern guillemets). What is your story about them?

>> Is someone going to tell me there is an advantage in treating "men's" as one word but "dogs'" as two? As I've said, the argument for encoding English apostrophes as U+2019 is that even with adequate keyboards, users cannot be relied upon to distinguish U+02BC and U+2019 - especially with no feedback. A writing system should choose one and stick with it. User unreliability forces a compromise.
>
> Polynesian users need 02BC to be visually distinguished from 2019. European users don't need the apostrophe to be visually distinguished from 2019. The edge case of "dogs'" doesn't convince me. In all my years of typesetting I have never once noticed this, much less considered it a problem that needed fixing.
You have a very low opinion of Polynesian users. People (as opposed to computers) use context to remove ambiguity. Before we had to interact with pedantic computers, we were rarely confused by the typewriter-induced confusion of 1 and l and 0 and O (or, indeed, the use of symmetrical quotation marks).

Now a sensible orthographic choice for a language using comma-like letters would be to use guillemets for quotation, and while I don't know (there being precious few modern Polynesian materials online), I would guess that the languages of French Polynesia do that. If, like Hawaiian, you're stuck with English-style quotation marks for historical reasons, an obvious typographic solution is to thin-space them, French-style. (See previous thread!) That seems visually preferable to relying on a small difference in size of what is already a small letter compared to everything else on the page.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

From unicode at unicode.org Sun Jan 27 15:30:55 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 27 Jan 2019 22:30:55 +0100
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <2214f4d5-e440-69f0-18c2-be06aa064171@kli.org>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <2214f4d5-e440-69f0-18c2-be06aa064171@kli.org>
Message-ID:

For Volapük, it looks much more like U+02BE (right half ring modifier letter) than like U+02BC (apostrophe "modifier" letter), according to the PDF on https://archive.org/details/cu31924027111453/page/n12

The half ring makes a clear distinction with the regular apostrophe (for elisions) or quotation marks. It is really used in this context as a modifier after other consonants for borrowing words *phonetically* from other languages, notably after 'c' and 'l'. Then U+02BD (left half ring "modifier" letter) is a regular letter (for transliterating the aspirated 'h' from English). But I'm curious about the diacritic used above 'h' on item (5) ("ta") of that page to transliterate the English soft "th".

But this was describing the "Labas" orthography. In the next chapter ("Noms Tonabas"), another convention is used for the apostrophe-like letters, and U+02BE (right half ring modifier letter) is used instead of U+02BD for the aspirated 'h' (see paragraph 18), but it is said to use the "Greek mark" (not sure if the author meant the coronis, U+1FBD, or the smooth breathing, U+1FBF).

So it looks like these were various early adaptations of the basic Volapük orthography to borrow foreign names (notably proper names for people, trademarks, toponyms and other place names), and these were part of several competing proposals. I'm curious to know if there was finally a wide enough consensus to standardize these.

So it seems that for Volapük the apostrophe-like letters are not formally assigned; authors will use whichever one they want when they transliterate foreign words, or will simply avoid transliterating them completely if they already exist natively in a Latin form. (I bet English is not transliterated at all, and French or German accents are preserved as-is if they are already part of the basic alphabet; the only standard diacritic is then the "diaeresis", as used in the German umlaut. Volapük does not need any true diaeresis to avoid the formation of diphthongs and digraphs; all its orthography uses a single base letter as a foundation principle.)
If so, the first convention, using the apostrophe-like modifier to create digraphs, is probably not favored, and the Tonabas convention is probably more convenient and more compliant with those principles. I don't think they will ever use the Greek signs or letters directly (like the one used for transliterating the English 'ng'), and they would now prefer using the Latin Eng letter.

The right half ring, being rarely supported, is now most probably represented using U+02BC (for both letter cases, ignoring the bolder style of the capital variant), which uses a curved comma shape (with a filled bowl at the top). If there is a case distinction, the same glyph would be used but at a different height instead of a bold distinction, or the distinction would be made using the alternate forms of the comma (probably the wedge for lowercase, and the bowl with curl for capitals).

Note: Are the different shapes of the comma (and similar apostrophe-like letters, or even the semicolon) distinguished with encoded variant selectors?

On Sun, 27 Jan 2019 at 18:42, Mark E. Shoulson via Unicode <unicode at unicode.org> wrote:

> Well, sure; some languages work better with some fonts. There's nothing wrong with saying that 02BC might look the same as 2019... but it's nice, when writing Hawaiian (or Klingon for that matter) to use a bigger glyph. That's why they pay typesetters the big bucks (you wish): to make things look good on the page.
>
> I recall in early Volapük, ʼ was a letter (presumably 02BC), with value /h/. And the "capital" ʼ was the same, except bolder: see https://archive.org/details/cu31924027111453/page/n11 (entry 4, on the left-hand page).
>
> ~mark
>
> On 1/27/19 12:23 AM, Asmus Freytag via Unicode wrote:
>> On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
>>> the 02BC's need to be bigger or the text can't be read easily. In our work we found that a vertical height of 140% bigger than the quotation mark improved legibility hugely. Fine typography asks for some other alterations to the glyph, but those are cosmetic.
>>> If the recommended glyph for 02BC were to be changed, it would in no case impact adversely on scientific linguistics texts. It would just make the mark a bit bigger. But for practical use in Polynesian languages where the character has to be found alongside the quotation marks, a glyph distinction must be made between this and punctuation.
>>
>> It somehow seems to me that an evolution of the glyph shape of 02BC in a direction of increased distinction from U+2019 is something that Unicode has indeed made possible by a separate encoding. However, that evolution is a matter of ALL the language communities that use U+02BC as part of their orthography, and definitely NOT something where Unicode can be permitted to take a lead. Unicode does not *recommend* glyphs for letters.
>>
>> However, as a publisher, you are of course free to experiment and to see whether your style becomes popular.
>>
>> There is a concern though, that your choice may appeal only to some languages that use this code point and not become universally accepted.
>>
>> A./

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Jan 27 17:21:31 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Jan 2019 23:21:31 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID: <20190127232131.077a4448@JRWUBU2>

On Sun, 27 Jan 2019 14:09:31 -0500 James Tauber via Unicode wrote:

> On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>> However LibreOffice treats "don't" as a single word for U+0027, U+02BC and U+2019, but "dogs'" as a single word only for U+02BC. This complies with TR29. I'm not surprised, as LibreOffice does use or has used ICU.
>
> This comes back to my original question that started this thread.

Yes. I'm driving home the problem for those who somehow fail to understand your opening post.

> Here's a concrete example from Smyth's Grammar:
>
> ??????? ??
>
> Double-clicking on the first word should select the U+2019 as well. Interestingly on macOS Mojave it does in Pages[1] but not in Notes, the Terminal or here in Gmail on Chrome.
>
> To be clear: when I say "should" I mean that that is the expectation classicists have and the failure to meet it is why some of them insist on using U+02BC.
>
> I'm happy if the answer is "use U+2019 and go get your text segmentation implementations fixed"[2] but am looking for confirmation of that.

The problem with that approach is that it assumes one can have a language-sensitive implementation, and that that will suffice.

Smyth's grammar gives the concrete example, ???????? ???. It contains the word ????. Should double-clicking the first Greek word in the paragraph above select it? That's not going to work if the paragraph above is considered to be in English. And what about double-clicking the third Greek word? What should that select? Or is that paragraph ungrammatical?

To fix the problem with the possessive plural "dogs'" with U+2019, one has to parse enough of the paragraph to distinguish an apostrophe from a closing single inverted comma. Moreover, it assumes that end-of-word apostrophes will not be included in a span bounded by single inverted commas. I may observe such a rule, but I don't remember being taught it.

In Unicode 2.0 the apostrophe was U+02BC; it was changed to U+2019 in Unicode 2.1. The justification I could find for the change is in the Unicore thread (members only) starting at https://www.unicode.org/mail-arch/unicore-ml/y1997-A/0185.html . The justification recorded there was merely that:

1) Windows and Mac Latin character sets had equivalents of U+0027, to which the 'letter apostrophe' was mapped, and U+2019, which was used for single quotes.
2) The 'punctuation apostrophe' was being mapped to U+2019 by the 'smart quote' apparatus.
3) For consistency, the 'punctuation apostrophe' should therefore be encoded by U+2019 instead of U+02BC.

This argument didn't persuade everyone even then, and it feels even weaker now. Perhaps I just have the problem that I don't see a sharp difference between the letter apostrophe and the punctuation apostrophe.
For example, when the pronunciation of English "letter" with a glottal stop as the intervocalic consonant is represented in writing as something like "le'er", is it a letter apostrophe because it's a glottal stop, or a punctuation apostrophe because the 'tt' is dropped?

The issue arises in the orthography of Finnish. The genitive singular of _keko_ 'a pile' is _keon_ - the 'k' is 'dropped' because of consonant gradation. However, regularly, the genitive singular of _raaka_ 'raw' is _raa'an_, where the U+0027 I wrote represents an apostrophe and is pronounced as a glottal stop. Is this a letter apostrophe or a punctuation apostrophe? The 'k' has been dropped by the same rule, but because of the vowel pattern it is replaced by a glottal stop and written with an apostrophe. English Wiktionary chooses U+2019: the Finnish Wiktionary ducks the issue and uses U+0027.

Richard.

From unicode at unicode.org Sun Jan 27 17:38:40 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Jan 2019 23:38:40 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID: <20190127233840.72bd25cb@JRWUBU2>

On Sun, 27 Jan 2019 19:57:37 +0000 James Kass via Unicode wrote:

> On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
>> In my original post, I asked if a language-specific tailoring of the text segmentation algorithm was the solution but no one here has agreed so far.
>
> If there are likely to be many languages requiring exceptions to the segmentation algorithm wrt U+2019, then perhaps it would be better to establish conventions using ZWJ/ZWNJ and adjust the algorithm accordingly so that it would be cross-language. (Rather than requiring additional and open-ended language-specific tailorings.) (I inserted several combinations of ZWJ/ZWNJ into James Tauber's example, but couldn't improve the segmentation in LibreOffice, although it was possible to make it worse.)

If you look at TR29, you will see that ZWJ should only affect word boundaries for emoji. ZWNJ shall have no effect. What you want is a control that joins words, but we don't have that.

Richard.

From unicode at unicode.org Sun Jan 27 17:44:18 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 28 Jan 2019 00:44:18 +0100
Subject: Encoding italic
In-Reply-To: <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>
Message-ID:

You're not very explicit about the Tag encoding you use for these styles. Of course it must not be a language tag, so the introducer is not U+E0001, nor a cancel-all tag, so it is not prefixed by U+E007F. It also cannot use letter-like, digit-like, and hyphen-like tag characters for its introduction.
So probably you use some prefix in U+E0002..U+E001F and some additional tag (tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S" for strikethrough?) and the cancel tag to return to normal text (terminate the tagged sequence).

Or maybe you just use standard HTML encoding by adding U+E0000 to each character of the HTML tag syntax (including attributes and close tags, allowing embedding?). So you use the "<" and ">" tag characters (possibly also the space tag U+E0020, or TAB tag U+E0009, for separating attributes, and the quotation tags for attribute values)? Is your proposal also allowing the embedding of other HTML objects (such as SVG)?

In that case what you do is only to remap the HTML syntax outside the standard text. If an attribute value contains standard text (such as ...) do you also remap the attribute value, i.e. "Some text"? Do you remap the technical name of the HTML tag itself, i.e. "span" in the last example?

And what then is the advantage compared to standard HTML (it is not more compact, and just adds another layer on top of it), except allowing it to be embedded in places where plain HTML would be restricted by form inputs, or would be reconverted using character entities, hiding the effect of "<", ">" and "&" in HTML so they are not reinterpreted as HTML but as plain-text characters?

Now let's suppose that your convention starts being decoded and used in some applications; this could be used to transport sensitive active scripts (e.g. Javascript event handlers or plain