From markus.icu at gmail.com Wed Jan 4 11:11:10 2023
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 4 Jan 2023 09:11:10 -0800
Subject: article: How gender-neutral emojis found their way onto the web
Message-ID:

original German:
https://www.heise.de/hintergrund/Wie-geschlechtsneutrale-Emojis-ihren-Weg-ins-Netz-fanden-7364527.html

English via Google Translate:
https://www-heise-de.translate.goog/hintergrund/Wie-geschlechtsneutrale-Emojis-ihren-Weg-ins-Netz-fanden-7364527.html?_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp

From kent.b.karlsson at bahnhof.se Wed Jan 4 18:53:40 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 5 Jan 2023 01:53:40 +0100
Subject: “plain text styling”…
Message-ID: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>

More or less regularly there are (informal) requests on this list for encoding (new) control codes or control code sequences for text styling (like bold, italics, text colour, …) also for “plain text”. This instead of using such things as RTF, SGML, HTML, ODF, etc. In the latter, the style (and other) controls are given as strings of printable characters (like , ), not involving control characters.

An important aspect of the (informal) proposals that pop up is that the formatting encoding is more lightweight, but also less powerful, than HTML/CSS, RTF, ODF, etc.

Not surprisingly, this is not a new idea. It has popped up several times during (computer) history, going back at least 50 years, probably more. The advantage this approach has is that by using a separate class of characters, no substring of printable characters (including SP, HT) can be confused with controls for text styling.

As I've mentioned long before, there is no need to reinvent that approach (unless you really, really want to...).
There are two of them that are still “alive”, one of which is given by a standard referenced by the Unicode/10646 standard(s).

One is Teletext (yes, it is still alive, though very much on the decline), now based on the ETSI EN 300 706 standard and also available over DVB. Teletext is not a standard referenced by Unicode. The other, referenced by Unicode, is ECMA-48 (a.k.a. ISO/IEC 6429, but I can never memorise that number).

Further, the text in Teletext is embedded in a rather complex out of line (i.e., outside of the text), but not very powerful, protocol. Even line breaks are in the out of line protocol, not inline, i.e., there are no CR, LF or other line break characters (and there is in reality actually no ESC character either). Though some of the text styling is inline, some is not (such as additional colours, and requesting a proportional font) and is instead given out of line in the protocol. (Nit: Teletext relies very much on code page switching, but **none** of that switching is done by escape sequences; there is not even any (real) ESC character, let alone any escape sequences.)

ECMA-48, however, is fully inline with the text. It is mostly still alive via terminal emulators, and for terminal emulators ECMA-48 will continue to be used for the foreseeable future. But the text styling part of ECMA-48 (plus proposed extensions/updates) can very well also be used as a text file storage format, allowing styled text documents to represent the styling via ECMA-48 control sequences. That it is “old” does not matter. Other parts of ECMA-48 concern things like cursor movement, window scrolling, and even (terminal emulator window) erase controls, and those "non-styling" controls are not suitable for styled text file storage. Unfortunately, the ECMA-48 text does not make such a clear distinction; it has to be inferred.

You may think that ECMA-48 is old-fashioned. And, yes, it was produced quite some time ago. But so was SGML, the origin of HTML and XML.
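
[Editorial sketch, not part of the original message.] The distinction between styling and non-styling controls can be illustrated with a small filter that keeps SGR sequences (CSI ... m) and drops every other CSI control sequence, such as erase or cursor movement. This is only a sketch under stated assumptions: it handles the 7-bit ESC [ form of CSI only, and it treats SGR as the only styling control, which is a simplification of what the message describes. The function name keep_styling_only is hypothetical.

```python
import re

# A CSI control sequence in its 7-bit form: ESC [ followed by
# parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F),
# and one final byte (0x40-0x7E).
CSI = re.compile(r"\x1b\[([0-?]*)([ -/]*)([@-~])")


def keep_styling_only(text: str) -> str:
    """Drop every CSI sequence except SGR (final byte 'm', no intermediates)."""
    def repl(match: re.Match) -> str:
        is_sgr = match.group(2) == "" and match.group(3) == "m"
        return match.group(0) if is_sgr else ""
    return CSI.sub(repl, text)


# Erase display, cursor home, bold on, bold off, cursor up 5 lines.
sample = "\x1b[2J\x1b[H\x1b[1mbold\x1b[22m text\x1b[5A"
cleaned = keep_styling_only(sample)
print(repr(cleaned))
```

In this sketch the erase and cursor sequences disappear while the bold on/off pair survives, which is roughly what a styled text file format would want to store.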
The styling commands in ECMA-48 are a bit cryptic, but they are also comparatively compact, especially in comparison to HTML/CSS. ECMA-48 styling controls may not originally have been intended as a storage format (but nothing in ECMA-48 prevents that), but as an output format (compare the “man” command in Unix/Linux, where the input/storage is an nroff typesetting command file, and the output uses ECMA-48 styling). But now, with text editors where style is edited by (selecting a substring and) using a menu selection or a keyboard shortcut, there is no technical reason why ECMA-48 styling cannot be used as a styled text file storage format.

It is, however, a while ago since the last update to ECMA-48, and that shows. I’ve compiled a proposed update for the text styling part of ECMA-48:
https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2022.pdf

As mentioned, using control codes for encoding styling is not an idea unique to ECMA-48. Two others are ISCII and Teletext. They do have some peculiarities in their formatting controls. However, by making certain additions to the formatting controls in ECMA-48 these peculiarities can be covered. Conversions from ISCII and Teletext are hinted at in tables in the paper referenced above. A few more hinted mappings (for PETSCII, ATASCII, EBCDIC, and various ISO registered sets of control codes) are given in https://www.unicode.org/L2/L2022/22013r-c0-c1-stability.pdf .

So ECMA-48 (esp. with updates as in the above reference) is a possible (relatively) light-weight text formatting alternative, in between the “pure plain text” and “powerful text formatting” (like MS Word, ODF, HTML/CSS). And, it fits in contexts that are otherwise “plain text”. ECMA-48 (with some additions) also enables proper conversion of older control code sets, some of which include text styling, to modern character coding sets, without losing or mistreating the “old” control codes.
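
[Editorial sketch, not part of the original message.] To make the idea of ECMA-48 styling as a storage format concrete, the following sketch wraps substrings in SGR sequences from the fifth edition of ECMA-48: 1 for bold, 3 for italicized, 22 for normal intensity and 23 for not italicized. The helper names (sgr, bold, italic) are hypothetical, and the sketch uses only the 7-bit ESC [ form of CSI.

```python
ESC = "\x1b"

# SGR parameter values from ECMA-48, fifth edition.
BOLD_ON, BOLD_OFF = "1", "22"      # 22: normal intensity (neither bold nor faint)
ITALIC_ON, ITALIC_OFF = "3", "23"  # 23: not italicized


def sgr(*params: str) -> str:
    """Build one SGR control sequence: CSI, parameters joined by ';', final byte 'm'."""
    return f"{ESC}[{';'.join(params)}m"


def bold(text: str) -> str:
    return sgr(BOLD_ON) + text + sgr(BOLD_OFF)


def italic(text: str) -> str:
    return sgr(ITALIC_ON) + text + sgr(ITALIC_OFF)


line = "A " + bold("bold") + " and " + italic("italic") + " word."
print(repr(line))
```

Note that bold is switched off with SGR 22; SGR 2 selects faint rather than ending bold.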
/Kent Karlsson

PS Using ECMA-48 styling is fully compatible with the math expression representation (C1 alternative), which is a completely separate proposal, that I sent an email about last month.

PPS ECMA-48 is a bit like Unicode/10646, in that it is a bit of a smorgasbord. Implementors may support the parts they decide to support. Implementations need not support everything specified.

From mark at kli.org Wed Jan 4 19:13:35 2023
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 4 Jan 2023 20:13:35 -0500
Subject: Re: “plain text styling”…
In-Reply-To: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>
Message-ID: <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com>

Actually not necessarily a bad idea, at least at first browsing, but it's kind of out of scope for Unicode, isn't it? It sounds like an update to ECMA-48 (which isn't part of Unicode), and they're the people you'd have to convince.

~mark

On 1/4/23 19:53, Kent Karlsson via Unicode wrote:
>
> ....
> It is, however, a while ago since the last update to ECMA-48, and that shows. I’ve compiled a proposed update for the text styling part of ECMA-48:
> https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2022.pdf
>
> ....

From kent.b.karlsson at bahnhof.se Wed Jan 4 19:23:47 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 5 Jan 2023 02:23:47 +0100
Subject: Re: “plain text styling”…
In-Reply-To: <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com>
Message-ID: <93C72EA1-D44E-4826-8B3E-A45E8B17F2B1@bahnhof.se>

Well, yes...
But the problem is that, IIUC, the ECMA-48 committee is currently the empty set of people?

/K

> On 5 Jan 2023, at 02:13, Mark E. Shoulson via Unicode wrote:
>
> Actually not necessarily a bad idea, at least at first browsing, but it's kind of out of scope for Unicode, isn't it? It sounds like an update to ECMA-48 (which isn't part of Unicode), and they're the people you'd have to convince.
>
> ~mark
>
> On 1/4/23 19:53, Kent Karlsson via Unicode wrote:
>> ....
>> It is, however, a while ago since the last update to ECMA-48, and that shows. I’ve compiled a proposed update for the text styling part of ECMA-48:
>> https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2022.pdf
>> ....

From doug at ewellic.org Wed Jan 4 20:06:52 2023
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 5 Jan 2023 02:06:52 +0000
Subject: RE: “plain text styling”…
In-Reply-To: <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com>
Message-ID:

Mark E. Shoulson replied to Kent Karlsson:

>> It is, however, a while ago since the last update to ECMA-48, and that shows. I’ve compiled a proposed update for the text styling part of ECMA-48:
>> https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2022.pdf
>
> Actually not necessarily a bad idea, at least at first browsing, but it's kind of out of scope for Unicode, isn't it? It sounds like an update to ECMA-48 (which isn't part of Unicode), and they're the people you'd have to convince.

Actually, Kent's document does include updates and clarifications that are specific to Unicode. So there is certainly something for readers of this list.

I agree with Kent's overall assessment that ECMA-48 is the way to go for styling attributes in an environment that strives to remain "plain text," and is far superior, for many reasons, to any proposal to create a completely new mechanism to achieve the same goal.

I have only had time to skim this latest 50-page update, but I would make the same suggestions that I have made before, plus a few others:

1. Clarifications to existing specifications and usage are fine.

2. Completely new inventions, even if they are in the spirit of ECMA-48, should be proposed in separate sections and handled with care. The argument that ECMA-48 is a time-tested standard, widely implemented, loses force in proportion to the amount of emphasis placed on unilaterally creating new stuff.

3. Deprecated items, items newly noted as "one should try to avoid," and other new restrictions on existing sequences or existing implementations should be proposed in separate sections, and handled with EXTREME care. Restricting platforms, for example, from implementing "bold" with zero color change, or from implementing "italic" or "oblique" at an angle outside the range 8°–12°, or attempting to forbid certain characters beyond what Unicode recommends, introduces a strong risk that the proposed new standard may be ignored. Think of the concessions that had to be made for Unicode itself to be adopted.

4. Tables that compare existing and proposed ECMA-48 mechanisms, and call attention to the changes, need to be included.

5. A table of contents and index, and perhaps a glossary, are badly needed for a document anywhere near this size.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From kent.b.karlsson at bahnhof.se Thu Jan 5 17:58:53 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Fri, 6 Jan 2023 00:58:53 +0100
Subject: Re: “plain text styling”…
In-Reply-To:
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com>
Message-ID:

> On 5 Jan 2023, at 03:06, Doug Ewell via Unicode wrote:
>
> Mark E. Shoulson replied to Kent Karlsson:
>
>>> It is, however, a while ago since the last update to ECMA-48, and that shows. I’ve compiled a proposed update for the text styling part of ECMA-48:
>>> https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2022.pdf
>>
>> Actually not necessarily a bad idea, at least at first browsing, but it's kind of out of scope for Unicode, isn't it? It sounds like an update to ECMA-48 (which isn't part of Unicode), and they're the people you'd have to convince.
>
> Actually, Kent's document does include updates and clarifications that are specific to Unicode. So there is certainly something for readers of this list.
>
> I agree with Kent's overall assessment that ECMA-48 is the way to go for styling attributes in an environment that strives to remain "plain text," and is far superior, for many reasons, to any proposal to create a completely new mechanism to achieve the same goal.
>
> I have only had time to skim this latest 50-page update, but I would make the same suggestions that I have made before, plus a few others:
>
> 1. Clarifications to existing specifications and usage are fine.
>
> 2. Completely new inventions, even if they are in the spirit of ECMA-48, should be proposed in separate sections and handled with care. The argument that ECMA-48 is a time-tested standard, widely implemented, loses force in proportion to the amount of emphasis placed on unilaterally creating new stuff.

For the most part they are in separate sections, marked as “new” or “extended with variants” in the heading, since you made that comment before. I did not want to put that in a separate document, since that would destroy the logical ordering and grouping (instead of the alphabetical/numerical ordering used in ECMA-48 5th edition, which makes it all so hard to read).

> 3. Deprecated items, items newly noted as "one should try to avoid," and other new restrictions on existing sequences or existing implementations should be proposed in separate sections, and handled with EXTREME care.
> Restricting platforms, for example, from implementing "bold" with zero color change,

(I think you meant to write the other way around, i.e. “as a” instead of “with zero”; unfortunately some terminal emulators implement bold as a colour change.) Bold and colour change are orthogonal. Specifying bold and getting a colour change as well is at odds with specifying colour as an RGB colour setting. What, for instance, would be the bold colour for RGB 255:140:0?

> or from implementing "italic" or "oblique" at an angle outside the range 8°–12°,

Sometimes, and that goes for other implementations than those implementing ECMA-48 as well, a default angle for italics/oblique is used that is annoyingly large (using a run-of-the-mill font, not counting especially artistic ones for special effects). That distracts rather than puts emphasis on the emphasized text.

> or attempting to forbid certain characters beyond what Unicode recommends,

I would need a more detailed comment or comments. This mailing list is not the right place for that (even though this particular comment was in direct reference to Unicode).

> introduces a strong risk that the proposed new standard may be ignored. Think of the concessions that had to be made for Unicode itself to be adopted.
>
> 4. Tables that compare existing and proposed ECMA-48 mechanisms, and call attention to the changes, need to be included.

“Noted.”
(As the standard committee parlance goes; meaning “I will think about it”.)

> 5. A table of contents and index, and perhaps a glossary, are badly needed for a document anywhere near this size.

“Noted.” While there is no index as such, there is a summary on pages 46 to 50; almost an index. (A ToC would be easy to generate; but I might put it as an appendix, not up front; it is not a book. I did try to make a logical ordering and grouping; that in contrast to ECMA-48 5th ed., which uses alphabetical ordering, breaking all logical grouping.)

Kind regards
/Kent K

>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>

From doug at ewellic.org Thu Jan 5 23:51:56 2023
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 6 Jan 2023 05:51:56 +0000
Subject: RE: “plain text styling”…
In-Reply-To:
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com>
Message-ID:

Kent Karlsson wrote:

>> 2. Completely new inventions, even if they are in the spirit of ECMA-48, should be proposed in separate sections and handled with care. The argument that ECMA-48 is a time-tested standard, widely implemented, loses force in proportion to the amount of emphasis placed on unilaterally creating new stuff.
>
> For the most part they are in separate sections, marked as “new” or “extended with variants” in the heading, since you made that comment before.

I do see that individual functions are marked “new” or “extended with variants” in the heading, which does help. I was thinking more of putting all the new items together in one top-level section, and all the extended items together in another top-level section, separate from the unchanged or merely clarified items. This is better than not announcing them at all, though.

> I did not want to put that in a separate document, since that would destroy the logical ordering and grouping (instead of the alphabetical/numerical ordering used in ECMA-48 5th edition, which makes it all so hard to read).

I didn’t suggest putting them in a separate document, at least not here. That said, I would suggest that you should not necessarily feel constrained to abide by the way ECMA-48 itself is arranged, especially if you aren’t planning to submit this as a formal update to the standard.

>> Restricting platforms, for example, from implementing "bold" with zero color change,
>
> (I think you meant to write the other way around, i.e. “as a” instead of “with zero”; unfortunately some terminal emulators implement bold as a colour change.)

Yes, I did phrase it backward.

> Bold and colour change are orthogonal. Specifying bold and getting a colour change as well is at odds with specifying colour as an RGB colour setting. What, for instance, would be the bold colour for RGB 255:140:0?

I agree that it is far from ideal if a platform is incapable of true bolding and displays “bold white” as a brighter white (in contrast to “light gray”). And certainly not every color can be made into “fake bold” with a brighter color. It’s a hack, to be sure. We will have to disagree as to whether such a platform should be forced to not support CSI 1m at all.

> Sometimes, and that goes for other implementations than those implementing ECMA-48 as well, a default angle for italics/oblique is used that is annoyingly large (using a run-of-the-mill font, not counting especially artistic ones for special effects). That distracts rather than puts emphasis on the emphasized text.

Some fonts are designed poorly. I don’t think this document is the place to make that aesthetic judgment.
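
[Editorial sketch, not part of the original message.] On the bold versus colour point discussed above, the following sketch shows the two attributes as independent SGR parameters: weight is set with 1 and cleared with 22, while the foreground colour is set separately and reset with 39. The semicolon-separated 38;2;R;G;B form used here is an assumption based on what many terminal emulators accept; the thread does not say which parameter syntax is intended, and the proposal under discussion may specify it differently. The helper names are hypothetical.

```python
def sgr(*params: int) -> str:
    """Build one SGR control sequence from integer parameters (7-bit CSI form)."""
    return "\x1b[" + ";".join(str(p) for p in params) + "m"


def foreground_rgb(r: int, g: int, b: int) -> str:
    # Assumed de facto form: 38 ; 2 ; R ; G ; B
    return sgr(38, 2, r, g, b)


# Bold and colour are set independently and reset independently:
# 22 = normal intensity, 39 = default foreground colour.
orange_bold = sgr(1) + foreground_rgb(255, 140, 0) + "warning" + sgr(22, 39)
print(repr(orange_bold))
```

A renderer that treats bold as "pick a brighter palette entry" has no obvious answer for a direct colour such as 255:140:0, which is the conflict raised in the exchange.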

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From wjgo_10009 at btinternet.com Fri Jan 6 06:49:06 2023
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 6 Jan 2023 12:49:06 +0000 (GMT)
Subject: Re: “plain text styling”…
In-Reply-To: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>
Message-ID: <64e249b4.30399.1858720b8fe.Webtop.96@btinternet.com>

Kent Karlsson wrote an interesting post.

> More or less regularly there are (informal) requests on this list for encoding (new) control codes or control code sequences for text styling (like bold, italics, text colour, …) also for “plain text”.

> As I've mentioned long before, there is no need to reinvent that approach (unless you really, really want to...).

Well, I want a fresh system designed specifically to be compatible with Unicode please.

A way to do this for indicating italics has been proposed using Variation Selector 14.

Alas, it was rejected by the Unicode Technical Committee.

The method described could be extended using other variation selectors for bold, bold italic, and for various colours too. There are hundreds of variation selectors available, so using some of them for this elegant futuristic proposal would not restrict uses of variation selectors for other purposes.

The method can be implemented using existing font technology.

People who do not want to use the method could simply ignore it.

Yet for the people who choose to use it, documents in plain text format could be used to archive text that has features such as italics and colour within the text.

Although this method of enhancing plain text could be implemented straightforwardly if the Unicode Technical Committee were to approve it, the method has been rejected and thus it cannot be implemented at the present time and cannot be applied to improve information technology at the present time.

But was that rejection a rejection for ever or just a rejection at that time? For example, the Unicode Technical Committee at one time decided not to encode emoji.

I hope that the method using variation selectors can be reconsidered please and that the method can be approved by the Unicode Technical Committee so that people who use Unicode can, if they so choose, use the proposed system in their documents and communications.

It would be a magnificent decision for progress.

William Overington

Friday 6 January 2023

From liste at secarica.ro Sat Jan 7 05:37:33 2023
From: liste at secarica.ro (Cristian Secară)
Date: Sat, 7 Jan 2023 13:37:33 +0200
Subject: “plain text styling”…
In-Reply-To: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>
Message-ID: <20230107133329.00007e18@secarica.ro>

On Thu, 5 Jan 2023 01:53:40 +0100, Kent Karlsson via Unicode wrote:

> More or less regularly there are (informal) requests on this list for encoding (new) control codes or control code sequences for text styling (like bold, italics, text colour, …) also for “plain text”.

This seems to overlook that a "plain text" subjected to such torment can no longer be called "plain".

Or, how do you differentiate this plain text from the other plain text?
"I am sending this e-mail in strict plain text"
"I am sending this e-mail in a somewhat plain text"
"I am sending this e-mail in a complicated plain text"
"I am sending this e-mail in a code-controlled plain text"

Also in places where the number of characters matters (and supposing the editor knows how to interpret, and therefore, hide the control characters), like an SMS text message sent over a GSM network [1], one may become confused about the strange increase (or decrease, if a limit is imposed) of the character count.

> This instead of using such things as RTF, SGML, HTML, ODF, etc. In the latter, the style (and other) controls are given as strings of printable characters (like , ), not involving control characters.

Not sure I understand, especially that you later mentioned ECMA-48.

From a simple (basic) text editor perspective that knows nothing about styling, what is the difference between displaying these two examples related to the same intended result?
<b>bold</b>
versus
\x1b[1mbold\x1b[2m

Same question if the ~simple text editor *knows* about both of the above styling methods?

Or perhaps from the user perspective?

Cristi

[1] https://www.secarica.ro/index.php/eue/sms-story/the-sms-discrimination

--
Cristian Secară
https://www.secarica.ro

From kent.b.karlsson at bahnhof.se Sun Jan 8 08:15:21 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sun, 8 Jan 2023 15:15:21 +0100
Subject: Re: “plain text styling”…
In-Reply-To: <20230107133329.00007e18@secarica.ro>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro>
Message-ID: <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se>

> On 7 Jan 2023, at 12:37, Cristian Secară wrote
via Unicode:
>
> On Thu, 5 Jan 2023 01:53:40 +0100, Kent Karlsson via Unicode wrote:
>
>> More or less regularly there are (informal) requests on this list for encoding (new) control codes or control code sequences for text styling (like bold, italics, text colour, …) also for “plain text”.
>
> This seems to overlook that a "plain text" subjected to such torment can no longer be called "plain".
>
> Or, how do you differentiate this plain text from the other plain text?
> "I am sending this e-mail in strict plain text"
> "I am sending this e-mail in a somewhat plain text"
> "I am sending this e-mail in a complicated plain text"
> "I am sending this e-mail in a code-controlled plain text"

The point is that the “protocol” is at plain text level. That is why ECMA-48 styling can work for applications like terminal emulators, where higher-level protocols, like HTML, are out of the question.

> Also in places where the number of characters matters (and supposing the editor knows how to interpret, and therefore, hide the control characters), like an SMS text message sent over a GSM network [1], one may become confused about the strange increase (or decrease, if a limit is imposed) of the character count.

Apart from that the SMS (and cell broadcast) transmission protocol(s) have the capability of having split messages that are reassembled by the receiver... The SMS (and cell broadcast) 7-bit character encodings (there is a handful of them) all have just four “control codes”: CR, LF, FF, and SS2 (misnamed(!) as ESC). There is no ESC character nor any CSI character. What is referred to as UCS-2 (read as UTF-16BE) should therefore, for SMS and cell broadcast, be seen as only having CR, LF and FF and no other control characters (though the standard for SMS character encodings is silent on that point).

So SMS and cell broadcast messages are out of scope for that simple reason.

>
>> This instead of using such things as RTF, SGML, HTML, ODF, etc.
>> In the latter, the style (and other) controls are given as strings of printable characters (like , ), not involving control characters.
>
> Not sure I understand, especially that you later mentioned ECMA-48.
>
> From a simple (basic) text editor perspective that knows nothing about styling, what is the difference between displaying these two examples related to the same intended result?
> <b>bold</b>
> versus
> \x1b[1mbold\x1b[2m

The first one is a higher level protocol (interpreting substrings consisting purely of “printable characters” as controls; counting SP, HT and LF as “printable”), the second is a text level protocol.

> Same question if the ~simple text editor *knows* about both of the above styling methods?

If we are talking about files, the file name suffix is the most common way of dealing with that (like .html vs. .txtf vs. .txt).

Kind regards
/Kent K

> Or perhaps from the user perspective?
>
> Cristi
>
> [1] https://www.secarica.ro/index.php/eue/sms-story/the-sms-discrimination
>
> --
> Cristian Secară
> https://www.secarica.ro
>

From sosipiuk at gmail.com Sun Jan 8 11:34:42 2023
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Sun, 08 Jan 2023 17:34:42 +0000
Subject: “plain text styling”…
In-Reply-To: <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se>
References: <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se>
Message-ID: <1673198273584.646781636.1518457991@gmail.com>

On Sunday, 08 January 2023, 09:15:21 (-05:00), Kent Karlsson via Unicode wrote:

>
> The point is that the “protocol” is at plain text level. That is why ECMA-48 styling can work for applications like terminal emulators, where higher-level protocols, like HTML, are out of the question.

This does not make sense. Both are formats that need to be interpreted by the display software or they just look like junk within the visible text. HTML and ECMA-48 are no different in principle.
You can write a terminal emulator that respects basic HTML styling. The only reason it hasn't been done is because there is no demand, and that is because of historical reasons (including that many terminal scripting languages have syntax that would conflict with HTML).

> > From a simple (basic) text editor perspective that knows nothing about styling, what is the difference between displaying these two examples related to the same intended result?
> > <b>bold</b>
> > versus
> > \x1b[1mbold\x1b[2m
>
> The first one is a higher level protocol (interpreting substrings consisting purely of “printable characters” as controls; counting SP, HT and LF as “printable”), the second is a text level protocol.

No. Whether "<" or \x1b is a special syntax introducer makes no real difference. You need something to recognize it and interpret it. Both standards are about interpreting substrings, with opening and closing characters and formatting information between them. There is nothing inherently special about having the characters be below \x20, certainly not any more than, for example, using the tag characters.

From sosipiuk at gmail.com Sun Jan 8 11:46:00 2023
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Sun, 08 Jan 2023 17:46:00 +0000
Subject: “plain text styling”…
In-Reply-To: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se>
Message-ID: <1673199392376.205605608.4273675340@gmail.com>

On Wednesday, 04 January 2023, 19:53:40 (-05:00), Kent Karlsson via Unicode wrote:

> The advantage this approach has is that by using a separate class of characters, no substring of printable characters (including SP, HT) can be confused with controls for text styling.

I don't see this as a major concern. IMO, what people want from Unicode styling is one or both of these things:

1.
Extremely compact styling, made possible by assigning dedicated characters for each style
2. Default-ignorable styling markup that neatly disappears if it cannot be interpreted, made possible by the existing set of default-ignorable Unicode characters.

ECMA-48 does the first not very well, and the second not at all.

From harjitmoe at outlook.com Sun Jan 8 15:45:01 2023
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sun, 8 Jan 2023 21:45:01 +0000
Subject: State of ECMA-48 in a Unicode age (was Re: “plain text styling”…)
In-Reply-To: <93C72EA1-D44E-4826-8B3E-A45E8B17F2B1@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com> <93C72EA1-D44E-4826-8B3E-A45E8B17F2B1@bahnhof.se>
Message-ID:

Kent Karlsson via Unicode wrote:
> Well, yes... But the problem is that, IIUC, the ECMA-48 committee is currently the empty set of people?
>
> /K

---

Tangentially to this, I very much believe that a new edition of ECMA-48 which clarifies and addresses its relationship both to Unicode and to established convention would be of practical benefit, with the following points standing out to me:

• The penultimate edition of ECMA-48 (fourth edition, December 1986, still archived at the bottom of ECMA's page for ECMA-48) deprecates a number of mode flags and control functions in Appendix E (pages 84–87). Notable amongst these deprecations is LF/NLM (E.1.3, bottom of page 84 / top of page 85), i.e. the mode flag that toggles whether linefeeds imply a carriage return. The note (E.1) attached to that section explains the deprecation, stating essentially that full line breaks should use either NEL or CR+LF moving forward; the IND control for explicit bare linefeed is also in the deprecated features appendix as E.2.3.
In the fifth and current edition (June 1991), per the 1998 reprint available as PDF from ECMA's page, both LF/NLM and IND were removed altogether, announced in annex F.5.2 (page numbered 88 / 102nd PDF page) and annex F.8.2 (page numbered 89 / 103rd PDF page) respectively. The definition of LF (section 8.3.74, page numbered 49 / 63rd PDF page) unambiguously specifies a move to the "corresponding character position" (as opposed to the start) of the following line. Therefore, /most terminal emulators (which accept bare LF as a full newline in their default modes) are actually in violation of the current edition of ECMA-48 (and exhibit deprecated but permitted behaviour per the edition before), as are virtually all modern text editors, for example/. Honestly, the elimination of the LF/NLM mode comes across as wishful thinking on the part of the committee, but hindsight is 20:20. A mode such as LF/NLM probably ought to be restored so as to align the standard with the by-now-set-in-stone reality.

• Speaking of ECMA-48 and the CR vs LF vs CRLF vs NEL vs LSEP issue, better coördination between UAX 14 and ECMA-48 might be in order. This doesn't cause as much of an issue in practice, since the contexts where ECMA-48 is actually implemented (monospaced terminal emulators) are largely disjoint from those where UAX 14 is implemented, but it should be clearer how an implementation can be concordant with both (for example, whether an ECMA-48 conformant implementation of CR or VT is sufficient to count as a line break for UAX 14 purposes). This is particularly relevant should one wish to use ECMA-48 in a non-terminal context, as seems to be part of the present discussion.

• Section 5 needs reworking to address how it interacts with Unicode Transformation Formats (other than the erstwhile abortive UTF-1, which it works fine with, for all this has any effect on anything). The representation of the C0 and C1 codes is given in 7-bit and 8-bit column/line bit combinations.
I believe ISO/IEC 10646 briefly addresses how these translate to UTF-16 or UTF-32 (padding to code unit width), but this would be ideal to have addressed in ECMA-48 itself in this day and age; furthermore, even with that provision, "bit combinations from 08/00 to 09/15" in the context of UTF-8 arguably prescribes fragmentary or invalid UTF-8 sequences rather than the UTF-8 representations of the C1 code points. ? Also in section 5: command strings (DCS, OSC, PM and APC) are limited to 0x08?0D (the ASCII FEx format effectors) and 0x20?7E (the ASCII printing characters including space).? This is contrasted with character strings (SOS) which have no such restriction, with only SOS itself being forbidden (and ST not includable due to being the terminator).? In practice, not only ASCII printing characters but arbitrary Unicode characters?other than Cc control codes outside of the aforementioned 0x08?0D range?are permitted in OSC sequences recognised by terminal emulators, which often contain text.? For example, "\u{9D}0;flamb?\u{9C}" will set a terminal window title to "flamb?", even though "?" is not an ASCII character.? This is another area which probably needs updating to align it with both industry practice and a Unicode age. ? The characters listed as affected by the FEAM mode in section 7.2.5 needs looking at?for instance, it lists BPH (equivalent to ZWSP) but not its opposite NBH (equivalent to WJ).? It also lists CR and NEL but not LF, all of which are format effectors per section 8.2.4.? The interaction with Unicode general categories should also be addressed: presumably it would apply to the format category (Cf), and possibly also Zl and Zp, in addition to the specific listed Cc characters and CSI sequences, but this should be addressed in the FEAM definition, annex A.1, or both. ? Speaking of annex A, annex A.2 and the GCC sequence might deserve addressing as to their relation to Unicode.? 
Certainly, ECMA-43 (conformed to by ISO 8859; and yes, the graphical resemblance between "ECMA-43" and "ECMA-48" can be confusing whenever these two standards are discussed together) puts significant limitations on the ECMA-48 codes used for composition, prohibiting any such composition that creates a new character rather than merely a ligature of existing characters (see annex C of ECMA-43, contrast with annex A.2 of ECMA-48). This both bans backspace composition, and constrains the use of GCC to discretionary ligatures (which is not explicitly constrained by ECMA-48 itself; indeed, annex A seems to prescribe GCC as a migration path from backspace composition, although the note on the definition of GCC itself in section 8.3.54 mentions CJK square ligatures as the simple case, not diacritic composition or APL composition). Backspace composition is similarly not really compatible with the Unicode model of base characters, combining characters and pre-composed diacritic-bearing characters (and composed-symbol APL operators without decompositions), although discretionary ligatures are manifestly compatible with the Unicode character model (see e.g. the OpenType dlig feature and the CSS font-variant-ligatures property).

--Har.

From kent.b.karlsson at bahnhof.se Sun Jan 8 16:08:49 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sun, 8 Jan 2023 23:08:49 +0100
Subject: Re: “plain text styling”…
In-Reply-To: <1673198273584.646781636.1518457991@gmail.com>
References: <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <1673198273584.646781636.1518457991@gmail.com>
Message-ID: <0E3F4645-6943-411A-B1AA-4E0C191712A2@bahnhof.se>

> On 8 Jan 2023, at 18:34, Sławomir Osipiuk via Unicode wrote:
>
> On Sunday, 08 January 2023, 09:15:21 (-05:00), Kent Karlsson via Unicode wrote:
>>
>> The point is that the “protocol” is at plain text level. That is why ECMA-48 styling can work for applications like terminal emulators, where higher-level protocols, like HTML, are out of the question.
>
> This does not make sense. Both are formats that need to be interpreted by the display software or they just look like junk within the visible text.

Yes?

> HTML and ECMA-48 are no different in principle.

On this point they are wildly different. One is possible to use in contexts such as terminal emulators, indeed intended for such use. The other one cannot be used in such contexts.
And the precise reason is that one is a plain text protocol, and the other a higher level protocol. One cannot make an HTML(-like) based terminal emulator, since the controls in HTML are purely printable characters (which in turn requires that certain characters *must* be represented via character escapes, like <, otherwise risk being part of a control).

Now, using ECMA-48 styling controls for styling text (that may be stored in a file) is not vitally dependent on that. It is, for that use, just a question of reuse of an already existing mechanism for specifying styling. That mechanism need not be locked in to be used only for terminal emulators. (Though some of the proposed additions may be useful also for terminal emulators, and indeed some already are; I “grabbed” some suggestions from additions already implemented in some terminal emulators, with the intent of not compromising those implementations.)

> You can write a terminal emulator that respects basic HTML styling.

Nope. Violates the plain text principle of terminal emulators. (Besides, HTML has a nesting structure, but that is a different obstacle for your suggestion here.)

> The only reason it hasn't been done is because there is no demand, and that is because of historical reasons (including that many terminal scripting languages have syntax that would conflict with HTML).
>
>>> From a simple (basic) text editor perspective that knows nothing about styling, what is the difference between displaying these two examples related to same intended result ?
>>> bold
>>> versus
>>> \x1b[1mbold\x1b[2m
>>
>> The first one is a higher level protocol (interpreting substrings consisting purely of “printable characters” as controls; counting SP, HT and LF as “printable”), the second is a text level protocol.
>
> No. Whether "<" or \x1b is a special syntax introducer makes no real difference.

Except that it does. See above.

> You need something to recognize it and interpret it.

Yes, but that is not “it”.

> Both standards are about interpreting substrings, with opening and closing characters and formatting information between them. There is nothing inherently special about having the characters be below \x20, certainly not any more than, for example, using the tag characters.

There very much is a difference between control characters and printable characters (including SP, LF, CR, HT), in that the latter are “normal text” to be printed, while control characters are, well, control characters not to be printed. True, the distinction is somewhat “muddled” by the fact that SP/CR/LF/HT/VT/NEL aren’t all that “pure control”, while characters like SHY actually are control characters but not formally counted as such. Plus the various control characters introduced by Unicode (like bidi controls; note that HTML has its own separate way of doing bidi controls, using printable characters, not the Unicode bidi controls). So I agree that it is not straightforward, but there really is a difference.

Kind regards
/Kent K
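
The distinction argued over in the message above (styling introduced by a control character versus styling spelled out in printable characters) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not anything taken from ECMA-48 tooling or from the proposal: the helper name strip_csi and the sample strings are made up for this example, the regular expression covers only the 7-bit CSI form (ESC followed by "["), and SGR 22 (normal intensity) is used here to end the bold run.

```python
import html
import re

# 7-bit CSI control sequence: ESC [ , then parameter bytes 0x30-0x3F,
# then intermediate bytes 0x20-0x2F, then exactly one final byte 0x40-0x7E.
CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text):
    """Remove CSI control sequences; printable characters are left untouched."""
    return CSI.sub("", text)


# Hypothetical sample: the text itself contains a literal "<".
styled = "if a<b then \x1b[1mbold\x1b[22m else plain"

# Because the styling is introduced by a control character (ESC), the
# literal "<" in the text needs no escaping and survives unchanged.
print(strip_csi(styled))

# With printable-character markup, the same "<" has to be escaped first,
# or it could be mistaken for the start of a tag.
markup = "if " + html.escape("a<b") + " then <b>bold</b> else plain"
print(markup)
```

A consumer that knows nothing about styling can drop the control sequences with a pattern like this without ever inspecting the printable text, which seems to be the property Kent is pointing at; whether that outweighs the drawbacks raised elsewhere in the thread is a separate question.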
From kent.b.karlsson at bahnhof.se Sun Jan 8 16:08:45 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sun, 8 Jan 2023 23:08:45 +0100
Subject: Re: “plain text styling”…
In-Reply-To: <1673199392376.205605608.4273675340@gmail.com>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <1673199392376.205605608.4273675340@gmail.com>
Message-ID:

> On 8 Jan 2023, at 18:46, Sławomir Osipiuk wrote:
>
> On Wednesday, 04 January 2023, 19:53:40 (-05:00), Kent Karlsson via Unicode wrote:
> > The advantage this approach has is that by using a separate class of characters, no substring of printable characters (including SP, HT) can be confused with controls for text styling.
> >
>
> I don't see this as a major concern. IMO, what people want from Unicode styling is one or both of these things:
>
> 1. Extremely compact styling, made possible by assigning dedicated characters for each style

Well, depending on your ambition, you may have to define very many “dedicated characters for each style”.

> 2. Default-ignorable styling markup that neatly disappears if it cannot be interpreted, made possible by the existing set of default-ignorable Unicode characters.

In an earlier draft (not so widely circulated), I hinted at the possibility of mapping ASCII characters within a control sequence to TAG characters to make the entire sequence “default ignorable”. But I removed that little hint, since I don’t think it would be a good idea (it would further overload the TAG characters; it would not be compatible with ECMA-48; and it might unduly prevent future extensions; indeed, for SCI and math expressions (which otherwise have nothing to do with ECMA-48) I sometimes use a non-ASCII character after the SCI).

> ECMA-48 does the first not very well,

I’d say it does it quite well, at least compared with many of the existing alternatives. In addition, parameters allow quite a lot of formatting to be specified concisely and fairly generally; for colours, for instance, one can use RGB values.

> and the second not at all.

I did hint at one way of doing that (based on ECMA-48) at one point, as I just mentioned. But I deleted that hint.

Kind regards
/Kent K

From kent.b.karlsson at bahnhof.se Sun Jan 8 16:26:58 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Sun, 8 Jan 2023 23:26:58 +0100
Subject: Re: State of ECMA-48 in a Unicode age (was Re: “plain text styling”…)
In-Reply-To:
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com> <93C72EA1-D44E-4826-8B3E-A45E8B17F2B1@bahnhof.se>
Message-ID: <692BC82C-2A1C-4C5C-9671-E1351A2E45DB@bahnhof.se>

I will likely not have a response to all the comments below. But just two things:

1) In the proposal I wrote, all the so-called modes are deprecated. As far as I know, none were ever implemented anywhere, and all were a bad idea.

2) “most terminal emulators (which accept bare LF as a full newline in their default modes)”

I don’t think that is true. Firstly, we have the “cooked” vs. “raw” modes (when “echo mode” is on) in Unix/Linux ttys; that is a setting for the tty, not the terminal emulator. Then there are such things as shell command line tools (like bash) which do their own “echoing” (tty echo off), and do so in quite complicated ways, allowing for editing in the command line, as well as command history. But all that is out of scope for ECMA-48, except that ECMA-48 control sequences (not the styling ones, but other control sequences for “screen editing”) are used to implement their behaviour.

Kind regards
/Kent K

From marius.spix at web.de Mon Jan 9 02:22:30 2023
From: marius.spix at web.de (Marius Spix)
Date: Mon, 9 Jan 2023 09:22:30 +0100
Subject: Aw: Re: “plain text styling”…
In-Reply-To: <0E3F4645-6943-411A-B1AA-4E0C191712A2@bahnhof.se>
References: <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <1673198273584.646781636.1518457991@gmail.com> <0E3F4645-6943-411A-B1AA-4E0C191712A2@bahnhof.se>
Message-ID:

An HTML attachment was scrubbed...

From kent.b.karlsson at bahnhof.se Mon Jan 9 08:46:29 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Mon, 9 Jan 2023 15:46:29 +0100
Subject: Re: “plain text styling”…
In-Reply-To:
References: <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <1673198273584.646781636.1518457991@gmail.com> <0E3F4645-6943-411A-B1AA-4E0C191712A2@bahnhof.se>
Message-ID:

> On 9 Jan 2023, at 09:22, Marius Spix wrote:
>
> We should also be aware that plain text styling has many potential security risks. For example, if we had the characters and someone could create two different strings which look exactly the same like "Hallo World" and "Hllo World". This may allow identity spoofing, bypassing regex filters, phishing or even hash collision attacks.

((Not sure where the “a” went?))

This is true for any kind of “edit” (automatic, semi-automatic, manual) that may be done between such a security check and any kind of “execution”, including but not limited to:

• Unicode normalisation
• Replacing malformed UTF-8 or UTF-16 with some kind of replacement, like ?, SUB, REPLACEMENT CHARACTER, or deleting it
• Case mapping
• Encoding mapping (like mapping to ASCII), which may “lose”/replace non-convertible characters
• “Drop accents” mapping
• Adding or removing HTML tags
• Expanding (or inserting) any kind of character or string references (like \uNNNN, &xNNNN;, <, \xNN, …)
• Replacing quote marks by other quote marks
• Removing/replacing any “undesirable” control character or default ignorable character
• Spell corrections
• Correcting syntax errors (if it is some kind of command or query)
• Doing bidi reordering to get a “visual order” string; or using bidi controls to confuse the character order; or any kind of inverse bidi reordering
• and lots more

So there is nothing new or unique with using ECMA-48 text styling in this regard.

> Styling is supposed to be done at application layer.

Yes? (Not sure what you are aiming at here.)

Kind regards
/Kent K

From steffen at sdaoden.eu Mon Jan 9 12:39:39 2023
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Mon, 09 Jan 2023 19:39:39 +0100
Subject: “plain text styling”…
In-Reply-To:
References: <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <1673198273584.646781636.1518457991@gmail.com> <0E3F4645-6943-411A-B1AA-4E0C191712A2@bahnhof.se>
Message-ID: <20230109183939.3-9We%steffen@sdaoden.eu>

Kent Karlsson via Unicode wrote in :
https://raw.githubusercontent.com/kent-karlsson/control/main/ecma-48-style-modernisation-2022.pdf

Very interesting (whether it will fly .. who can tell).
I only wanted to note that the OSC-8 sequence has become a carrier for hyperlinks and IDs ([1]); the next groff(1) release will bring support for the grotty(1) driver, and make use of it for manual pages. Actually mutilated support: I had rewritten my 2014 idea to bring fully interactive references to UNIX manual pages, including an extension to the less(1) pager, to OSC-8 sequences in, I think, 2020, but just mid of last year it was quite well, and there are enhancement requests[2] (though all that is still lingering, likely due to personal issues, at least for groff). So just to come here and say that a 100-byte limit is surely not large enough to hold a "modern" URL with all the bells and whistles, plus possibly an ID= to identify it in the document. So maybe either exclude OSC-8, or raise the limit for it.

[1] https://gist.github.com/egmontkob/eb114294efbcd5adb1944c9f3cb5feda
[2] https://www.sdaoden.eu/code.html#mdocmx

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From mark at kli.org Mon Jan 9 20:13:05 2023
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 9 Jan 2023 21:13:05 -0500
Subject: Re: “plain text styling”…
In-Reply-To: <20230107133329.00007e18@secarica.ro>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro>
Message-ID: <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com>

On 1/7/23 06:37, Cristian Secară via Unicode wrote:
> On Thu, 5 Jan 2023 01:53:40 +0100, Kent Karlsson via Unicode wrote:
>
>> More or less regularly there are (informal) requests on this list for
>> encoding (new) control codes or control code sequences for text
>> styling (like bold, italics, text colour, …) also for “plain text”.
> This seems to overlook that a "plain text" subjected to such torment can no longer be called "plain".

That was sort of my question at the outset. It doesn't make sense to call this "plain text" anymore, when it's formatted and styled. Styling is almost the *definition* of non-plain text. Unicode is all about plain text, where characters represent glyphs (or spaces) that represent text. There are some exceptions to this:

1. Use of ZWJ/ZWNJ to affect shaping/ligaturing. I do not consider the shaping/ligaturing itself to be an exception; that's just characters/glyphs affecting one another.

2. BiDi controls, and BiDi in general. (Does strong directionality count as "formatting," especially with regard to LRM/RLM characters, or is it "just characters/glyphs affecting one another"? Not sure.) Stuff like enabling/disabling local digits and whatever is related.

3. Emoji vs text presentation.

4. "Extreme" ligaturing involving emoji ZWJ sequences, regional tags becoming flags, and other pseudo-encoding.

Are there other exceptions? There are probably things with CGJ which fall into the same category as #1, tweaking the interactions of adjacent characters/glyphs. Is there really anything like the kind of formatting you're talking about that we have considered "plain text"? Perhaps #3 is closest.

Mind you, I think improving and upgrading ECMA-48 is a dandy idea, and your suggestions for it are as good as any I've seen (which is faint praise because I haven't seen any, but even in my own opinion, your ideas are pretty good). And using it in "text" files is a thing people have already been doing and will continue to do, though it is a bit of an abuse of the term "text file." But I still don't really see what it has to do with Unicode. What would you have Unicode do? Define a whole set of "formatting commands" as part of the Unicode standard?

I think your ideas are good and I'd support them (mostly), just that this isn't the place that decides such things.

~mark

From mark at kli.org Mon Jan 9 20:19:40 2023
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 9 Jan 2023 21:19:40 -0500
Subject: Re: “plain text styling”…
In-Reply-To:
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <8458214e-170b-7441-e6c7-737d4d68c565@shoulson.com>
Message-ID: <5799dfff-f0c5-c331-6d4a-e5e819be897e@shoulson.com>

On 1/4/23 21:06, Doug Ewell via Unicode wrote:
> Mark E. Shoulson replied to Kent Karlsson:
>
>>> It is, however, a while ago since the last update to ECMA-48, and
>>> that shows. I’ve compiled a proposed update for the text styling part
>>> of ECMA-48:
>>> https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2022.pdf
>> Actually not necessarily a bad idea, at least at first browsing, but
>> it's kind of out of scope for Unicode, isn't it? It sounds like an
>> update to ECMA-48 (which isn't part of Unicode), and they're the
>> people you'd have to convince.
> Actually, Kent's document does include updates and clarifications that are specific to Unicode. So there is certainly something for readers of this list.

So I guess maybe we should restrict the discussion to those "updates and clarifications that are specific to Unicode" which you mention. What aspects would you consider those to be? Things like what characters are valid to use in the codes or something?

~mark

From sosipiuk at gmail.com Mon Jan 9 21:12:48 2023
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Tue, 10 Jan 2023 03:12:48 +0000
Subject: “plain text styling”…
In-Reply-To: <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com>
References: <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com>
Message-ID: <1673319383761.781522686.1862251621@gmail.com>

On Monday, 09 January 2023, 21:13:05 (-05:00), Mark E. Shoulson via Unicode wrote:
>
> I think your ideas are good and I'd support them (mostly), just that this isn't the place that decides such things.
>

As I understand it, this mailing list isn't a place that decides anything. ECMA-48, as ISO/IEC 6429, falls within the scope of the JTC 1/SC 2 committee, which is also responsible for ISO/IEC 10646, and we know what that is. That makes it, if not Unicode-related, then at least Unicode-adjacent. A moderator may clarify, but I think this is sufficiently on-topic and interesting to a large portion of subscribers here. It certainly is to me.

From doug at ewellic.org Mon Jan 9 23:09:51 2023
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 10 Jan 2023 05:09:51 +0000
Subject: RE: “plain text styling”…
In-Reply-To: <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com>
Message-ID:

Mark E. Shoulson wrote:

> That was sort of my question at the outset. It doesn't make sense to
> call this "plain text" anymore, when it's formatted and styled.
> Styling is almost the *definition* of non-plain text. Unicode is all
> about plain text, where characters represent glyphs (or spaces) that
> represent text. There are some exceptions to this: [...]
>
> 3. Emoji vs text presentation.
>
> 4. "Extreme" ligaturing involving emoji ZWJ sequences, regional tags
> becoming flags, and other pseudo-encoding.

I would actually consider things like bold, italics, and color to be less of an affront to “plain text” than an emoji presentation form or a sequence that adds up to “woman firefighter with medium-dark skin tone.” Granted, ECMA-48 can be used for effects that are less plain-texty than bold, italics, and color.

> So I guess maybe we should restrict the discussion to those "updates
> and clarifications that are specific to Unicode" which you mention.
> What aspects would you consider those to be? Things like what
> characters are valid to use in the codes or something?

Well, for one, redefining ECMA-48 in terms of Unicode characters instead of bytes, so that (say) one can have UTF-16 with styling, where the Escape character and the bracket and all that are 16 bits wide.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From asmusf at ix.netcom.com Tue Jan 10 00:07:22 2023
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 9 Jan 2023 22:07:22 -0800
Subject: Re: “plain text styling”…
In-Reply-To:
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com>
Message-ID: <91bd4eba-6205-633a-c21d-989fe4847f04@ix.netcom.com>

On 1/9/2023 9:09 PM, Doug Ewell via Unicode wrote:
>> 3. Emoji vs text presentation.

to me that's more clearly pseudo-encoding than some of the other things now possible with emoji. It's because the wrong presentation is nearly always really wrong, so there's no common fallback.

And add to that, that the introduction of the wrong default made existing applications and texts suddenly fail, and you have one of the worst blunders in Unicode's encoding history.

>>
>> 4. "Extreme" ligaturing involving emoji ZWJ sequences, regional tags
>> becoming flags, and other pseudo-encoding.
> I would actually consider things like bold, italics, and color to be less of an affront to “plain text” than an emoji presentation form or a sequence that adds up to “woman firefighter with medium-dark skin tone.” Granted ECMA-48 can be used for effects that are less plain-texty than bold, italics, and color.
>

In some ways most of the emoji sequences are really more akin to making new characters by adding diacritic marks, or making new shapes in context, the way shapes fuse in Indic conjuncts. A skintone in some sense has more similarity to a diacritic on a vowel; just because it's not a mark, but a shade, doesn't erase the similarity. The whole visual design space for emoji is different.
While color is simply an attribute on text, skintone hews closer to a semantic component in the way it works. The same goes for other colors as well: a "black cat" and a generic kitty have distinct, if overlapping, semantic space, and on the level of an individual symbol.

The concept of semantic ligatures, like the female astronaut, is interesting. It's a departure from purely graphical constructs like stacks, conjuncts and ligatures, but while most Latin ligatures are optional, many conjuncts are not, and using a fallback will alter meaning, again on the individual grapheme level.

Formatting / styling to me is distinguished by something that's conceptually always applied to a run of text, and usually not on runs of length one.

The main exception to that was mathematical notation, and we opted to make a principled exception, precisely because semantic mapping to highly specific shapes for an individual symbol is or should not be the task of "styling".

Flag sequences and the like are true examples of pseudo coding: introducing a scheme that maps arbitrary code point sequences to a symbol in a way that depends on definitions maintained outside the Unicode Standard. It's the clearest case of injecting another character set (or a lego system for representing one) into the Standard that I've seen.

We could have done the same with three-letter codes for currency symbols, but we didn't, and that marks the difference.

A./

From asmusf at ix.netcom.com Tue Jan 10 00:22:46 2023
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 9 Jan 2023 22:22:46 -0800
Subject: Re: “plain text styling”…
In-Reply-To: <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com>
Message-ID:

On 1/9/2023 6:13 PM, Mark E.
Shoulson via Unicode wrote:
> On 1/7/23 06:37, Cristian Secară via Unicode wrote:
>> On Thu, 5 Jan 2023 01:53:40 +0100, Kent Karlsson via Unicode
>> wrote:
>>
>>> More or less regularly there are (informal) requests on this list for
>>> encoding (new) control codes or control code sequences for text
>>> styling (like bold, italics, text colour, …) also for “plain text”.
>> This seems to overlook that a "plain text" subjected to such torment
>> can no longer be called "plain".
>
> That was sort of my question at the outset. It doesn't make sense to
> call this "plain text" anymore, when it's formatted and styled.
> Styling is almost the *definition* of non-plain text. Unicode is all
> about plain text, where characters represent glyphs (or spaces) that
> represent text. There are some exceptions to this:
>

I concur, and my conclusion is that an ECMA-48 data stream is not plain text. It's just a different type of markup language, where there's less overlap between the character-subsets for the syntax characters and the content characters.

> Mind you, I think improving and upgrading ECMA-48 is a dandy idea, and
> your suggestions for it are as good as any I've seen (which is faint
> praise because I haven't seen any, but even from my own opinion, your
> ideas are pretty good.) And using it in "text" files is a thing
> people have already been doing and will continue to do, though it is a
> bit of an abuse of the term "text file." But I still don't really see
> how it has to do with Unicode. What would you have Unicode do?
> Define a whole set of "formatting commands" as part of the Unicode
> standard?

A very reasonable question to ask is: what changes if the content character-subset changes to something that maps Unicode (with a few exceptions either disallowed or reserved for exclusive use in syntax). There's certainly an audience here that understands the question and may have useful feedback.
>
> I think your ideas are good and I'd support them (mostly), just that
> this isn't the place that decides such things.
>

However, as pointed out repeatedly and in different ways, real progress to where this effort produces something that is actually useful (not just theoretically usable), comes from involving people and teams that have an interest in wanting to conform to (and implement) such an updated standard.

ECMA-48 originally came out of ECMA, which, like Unicode, is (or was?) a forum that is based in and supported by industry and implementers. ECMA's preferred method to launch standards was to get them started and then pass them off to ISO at some stage of completeness.

That approach avoids design efforts being driven by people who have no stake in the details because they don't or can't be part of the implementation and rollout of software that provides these new features to users.

Unicode, we are all agreed, cannot be that forum, because styled text is not part of the remit, and neither is solving every possible extension of some other specifications to more fully use Unicode.

So, while interested people can give well-meaning feedback, we can't really help move this forward - not unless we happen to also be part of some other organizations.

A./

From liste at secarica.ro Tue Jan 10 19:05:27 2023
From: liste at secarica.ro (Cristian Secară)
Date: Wed, 11 Jan 2023 03:05:27 +0200
Subject: “plain text styling”…
In-Reply-To: <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se>
Message-ID: <20230111030516.00004933@secarica.ro>

On Sun, 8 Jan 2023 15:15:21 +0100, Kent Karlsson via Unicode wrote:

> The point is that the “protocol” is at plain text level.
That is why
> ECMA-48 styling can work for applications like terminal emulators,
> where higher-level protocols, like HTML, are out of the question.

By human convention, yes. From an abstract technical perspective, whatever protocol and syntax is used, in the end it comes down to just an ON/.../OFF switch.

> The SMS (and cell broadcast) 7-bit character encodings (there is a
> handful of them) all have just four “control codes”: CR, LF, FF, and
> SS2 (misnamed(!) as ESC). There is no ESC character nor any CSI
> character.

Actually, the GSM 7 bit default alphabet contains the CR, LF and ESC codes, placed at their "traditional" hex positions (i.e. 0x0D, 0x0A and 0x1B respectively). A single ESC is used to 'trigger' the extension of the GSM 7 bit default alphabet or a character from a national language single shift table. It is the extension of the GSM 7 bit default alphabet where a 0x1B 0x0A sequence generates 0x0C code (FF, i.e. Form Feed, aka Page Break) and where a 0x1B 0x1B sequence generates another 0x1B code (SS2, which is "reserved for the extension to another extension table").

> So SMS and cell broadcast messages are out of scope for that simple
> reason.

Probably now useless and out of question in year 2023 for practical reasons, but – in theory – future revisions of the 3GPP TS 23.038 standard can include whatever character might be needed in those reserved-for-future-expansion places.

*
Back on topic: funny how the not-so-distant past is so quickly forgotten: during end 198x / beginning 199x period of time I used extensively and with great success a lot of "plain text styling" on at least two impact printers (one being a Citizen 120D+, which I still have today). While in direct print mode (as opposed to graphics mode), there were a lot of font styles modifiers for the printing result (well, a lot for that time), triggered with ESC or CTRL sequences.

Examples:
ESC E / ESC F > sets / cancels emphasized print
ESC G / ESC H > sets / cancels doublestrike print
ESC 4 / ESC 5 > sets / cancels italic character (Epson only)
CTRL-O / CTRL-R > sets / cancels compressed print
ESC k 0 > sets Courier character pitch
ESC k 1 > sets Citizen Display character pitch
etc.

Then, in the word processor I used at the time, these codes were allocated to visual control letters or symbols specific to that word processor and ready to be inserted, where required, during text editing.

This is what a code-controlled printing looked like in 8 bit computing (Z80-based):
https://www.secarica.ro/misc/text_print_style_via_ctrl_codes_-_tw_cpc.png
https://www.secarica.ro/misc/text_print_style_via_ctrl_codes_-_tw_zxs.png

Even if such a text was no longer "plain", for me that was just "text", with no particular type designation and no desire to give one. In today's text editors, a text containing such escape codes will display some random garbage in those places, but they can be easily removed (or even converted to whatever modern-day styling syntax) with a Python script or something similar.

Cristi

--
Cristian Secară
https://www.secarica.ro

From kent.b.karlsson at bahnhof.se Wed Jan 11 06:25:34 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 11 Jan 2023 13:25:34 +0100
Subject: Re: “plain text styling”…
In-Reply-To: <20230111030516.00004933@secarica.ro>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <20230111030516.00004933@secarica.ro>
Message-ID: <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se>

> On 11 Jan 2023, at 02:05, Cristian Secară via Unicode wrote:
>
> On Sun, 8 Jan 2023 15:15:21 +0100, Kent Karlsson via Unicode wrote:
>
>> The point is that the “protocol” is at plain text level.
That is why
>> ECMA-48 styling can work for applications like terminal emulators,
>> where higher-level protocols, like HTML, are out of the question.
>
> By human convention, yes. From an abstract technical perspective, whatever protocol and syntax is used, in the end it comes down to just an ON/.../OFF switch.

Yes, but there are different kinds of on/off switches, syntaxwise. Some fit in an otherwise plain text context, others don’t.

>> The SMS (and cell broadcast) 7-bit character encodings (there is a
>> handful of them) all have just four “control codes”: CR, LF, FF, and
>> SS2 (misnamed(!) as ESC). There is no ESC character nor any CSI
>> character.
>
> Actually, the GSM 7 bit default alphabet contains the CR, LF and ESC codes,

(There is a handful of 7-bit codepages for SMS and cell broadcast messages. Not only for a kind of “extended ASCII”, but for several Indic scripts, and one for Arabic.)

Actually there is no ESC. There are CR, LF, FF. And then a code ***called*** ESC, but it is not at all ESC, it is SS2, SINGLE SHIFT 2, it works exactly as SS2. There is no real ESC character, hence no escape sequences, no CSI character (not even as an ESC sequence) and hence no control sequences. Teletext has a similar issue, where the ESC actually is an SS2.

> placed at their "traditional" hex positions (i.e. 0x0D, 0x0A and 0x1B respectively). A single ESC is used to 'trigger' the extension of the GSM 7 bit default alphabet or a character from a national language single shift table. It is the extension of the GSM 7 bit default alphabet where a 0x1B 0x0A sequence generates 0x0C code (FF, i.e. Form Feed, aka Page Break) and where a 0x1B 0x1B sequence generates another 0x1B code (SS2, which is "reserved for the extension to another extension table").

That would be an SS3?

>
>> So SMS and cell broadcast messages are out of scope for that simple
>> reason.
>
> Probably now useless and out of question in year 2023 for practical reasons, but – in theory –
future revisions of the 3GPP TS 23.038 standard can include whatever character might be needed in those reserved-for-future-expansion places.
>
> *
> Back on topic: funny how the not-so-distant past is so quickly forgotten: during end 198x / beginning 199x period of time I used extensively and with great success a lot of "plain text styling" on at least two impact printers (one being a Citizen 120D+, which I still have today). While in direct print mode (as opposed to graphics mode), there were a lot of font styles modifiers for the printing result (well, a lot for that time), triggered with ESC or CTRL sequences.
>
> Examples:
> ESC E / ESC F > sets / cancels emphasized print
> ESC G / ESC H > sets / cancels doublestrike print
> ESC 4 / ESC 5 > sets / cancels italic character (Epson only)
> CTRL-O / CTRL-R > sets / cancels compressed print
> ESC k 0 > sets Courier character pitch
> ESC k 1 > sets Citizen Display character pitch
> etc.

This cannot be in any of the SMS/cell broadcast charsets, since they have no (real) ESC; the "ESC" of SMS 7-bit charsets is actually a misnamed SS2.

Nor is this ECMA-48. But yes, historically there have been other control/escape sequence definitions for various types of equipment from different manufacturers. I think ECMA-48 was, in part, intended to bring some order to that old mess. I see no reason to bring back various messy definitions. But ECMA-48 control sequences are still relevant, and still used. (I see ECMA-48 styled text every day (that is not my doing)… In a modern setting!)

/Kent K

> Then, in the word processor I used at the time, these codes were allocated to visual control letters or symbols specific to that word processor and ready to be inserted, where required, during text editing.
>
> This is what a code-controlled printing looked like in 8 bit computing (Z80-based):
> https://www.secarica.ro/misc/text_print_style_via_ctrl_codes_-_tw_cpc.png
> https://www.secarica.ro/misc/text_print_style_via_ctrl_codes_-_tw_zxs.png
>
> Even if such a text was no longer "plain", for me that was just "text", with no particular type designation and no desire to give one. In today text editors, a text containing such escape codes will display some random garbage in those places, but they can be easily removed (or even converted to whatever modern-days styling syntax) with a Python script or something similar.
>
> Cristi
>
> --
> Cristian Secară
> https://www.secarica.ro
>

From sosipiuk at gmail.com Wed Jan 11 10:20:01 2023
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Wed, 11 Jan 2023 16:20:01 +0000
Subject: “plain text styling”…
In-Reply-To: <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se>
References: <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se>
Message-ID: <1673452505839.1379722402.1114942410@gmail.com>

On Wednesday, 11 January 2023, 07:25:34 (-05:00), Kent Karlsson via Unicode wrote:
>
> Yes, but there are different kinds of on/off switches, syntaxwise. Some fit in an otherwise plain text context, others don’t.
>

I still think the distinction you're drawing – that codes below U+0020 are not "plain text" – is arbitrary. What special quality do they have? Can't be typed on a keyboard? Don't have visible glyphs? Affect the display of other characters? Are default-ignorable in Unicode? None of these things are unique to them.

"Plain text" is a loose definition because "formatted text" is equally loose. Context matters. It reminds me of "paying cash", which can mean different things when you're buying a hamburger and buying a corporation.

>
> Actually there is no ESC. There are CR, LF, FF.
And then a code ***called*** ESC, but it is not at all ESC, it is SS2, SINGLE SHIFT 2, it works exactly as SS2.
>

It rather works like SS1, which we sadly never got in ECMA-48 or ECMA-35. Then SS2 actually is SS2.

From beckiergb at gmail.com Wed Jan 11 14:11:01 2023
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Wed, 11 Jan 2023 12:11:01 -0800
Subject: Re: “plain text styling”…
In-Reply-To: <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <20230111030516.00004933@secarica.ro> <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se>
Message-ID:

> Actually there is no ESC. There are CR, LF, FF. And then a code ***called*** ESC, but it is not at all ESC, it is SS2, SINGLE SHIFT 2, it works exactly as SS2.

Just because the ESC in GSM does not work the same way as the ESC in ECMA-48 does not mean it's not ESC. By definition, any control code that changes the meaning of the characters after it can be called ESC. You can't just apply the semantics of ECMA-48 to GSM and then claim ESC is "misnamed" because the semantics don't match; GSM is not ECMA-48.

From wjgo_10009 at btinternet.com Wed Jan 11 08:12:45 2023
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 11 Jan 2023 14:12:45 +0000 (GMT)
Subject: Re: “plain text styling”…
Message-ID: <7c193a10.39f80.185a12d1842.Webtop.96@btinternet.com>

Kent Karlsson wrote as follows.

> But yes, historically there have been other control/escape sequence
> definitions for various types of equipment from different
> manufacturers.

I remember that back in the late 1980s or early 1990s in the computing laboratory where I worked at the time, where there were at that time mostly ordinary monochrome text terminals each linked to a mainframe computer, a then rather expensive colour graphics terminal was purchased.

This connected to the mainframe computer in exactly the same way as an ordinary monochrome text terminal.

The way that one used the colour graphics was by programming, in a program written in, say, Pascal, that was compiled and run on the mainframe computer, software that would send a (base 10 used then and here) character 27 followed by a sequence of characters.

If I remember correctly, one such sequence started with a [ character and ended with a ] character. I do not remember whether each sequence type was between [ and ] or whether sequences each started and finished each differently, such as ( to ) and { to }.

I think some graphics commands might have been just a character 27 followed by a single character so as, to, say, change the colour of the drawing pen to a particular preset colour, but I am not sure of that.

I remember that it was very straightforward to use those sequences in a computer program and various people produced some very good results using the colour graphics terminal.

I do not know whether the sequences used were specific to that particular product or if they were part of a standard, whether a de jure standard or a de facto standard, or just some informal quasi-standard generated in a Usenet newsgroup or the like.

William Overington

Wednesday 11 January 2023

From kent.b.karlsson at bahnhof.se Thu Jan 12 10:57:44 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 12 Jan 2023 17:57:44 +0100
Subject: Re: “plain text styling”…
In-Reply-To: <91bd4eba-6205-633a-c21d-989fe4847f04@ix.netcom.com>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com> <91bd4eba-6205-633a-c21d-989fe4847f04@ix.netcom.com>
Message-ID: <3C79366F-20C7-4C14-BF24-7096CDAA3B19@bahnhof.se>

> On 10 Jan 2023, at 07:07, Asmus Freytag via Unicode wrote:
>
> On 1/9/2023 9:09 PM, Doug Ewell via Unicode wrote:
>>> 3. Emoji vs text presentation.
> to me that's more clearly pseudo-encoding than some of the other things now possible with emoji. It's because the wrong presentation is nearly always really wrong, so there's no common fallback.
>
> And add to that, that the introduction of the wrong default made existing applications and texts suddenly fail, and you have one of the worst blunders in Unicode's encoding history.

I currently try to stay out of emoji stuff. Mostly. But I must point out that labelling (the emoji for) poisonous mushrooms, and some mushrooms are deadly poisonous, as “food” or “vegetables” is highly inappropriate.

> […]
> Formatting / styling to me is distinguished by something that's conceptually always applied to a run of text, and usually not on runs of length one.
>

Technically, yes, but conceptually no. Styling can for the most part be thought of as applying to individual characters. This is in contrast to such things as bidi, which, even without bidi controls, just based on the bidi categories of individual characters, must be seen as applying to runs of characters, due to the resulting reordering. (Sorry for the long sentence.)

> The main exception to that was mathematical notation, and we opted to make a principled exception, precisely because semantic mapping to highly specific shapes for an individual symbol is or should not be the task of "styling".
>

1. That styling(!) is lost when normalizing to NFKD or NFKC.
2. MathML still considers it a styling. LaTeX has always considered it a styling.
3. It is not general enough. (See my proposal on math expression representation.)

/Kent K

>
> Flag sequences and the like are true examples of pseudo coding. Introducing a scheme that maps arbitrary code point sequences to a symbol in a way that depends on definitions maintained outside the Unicode Standard. It's the clearest case of injecting another character set (or a lego system to representing one) into the Standard that I've seen.
>
> We could have done the same with three-letter codes for currency symbols, but we didn't, and that marks the difference.
>
> A./

From kent.b.karlsson at bahnhof.se Thu Jan 12 10:57:42 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 12 Jan 2023 17:57:42 +0100
Subject: Re: “plain text styling”…
In-Reply-To: <1673452505839.1379722402.1114942410@gmail.com>
References: <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se> <1673452505839.1379722402.1114942410@gmail.com>
Message-ID: <23D49799-1F25-451F-9131-EEC8379E9A46@bahnhof.se>

> On 11 Jan 2023, at 17:20, Sławomir Osipiuk wrote:
>
> On Wednesday, 11 January 2023, 07:25:34 (-05:00), Kent Karlsson via Unicode wrote:
>>
>> Yes, but there are different kinds of on/off switches, syntaxwise. Some fit in an otherwise plain text context, others don’t.
>>
> I still think the distinction you're drawing – that codes below U+0020 are not "plain text" – is arbitrary.

I did not quite say that. The thing is that the escape sequences and control sequences are (intended to be) “default ignorable”.
But ECMA-48 was developed long before the formal concept of “default ignorable” (which is a Unicode concept, and Unicode does not say much about C0 and C1) was invented. And there are exceptions (like LF, HT). And… few applications, other than terminal emulators, actually handle them as “default ignorable” beyond the ESC or CSI itself (which are in practice default ignorable). So, it is imperfect, but that is the basic idea and what is given in an already existing standard (instead of defining something completely new that is “default ignorable”).

Using this, there is also no need for some printable characters to by necessity be represented as a character reference (in HTML, for instance, a real ? reference, like &lt;).

> What special quality do they have? Can't be typed on a keyboard? Don't have visible glyphs? Affect the display of other characters? Are default-ignorable in Unicode? None of these things are unique to them.
>
> "Plain text" is a loose definition because "formatted text" is equally loose.

Note that the subject line for this thread has quote marks for that reason.

Ask anyone 50 or so years back “show me an example of plain text” and likely they would have pointed out any ordinary newspaper article (printed in plain black ink) without photos, no matter if it used bold, italics, or different sized characters in the text.

> Context matters. It reminds me of "paying cash", which can mean different things when you're buying a hamburger and buying a corporation.
>
>>
>> Actually there is no ESC. There are CR, LF, FF. And then a code ***called*** ESC, but it is not at all ESC, it is SS2, SINGLE SHIFT 2, it works exactly as SS2.
>>
>
> It rather works like SS1, which we sadly never got in ECMA-48 or ECMA-35. Then SS2 actually is SS2.

And SS1 would be a no-op? A bit like NULL was intended to be?

SS2 is “jump to secondary codepage”, SS3 is “jump to tertiary codepage”.
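
As a rough illustration of the “default ignorable” idea in this message: an application that does not interpret styling could skip whole ECMA-48 control sequences rather than display them. The sketch below assumes the usual control sequence syntax (CSI, then parameter bytes 0x30–0x3F, then intermediate bytes 0x20–0x2F, then one final byte 0x40–0x7E). The function name `strip_control_sequences` is made up for this example, and other escape sequences and control strings are deliberately not handled.

```python
import re

# CSI is either ESC [ (7-bit form) or U+009B (8-bit C1 form).
# Parameter bytes: 0x30-0x3F, intermediate bytes: 0x20-0x2F, final byte: 0x40-0x7E.
CSI_SEQUENCE = re.compile(r"(?:\x1b\[|\x9b)[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]")

def strip_control_sequences(text: str) -> str:
    """Drop CSI control sequences; every other character (LF, HT, ...) is kept."""
    return CSI_SEQUENCE.sub("", text)

# SGR 1 / 22 switch bold on / off, SGR 3 / 23 switch italic on / off.
styled = "plain \x1b[1mbold\x1b[22m and \x1b[3mitalic\x1b[23m\n"
print(strip_control_sequences(styled), end="")  # plain bold and italic
```

Note that the printable text itself needs no escaping: only the sequences introduced by ESC [ or CSI are touched.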

/Kent K

From kent.b.karlsson at bahnhof.se Thu Jan 12 10:57:39 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 12 Jan 2023 17:57:39 +0100
Subject: Re: “plain text styling”…
In-Reply-To:
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <20230111030516.00004933@secarica.ro> <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se>
Message-ID: <3ED72385-6611-4DEF-A342-59F0EBA082C3@bahnhof.se>

> On 11 Jan 2023, at 21:11, Rebecca Bettencourt via Unicode wrote:
>
> > Actually there is no ESC. There are CR, LF, FF. And then a code ***called*** ESC, but it is not at all ESC, it is SS2, SINGLE SHIFT 2, it works exactly as SS2.
>
> Just because the ESC in GSM does not work the same way as the ESC in ECMA-48 does not mean it's not ESC.

You can call it MAMA if you like (but that would also be confusing). It still works just like SS2, not at all like ESC, not even close (i.e. not even like the ESC of old equipment, like that Cristian referred to).

> By definition, any control code that changes the meaning of the characters after it can be called ESC. You can't just apply the semantics of ECMA-48 to GSM

Cristian tried to do that (to some extent), and I said no…

> and then claim ESC is "misnamed"

I did. It is misnamed there and for Teletext. And yes, I know there were other ESC sequence definitions before ECMA-48, which still were ESC sequences, not “jumping” to another codepage.

> because the semantics don't match; GSM is not ECMA-48.

That’s what I said (though I said SMS and cell broadcast 7-bit charsets; GSM (2G) is somewhat outdated, we're (mostly) on 4G and 5G now).

/Kent K

From doug at ewellic.org Thu Jan 12 11:26:14 2023
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 12 Jan 2023 17:26:14 +0000
Subject: RE: “plain text styling”…
In-Reply-To: <3C79366F-20C7-4C14-BF24-7096CDAA3B19@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <925a5d79-557d-24ec-1089-9ed0f7952681@shoulson.com> <91bd4eba-6205-633a-c21d-989fe4847f04@ix.netcom.com> <3C79366F-20C7-4C14-BF24-7096CDAA3B19@bahnhof.se>
Message-ID:

Kent Karlsson replied to Asmus Freytag:

>> The main exception to that was mathematical notation, and we opted to
>> make a principled exception, precisely because semantic mapping to
>> highly specific shapes for an individual symbol is or should not be
>> the task of "styling".
>
> 1. That styling(!) is lost when doing normalizing to NFKD or NFKC.

To be fair, a lot of content may be lost when normalizing to NFKD and NFKC. For example, superscript and subscript digits are normalized to Basic Latin, so 2³ becomes 23. I think it is safe to say that is an important semantic change, and UAX #15 agrees:

“Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate. They can be applied more freely to domains with restricted character sets.”
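
The superscript example is easy to check. A minimal sketch using Python's standard `unicodedata` module (the sample strings are chosen here purely for illustration):

```python
import unicodedata

# U+00B3 SUPERSCRIPT THREE has a compatibility decomposition to the digit 3,
# so the K forms fold "2³" into "23", while NFC leaves it alone.
text = "2\u00b3"

nfc = unicodedata.normalize("NFC", text)
nfkc = unicodedata.normalize("NFKC", text)

print(repr(nfc))   # '2³'
print(repr(nfkc))  # '23'

# The mathematical alphanumerics are affected the same way:
# U+1D400 MATHEMATICAL BOLD CAPITAL A becomes a plain "A".
bold_a = "\U0001d400"
print(unicodedata.normalize("NFKC", bold_a))  # A
```

Styling carried by ECMA-48 control sequences, by contrast, is not touched by normalization, since control characters have no decompositions.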

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From doug at ewellic.org Thu Jan 12 11:39:20 2023
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 12 Jan 2023 17:39:20 +0000
Subject: RE: “plain text styling”…
In-Reply-To: <23D49799-1F25-451F-9131-EEC8379E9A46@bahnhof.se>
References: <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se> <1673452505839.1379722402.1114942410@gmail.com> <23D49799-1F25-451F-9131-EEC8379E9A46@bahnhof.se>
Message-ID:

Kent Karlsson replied to Sławomir Osipiuk:

>> I still think the distinction you're drawing – that codes below
>> U+0020 are not "plain text" – is arbitrary.
>
> [...]
>
> Using this, there is also no need for some printable characters to by
> necessity be represented as a character reference (in HTML, for
> instance, a real ? reference, like &lt;).

I do feel this distinction is important. Text that most of us would consider “plain text” – which could include just about any printable character, but no C0 control characters other than CR, LF, HT, and maybe a couple of others – needs no escaping or other modification to conform to the ECMA-48 model.

This is not entirely unlike the observation that plain ASCII text is also valid UTF-8.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From sosipiuk at gmail.com Thu Jan 12 12:16:40 2023
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 12 Jan 2023 18:16:40 +0000
Subject: “plain text styling”…
In-Reply-To: <23D49799-1F25-451F-9131-EEC8379E9A46@bahnhof.se>
References: <23D49799-1F25-451F-9131-EEC8379E9A46@bahnhof.se>
Message-ID: <1673544325901.1651898979.2689869098@gmail.com>

On Thursday, 12 January 2023, 11:57:42 (-05:00), Kent Karlsson wrote:
>
> Note that the subject line for this thread has quote marks for that reason.
>

Fair enough. But the point stands that what counts as a special character is decided by context.
A "<" is not "plain" when writing HTML, just as ESC is not plain in an ECMA-48-aware context. In HTML you must represent "<" as "&lt;" but what if you want to represent ESC literally in ECMA-48? The most likely response is "you can't" or "why would you want that?" but that already presupposes that "<" is more plain than ESC – that wanting to be able to include it is more "legitimate" – and that presupposition is more historical baggage than a solid definition.

I'll concede that, by popular opinion (as Doug just pointed out in another message), the C0s are perceived as less plain-text than the common printable characters.

>
> And SS1 would be a no-op? A bit like NULL was intended to be?
>

No, you're thinking of SS0, but even that wouldn't necessarily be a no-op. In the framework provided by ECMA-35 (ISO 2022) there are (up to) four "code pages" for graphic characters: G0, G1, G2, and G3. SS2 invokes a character from G2, and SS3 from G3. G1 is typically invoked simply by using the high bit in 8-bit environments (e.g. the extended part of extended ASCII), so a single-shift would be mostly useless. In 7-bit environments though, invoking from G1 needs locking shifts. Hence my musing that an SS1 would be nice to have if you're dealing in 7 bits only and ECMA-35 is incomplete without it.

> SS2 is “jump to secondary codepage”, SS3 is “jump to tertiary codepage”.

SS2 is actually "jump to tertiary" – the wonders of zero-based indexing. The GSM-7 "ESC" functions like what ECMA-48 would call SS1, if it existed there.
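
To make the G0–G3 model concrete, here is a toy decoder sketch for a 7-bit stream. The contents of the four sets are invented placeholders (real sets would be designated by escape sequences, which this sketch skips), and only SO/SI plus the 7-bit forms of SS2 and SS3 (ESC N and ESC O) are handled. No SS1 appears because the standard does not define one, which is the gap described above.

```python
# Toy ISO 2022 / ECMA-35 style decoder for a 7-bit stream.
# The G-set contents are placeholders for illustration only.
G_SETS = {
    0: lambda b: chr(b),            # G0: pretend it is ASCII
    1: lambda b: f"<G1:{b:02X}>",   # G1..G3: just tag the byte with its set
    2: lambda b: f"<G2:{b:02X}>",
    3: lambda b: f"<G3:{b:02X}>",
}
SO, SI, ESC = 0x0E, 0x0F, 0x1B

def decode(data: bytes) -> str:
    out = []
    locked = 0      # set invoked by a locking shift (SI -> G0, SO -> G1)
    single = None   # set invoked for the next graphic character only
    i = 0
    while i < len(data):
        b = data[i]
        if b == SO:
            locked = 1
        elif b == SI:
            locked = 0
        elif b == ESC and i + 1 < len(data) and data[i + 1] in (0x4E, 0x4F):
            single = 2 if data[i + 1] == 0x4E else 3   # ESC N = SS2, ESC O = SS3
            i += 1
        elif 0x21 <= b <= 0x7E:
            out.append(G_SETS[single if single is not None else locked](b))
            single = None   # a single shift covers exactly one character
        else:
            out.append(chr(b))   # SP, DEL and other controls pass through
        i += 1
    return "".join(out)

print(decode(b"A\x1bNBC\x0eD\x0fE"))  # A<G2:42>C<G1:44>E
```

In this model a hypothetical SS1 would be one more branch that sets `single = 1`, giving one-character access to G1 without the SO ... SI pair.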
From liste at secarica.ro Thu Jan 12 12:23:31 2023
From: liste at secarica.ro (Cristian Secară)
Date: Thu, 12 Jan 2023 20:23:31 +0200
Subject: “plain text styling”…
In-Reply-To: <3ED72385-6611-4DEF-A342-59F0EBA082C3@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <20230111030516.00004933@secarica.ro> <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se> <3ED72385-6611-4DEF-A342-59F0EBA082C3@bahnhof.se>
Message-ID: <20230112202331.00002f45@secarica.ro>

On Thu, 12 Jan 2023 17:57:39 +0100, Kent Karlsson via Unicode wrote:

> > Just because the ESC in GSM does not work the same way as the ESC
> > in ECMA-48 does not mean it's not ESC.
>
> You can call it MAMA if you like (but that would also be confusing).
> It still works just like SS2, not at all like ESC, not even close
> (i.e. not even like the ESC of old equipment, like that Cristian
> referred to).

Well, it is the 3GPP 23.038 specification [1] that calls it "ESC" (not me or anyone else here).

As for the "not even like the ESC of old equipment", I am not sure how this is *not* similar:

ESC e gives €
ESC < gives [
ESC ( gives {
... and so on (not that many more, though)

While not mentioned anywhere in the specification, in terms of SS that should probably be SS1 (only with the ESC ESC sequence as SS2).

Anyway, this strictly GSM-specific discussion has become off topic now; what I wanted to say initially was that *in certain cases* – even if not that many, as I can imagine – a ~plain text styling may mislead ordinary users when the physical (low-level) character count matters in something presumed to be strictly plain (as opposed to higher levels of text styling, where even a few dozen characters can go unnoticed, usually due to the nature of the target application).
Cristi

[1] https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=745

--
Cristian Secară
https://www.secarica.ro

From harjitmoe at outlook.com Thu Jan 12 12:44:42 2023
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Thu, 12 Jan 2023 18:44:42 +0000
Subject: Re: “plain text styling”…
In-Reply-To: <23D49799-1F25-451F-9131-EEC8379E9A46@bahnhof.se>
References: <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se> <1673452505839.1379722402.1114942410@gmail.com> <23D49799-1F25-451F-9131-EEC8379E9A46@bahnhof.se>
Message-ID:

Kent Karlsson via Unicode wrote:

> And SS1 would be a no-op? A bit like NULL was intended to be?
>
> SS2 is “jump to secondary codepage”, SS3 is “jump to tertiary codepage”.
>
> /Kent K

---

Not quite. ECMA-48 is written strictly within the confines of ECMA-35, and the semantics of SS2 and SS3 are described in more detail there. But in brief: it specifies that the next code, encoded using either the 0x20–7F range or the 0xA0–FF range (confusingly, it specifies this twice, once for 7-bit encodings (with only the 0x20–7F range) and again for 8-bit encodings (permitting either but not a mixture), but I digress), is sourced from the set designated as the G2 set. This is as opposed to the G0 (Shift In), G1 (Shift Out) or G3 (SS3) sets. So SS1, in the context of ECMA-35, would be a non-locking shift to the same codepage that Shift Out / SO is a locking shift to. Although even SS0 wouldn't be a no-op, since it would allow accessing characters in the G0 set without leaving Shift Out state.

(For reasons I can't quite fathom, both ECMA-35 and ECMA-48 make a punt at pretending that LS1 and LS0 (Locking Shifts 1 and 0) in an 8-bit code aren't the same thing as SO and SI in a 7-bit code, even though they do exactly the same thing and are coded at the same positions.
While I cannot read the committee's mind on that, my only guess is that it's either to better correlate them with LS1R, to clarify that they only operate on ASCII bytes rather than all non-control bytes (as opposed to in EBCDIC, where SO and SI operate on the entire 0x41–FE range), or both.)

I suppose you could think of G2 (the SS2 set) as the "second supplementary codepage", where the Shift Out (G1) set is the "first supplementary codepage". Treating the G0 set as a "primary codepage" versus the three "supplementary codepages" is not /explicitly/ done by ECMA-35, but the G0 set is effectively treated differently than the other three, since it cannot include 0x20 or 0x7F, and cannot be shift-invoked (locking or otherwise) over 0xA0–FF, while the other three can be 96-code sets and can be shift-invoked over either range.

As a sidenote: nominally (in accordance with their derivation from ECMA-43), ISO 8859 encodings have ASCII in G0 and their supplements in G1. In practice, this might not be the case in Unix terminal contexts, since software may expect Shift Out to switch to DEC Special Graphics ("DECgraphics"), which can be worked around by including the supplement in G2 instead, and invoking G2 over 0xA0–FF (i.e. in LS2R state rather than LS1R state).

--Har.

From harjitmoe at outlook.com Thu Jan 12 13:05:36 2023
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Thu, 12 Jan 2023 19:05:36 +0000
Subject: Re: “plain text styling”…
In-Reply-To: <3ED72385-6611-4DEF-A342-59F0EBA082C3@bahnhof.se>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <20230111030516.00004933@secarica.ro> <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se> <3ED72385-6611-4DEF-A342-59F0EBA082C3@bahnhof.se>
Message-ID:

Kent Karlsson via Unicode wrote:

> I did.
> It is misnamed there and for Teletext. And yes, I know there were other ESC sequence definitions before ECMA-48, which still were ESC sequences, not “jumping” to another codepage.

EBCDIC's single-shift is called GE (graphic escape). Where EBCDIC's SI and SO are mostly used for switching between single-byte and double-byte pages in a CJK encoding, GE seems to have been used for accessing a single-byte page of extended symbols (such as code page 310 for APL) while using a more conventional EBCDIC page as the main set. Arguably, the GSM escape is a graphic escape (GE).

From an ECMA-35 perspective, it doesn't really matter if 0x1B in Teletext and GSM is (a) ESC with a different behaviour to that specified in ECMA-35 or (b) something other than ESC. Since ECMA-35 explicitly reserves 0x1B for ESC and forbids C0 sets from redefining it, and also defines the behaviour of ESC including the general structure of ESC sequences (which ECMA-48 conforms to), either is equally non-conformant. In the case of GSM, it is further non-conformant by encoding glyphs over the CL area, which is reserved for C0 controls.

---

> That’s what I said (though I said SMS and cell broadcast 7-bit charsets; GSM (2G) is somewhat outdated, we're (mostly) on 4G and 5G now).

And yet, when I open my (Android 6.0) SMS app, with an active 4G connection, in the UK, and type a ' (ASCII apostrophe) character, it reports I have 159 characters remaining until it has to send a multi-part SMS. When I delete that character and type a ~ (tilde) instead, it reports only 158 characters remaining. When I delete that and type a ` (backtick), it reports only 69 characters remaining. And as one might have guessed, if I delete that and paste in a ?, it reports 68 characters remaining.

The amount of text that fits in 1120 bits under either GSM 7-bit (if within its repertoire) or UTF-16 (otherwise) is still a relevant metric, it seems.

--Har.
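
The counts reported above are consistent with a simple budget calculation, sketched here under explicit assumptions: the app is assumed to use the GSM 7-bit default alphabet with its default extension table when the whole text fits that repertoire and UTF-16 otherwise, with no national-language shift tables and no user data header. The function name `remaining` is made up for the illustration, and U+1F600 merely stands in for the pasted character, which is assumed to lie outside the BMP.

```python
# GSM 7-bit default alphabet (3GPP TS 23.038); position 0x1B is the escape code.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Default extension table: each of these is sent as the escape code plus one more septet.
GSM_EXT = "^{}\\[~]|€\f"

SMS_BITS = 1120  # payload of a single-part SMS (140 octets)


def remaining(text: str) -> int:
    """Characters left in a single-part SMS, as a phone UI might report it."""
    if all(c in GSM_BASIC or c in GSM_EXT for c in text):
        used = sum(2 if c in GSM_EXT else 1 for c in text)  # septets
        return SMS_BITS // 7 - used                         # 160 septets available
    used = len(text.encode("utf-16-be")) // 2               # UTF-16 code units
    return SMS_BITS // 16 - used                            # 70 code units available


for sample in ("'", "~", "`", "\U0001F600"):
    print(repr(sample), remaining(sample))
```

Under these assumptions the four samples yield 159, 158, 69 and 68, matching the figures in the message: the tilde costs two septets because of the escape, the backtick forces the whole message into UTF-16, and a character outside the BMP costs two code units.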
From kent.b.karlsson at bahnhof.se Thu Jan 12 18:59:28 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Fri, 13 Jan 2023 01:59:28 +0100
Subject: Re: “plain text styling”…
In-Reply-To: <20230112202331.00002f45@secarica.ro>
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <20230111030516.00004933@secarica.ro> <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se> <3ED72385-6611-4DEF-A342-59F0EBA082C3@bahnhof.se> <20230112202331.00002f45@secarica.ro>
Message-ID:

> On 12 Jan 2023, at 19:23, Cristian Secară via Unicode wrote:
>
> On Thu, 12 Jan 2023 17:57:39 +0100, Kent Karlsson via Unicode wrote:
>
>>> Just because the ESC in GSM does not work the same way as the ESC
>>> in ECMA-48 does not mean it's not ESC.
>>
>> You can call it MAMA if you like (but that would also be confusing).
>> It still works just like SS2, not at all like ESC, not even close
>> (i.e. not even like the ESC of old equipment, like that Cristian
>> referred to).
>
> Well, it is the 3GPP 23.038 specification [1] that calls it "ESC" (not me or anyone else here).

I know. (I should review the updates done during 2022; the last time I looked closely was in 2020…)

> As for the "not even like the ESC of old equipment", I am not sure how this is *not* similar:
>
> ESC e gives €
> ESC < gives [
> ESC ( gives {
> ... and so on (not that many more, though)

There are other “national language” tables that are more filled. This is wildly different from ESC in “old equipment” as well as ECMA-48. ESC with following character(s) generates “controls”, whereas the above “generates” graphic characters (which is the purpose of SS2 and SS3).

(Yes, I did suggest, in the referenced paper, to use a control sequence for character references… So I am violating the “rule” myself… But I don’t see another way of having character references in ECMA-48 style.)
> While not mentioned anywhere in the specification, in terms of SS that should probably be SS1 (only with the ESC ESC sequence as SS2).

Note that there is no SS1 (nor any SS0). But we do have SS2 and SS3 (both invalid to use with Unicode of course).

/Kent K

> Anyway, this strictly GSM-specific discussion has become off topic now; what I wanted to say initially was that *in certain cases* – even if not that many, as I can imagine – a ~plain text styling may mislead ordinary users when the physical (low-level) character count matters in something presumed to be strictly plain (as opposed to higher levels of text styling, where even a few dozen characters can go unnoticed, usually due to the nature of the target application).
>
> Cristi
>
> [1] https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=745
>
> --
> Cristian Secară
> https://www.secarica.ro
>

From kent.b.karlsson at bahnhof.se Thu Jan 12 19:01:01 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Fri, 13 Jan 2023 02:01:01 +0100
Subject: Re: “plain text styling”…
In-Reply-To:
References: <3ED0F608-F006-4835-A621-85053C4BBB50@bahnhof.se> <20230107133329.00007e18@secarica.ro> <3FE11BCE-EEE2-4036-BACF-62FE9FFFFDC6@bahnhof.se> <20230111030516.00004933@secarica.ro> <0555F007-88DB-4E2B-90E3-3E90FD578F48@bahnhof.se> <3ED72385-6611-4DEF-A342-59F0EBA082C3@bahnhof.se>
Message-ID: <91FCB8E0-CF48-41BE-821C-4B019081BF6B@bahnhof.se>

This is getting too off-topic. But just two small remarks. (After this I will not comment more on SMS stuff in this thread.)

> On 12 Jan 2023, at 20:05, Harriet Riddle via Unicode wrote:
> …
> From an ECMA-35 perspective, it doesn't really matter if 0x1B in Teletext and GSM is (a) ESC with a different behaviour to that specified in ECMA-35 or (b) something other than ESC.
> Since ECMA-35 explicitly reserves 0x1B for ESC and forbids C0 sets from redefining it, and also defines the behaviour of ESC including the general structure of ESC sequences (which ECMA-48 conforms to), either is equally non-conformant. In the case of GSM, it is further non-conformant by encoding glyphs over the CL area, which is reserved for C0 controls.

There is no notion of C0, G0, etc. in these 7-bit charsets. But the 7-bit charsets do have a “secondary codepage” (by another name) and are prepared for having a “tertiary codepage” (but that is not (yet) used).

> ---
>
>> That’s what I said (though I said SMS and cell broadcast 7-bit charsets; GSM (2G) is somewhat outdated, we're (mostly) on 4G and 5G now).
>
>
> And yet, when I open my (Android 6.0) SMS app, with an active 4G connection, in the UK, and type a ' (ASCII apostrophe) character, it reports I have 159 characters remaining until it has to send a multi-part SMS. When I delete that character and type a ~ (tilde) instead, it reports only 158 characters remaining. When I delete that and type a ` (backtick), it reports only 69 characters remaining. And as one might have guessed, if I delete that and paste in a ?, it reports 68 characters remaining.
>
> The amount of text that fits in 1120 bits under either GSM 7-bit (if within its repertoire) or UTF-16 (otherwise) is still a relevant metric, it seems.

Backwards compatibility is a big issue here of course. If no new-fangled extension is used, everything should work as before also for “old” user equipment (usually mobile phones), both w.r.t. the charsets and w.r.t. the protocol itself. If something new-fangled is used, “old” equipment may display “mojibake”. And, if the text cannot be represented in one of the 7-bit charsets (there are now several), a switch to “UCS-2” (3GPP still does not call it “UTF-16BE”…) can be done (though the 3GPP standards do not require that, it is application defined).

/Kent K

> --Har.
From pgcon6 at msn.com Fri Jan 13 15:31:19 2023
From: pgcon6 at msn.com (Peter Constable)
Date: Fri, 13 Jan 2023 21:31:19 +0000
Subject: New registration and a request for a new smiley
In-Reply-To: <0E56DA9D-9A60-420E-AE4B-BC0B4128015C@gmail.com>
References: <0E56DA9D-9A60-420E-AE4B-BC0B4128015C@gmail.com>
Message-ID:

Valeria,

You might find this useful: https://unicode.org/emoji/proposals.html

Peter

-----Original Message-----
From: Unicode On Behalf Of Valeria Greco via Unicode
Sent: December 24, 2022 10:28 AM
To: unicode at corp.unicode.org
Subject: New registration and a request for a new smiley

Hi everyone,
I’m very happy to subscribe to this mailing list. I would like to know if the Consortium thinks to create a Nativity emoji. In the emoji requests I don’t find anything.
Merry Christmas to everyone,
Valeria