From unicode at unicode.org Sat Jun 1 16:03:30 2019
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 1 Jun 2019 15:03:30 -0600
Subject: Proposal to extend the U+1F4A9 Symbol
Message-ID: <000001d518bd$779c5f80$66d51e80$@ewellic.org>

bristol_poo wrote:

> This would produce 7 variants of the U+1F4A9 emoji, including existing
> (Which I believe is about Type 4 on the scale).
>
> Why? I think this would really benefit the medical profession, with a
> large uptick in e-doctor/text only conversations towards the medical
> profession.

If physicians and other medical professionals are relying on emoji, in any way and at any time, to determine diagnosis and treatment, the state of health care is much worse than I thought.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Jun 1 17:11:08 2019
From: unicode at unicode.org (Tex via Unicode)
Date: Sat, 1 Jun 2019 15:11:08 -0700
Subject: Proposal to extend the U+1F4A9 Symbol
In-Reply-To: <000001d518bd$779c5f80$66d51e80$@ewellic.org>
References: <000001d518bd$779c5f80$66d51e80$@ewellic.org>
Message-ID: <000301d518c6$ea823250$bf8696f0$@xencraft.com>

What I would find useful is an emoji for when my phone falls into the toilet.

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell via Unicode
Sent: Saturday, June 1, 2019 2:04 PM
To: unicode at unicode.org
Subject: Re: Proposal to extend the U+1F4A9 Symbol

bristol_poo wrote:

> This would produce 7 variants of the U+1F4A9 emoji, including existing
> (Which I believe is about Type 4 on the scale).
>
> Why? I think this would really benefit the medical profession, with a
> large uptick in e-doctor/text only conversations towards the medical
> profession.

If physicians and other medical professionals are relying on emoji, in any way and at any time, to determine diagnosis and treatment, the state of health care is much worse than I thought.
--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Jun 1 17:30:51 2019
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 1 Jun 2019 16:30:51 -0600
Subject: Proposal to extend the U+1F4A9 Symbol
In-Reply-To: <000301d518c6$ea823250$bf8696f0$@xencraft.com>
References: <000001d518bd$779c5f80$66d51e80$@ewellic.org> <000301d518c6$ea823250$bf8696f0$@xencraft.com>
Message-ID: <000701d518c9$abe2fa90$03a8efb0$@ewellic.org>

Tex wrote:

> What I would find useful is an emoji for when my phone falls into the
> toilet.

I would have thought ????? would be sufficient.

But I didn't include any variation selectors and combining sequences for the gender, skin color, hair style, profession, and current state of mind of the phone's owner, and there are none for the brand and model of phone and toilet. So the sequence above is clearly inadequate for people to express themselves.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Jun 1 17:42:37 2019
From: unicode at unicode.org (Andrew West via Unicode)
Date: Sat, 1 Jun 2019 23:42:37 +0100
Subject: Proposal to extend the U+1F4A9 Symbol
In-Reply-To: <000701d518c9$abe2fa90$03a8efb0$@ewellic.org>
References: <000001d518bd$779c5f80$66d51e80$@ewellic.org> <000301d518c6$ea823250$bf8696f0$@xencraft.com> <000701d518c9$abe2fa90$03a8efb0$@ewellic.org>
Message-ID:

On Sat, 1 Jun 2019 at 23:32, Doug Ewell via Unicode wrote:
>
> Tex wrote:
>
> > What I would find useful is an emoji for when my phone falls into the
> > toilet.
>
> I would have thought ? would be sufficient.

Don't worry, a brand new foolproof method of defining emoji for anything in the universe using Wikidata QIDs is coming to a phone near you soon (http://www.unicode.org/L2/L2019/19082r-qid-emoji.pdf) ... oh, there is no Wikidata QID for phone dropped in the toilet.
Andrew

From unicode at unicode.org Sat Jun 1 19:26:17 2019
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 1 Jun 2019 18:26:17 -0600
Subject: Proposal to extend the U+1F4A9 Symbol
In-Reply-To:
References: <000001d518bd$779c5f80$66d51e80$@ewellic.org> <000301d518c6$ea823250$bf8696f0$@xencraft.com> <000701d518c9$abe2fa90$03a8efb0$@ewellic.org>
Message-ID: <000001d518d9$cbad9be0$6308d3a0$@ewellic.org>

Andrew West wrote:

> oh, there is no Wikidata QID for phone dropped in the toilet.

It's Wikidata, right? Pretty much anyone can create an item for pretty much anything, right? Problem solved.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Tue Jun 11 03:02:30 2019
From: unicode at unicode.org (Fred Brennan via Unicode)
Date: Tue, 11 Jun 2019 16:02:30 +0800
Subject: Question about U+170D, which I hope will become TAGALOG LETTER RA
Message-ID: <50334777.k4CkLEVrbN@pc>

Greetings,

I write this letter with questions regarding a proposal I hope to make for the encoding of TAGALOG LETTER RA, which we locally know as the baybayin letter "ra", at U+170D. Many fonts are already using this unencoded codepoint for TAGALOG LETTER RA in breach of the standard. TAGALOG LETTER RA looks like TAGALOG LETTER DA, U+1707, with an extra stroke.
For examples, see Norman de los Santos' Unicode baybayin fonts.[2] Paul Morrow's fonts, which are used on the Philippine peso, also include "ra" outside of the ones meant to be exact digitizations of the first baybayin fonts.[4]

I had previously assumed that this space had been left open in anticipation of the future encoding of TAGALOG LETTER RA, but that this hadn't happened due to apathy; however I've since been informed that the space was left open as an oversight of sorts, considering that four Philippine scripts were encoded at once as a result of WG2 proposal N1933.[1]

I hope to request this as the Google Noto developers will not follow the de facto standard unless it is given the Consortium's approval.[3]

My questions are:

• How old do I need to prove the letter is? Baybayin "ra" is not used in writing Old Tagalog and is not used in the earliest Tagalog texts. However, it certainly has existed since at least 1985,[4; under heading Bikol Mintz] and perhaps decades earlier.
• May I use signs and fonts as evidence? What types of documents may I use?
• Would anyone volunteer to help me write this proposal, or check it over before I send it?

Thank you.

[1]: https://www.unicode.org/L2/L1999/n1933.pdf
[2]: http://nordenx.blogspot.com/p/downloads.html
[3]: https://github.com/googlefonts/noto-fonts/issues/1185
[4]: http://paulmorrow.ca/fonts.htm

From unicode at unicode.org Mon Jun 17 18:27:41 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Mon, 17 Jun 2019 19:27:41 -0400
Subject: Watermarking with Apostrophes
Message-ID: <84499338-8eb9-5e68-0052-0ba3149c1d71@kli.org>

An HTML attachment was scrubbed...

From unicode at unicode.org Fri Jun 21 19:14:07 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Fri, 21 Jun 2019 20:14:07 -0400
Subject: Unicode "no-op" Character?
Message-ID: <002401d5288f$6919cab0$3b4d6010$@gmail.com>

Does Unicode include a character that does nothing at all?
I'm talking about something that can be used for padding data without affecting interpretation of other characters, including combining chars and ligatures. I.e. a character that could hypothetically be inserted between a latin E and a combining acute and still produce É. The historical description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what I want. It only has one slight disadvantage: it doesn't work. All software I've tried displays it as an unknown character and it definitely breaks up combinations. And U+0000 NULL seems even worse.

I can imagine the answer is that this thing I'm looking for isn't a character at all and so should be the business of "a higher-level protocol" and not what Unicode was made for... but Unicode does include some odd things so I wonder if there is something like that regardless. Can anyone offer any suggestions?

Sławomir Osipiuk

From unicode at unicode.org Fri Jun 21 23:51:52 2019
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Sat, 22 Jun 2019 04:51:52 +0000
Subject: Unicode "no-op" Character?
In-Reply-To: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
Message-ID:

I'm curious what you'd use it for?

From: Unicode On Behalf Of Slawomir Osipiuk via Unicode
Sent: Friday, June 21, 2019 5:14 PM
To: unicode at unicode.org
Subject: Unicode "no-op" Character?

Does Unicode include a character that does nothing at all? I'm talking about something that can be used for padding data without affecting interpretation of other characters, including combining chars and ligatures. I.e. a character that could hypothetically be inserted between a latin E and a combining acute and still produce É. The historical description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what I want. It only has one slight disadvantage: it doesn't work.
All software I've tried displays it as an unknown character and it definitely breaks up combinations. And U+0000 NULL seems even worse.

I can imagine the answer is that this thing I'm looking for isn't a character at all and so should be the business of "a higher-level protocol" and not what Unicode was made for... but Unicode does include some odd things so I wonder if there is something like that regardless. Can anyone offer any suggestions?

Sławomir Osipiuk

From unicode at unicode.org Sat Jun 22 00:03:13 2019
From: unicode at unicode.org (J Decker via Unicode)
Date: Fri, 21 Jun 2019 22:03:13 -0700
Subject: Unicode "no-op" Character?
In-Reply-To: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
Message-ID:

Sounds like a great use for ZWNBSP (zero width non-breaking space) 0xFEFF (Also used as BOM) or that doesn't break; maybe 'ZERO WIDTH SPACE' (U+200B)

On Fri, Jun 21, 2019 at 9:48 PM Sławomir Osipiuk via Unicode <unicode at unicode.org> wrote:

> Does Unicode include a character that does nothing at all? I'm talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures. I.e. a character that could hypothetically be inserted between a
> latin E and a combining acute and still produce É. The historical
> description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what
> I want. It only has one slight disadvantage: it doesn't work. All software
> I've tried displays it as an unknown character and it definitely breaks up
> combinations. And U+0000 NULL seems even worse.
>
> I can imagine the answer is that this thing I'm looking for isn't a
> character at all and so should be the business of "a higher-level protocol"
> and not what Unicode was made for...
> but Unicode does include some odd things
> so I wonder if there is something like that regardless. Can anyone offer
> any suggestions?
>
> Sławomir Osipiuk

From unicode at unicode.org Sat Jun 22 02:37:22 2019
From: unicode at unicode.org (Alex Plantema via Unicode)
Date: Sat, 22 Jun 2019 09:37:22 +0200
Subject: Unicode "no-op" Character?
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
Message-ID: <90FE46C591CA4BADAC5EEC84886EA0A6@p4>

On Saturday 22 June 2019 02:14, Sławomir Osipiuk via Unicode wrote:

> Does Unicode include a character that does nothing at all? I'm
> talking about something that can be used for padding data without
> affecting interpretation of other characters, including combining
> chars and ligatures. I.e. a character that could hypothetically be
> inserted between a latin E and a combining acute and still produce É.
> The historical description of U+0016 SYNCHRONOUS IDLE seems like
> pretty much exactly what I want. It only has one slight disadvantage:
> it doesn't work. All software I've tried displays it as an unknown
> character and it definitely breaks up combinations. And U+0000 NULL
> seems even worse.

DEL was used as such on paper tape to replace errors.

Alex.

From unicode at unicode.org Sat Jun 22 03:37:12 2019
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Sat, 22 Jun 2019 10:37:12 +0200
Subject: Unicode "no-op" Character?
In-Reply-To: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
Message-ID:

There is nothing like what you are describing. Examples:

1. Display - There are a few of the Default Ignorables that are always treated as invisible, and have little effect on other characters. However, even those will generally interfere with the display of sequences (eg between 'q' and U+0308 ( q̈ ); within emoji sequences, within ligatures, etc), line break, etc.
2. Interpretation - There is no character that would always be ignored by all processes. Some processes may ignore some characters (eg a search indexer may ignore most default ignorables), but there is nothing that all processes will ignore.

The only exception would be if some cooperating processes had agreed beforehand to strip some particular character.

Mark

On Sat, Jun 22, 2019 at 6:49 AM Sławomir Osipiuk via Unicode <unicode at unicode.org> wrote:

> Does Unicode include a character that does nothing at all? I'm talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures. I.e. a character that could hypothetically be inserted between a
> latin E and a combining acute and still produce É. The historical
> description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what
> I want. It only has one slight disadvantage: it doesn't work. All software
> I've tried displays it as an unknown character and it definitely breaks up
> combinations. And U+0000 NULL seems even worse.
>
> I can imagine the answer is that this thing I'm looking for isn't a
> character at all and so should be the business of "a higher-level protocol"
> and not what Unicode was made for... but Unicode does include some odd things
> so I wonder if there is something like that regardless. Can anyone offer
> any suggestions?
>
> Sławomir Osipiuk

From unicode at unicode.org Sat Jun 22 09:38:07 2019
From: unicode at unicode.org (Rebecca T via Unicode)
Date: Sat, 22 Jun 2019 07:38:07 -0700
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
Message-ID:

Perhaps a codepoint from a private use area and another processing step to add/remove them would work for you?

On Sat, Jun 22, 2019, 1:39 AM Mark Davis ☕️
via Unicode wrote:

> There is nothing like what you are describing. Examples:
>
> 1. Display - There are a few of the Default Ignorables that are always
> treated as invisible, and have little effect on other characters. However,
> even those will generally interfere with the display of sequences (eg
> between 'q' and U+0308 ( q̈ ); within emoji sequences, within
> ligatures, etc), line break, etc.
> 2. Interpretation - There is no character that would always be ignored
> by all processes. Some processes may ignore some characters (eg a search
> indexer may ignore most default ignorables), but there is nothing that all
> processes will ignore.
>
> The only exception would be if some cooperating processes had agreed
> beforehand to strip some particular character.
>
> Mark
>
>
> On Sat, Jun 22, 2019 at 6:49 AM Sławomir Osipiuk via Unicode <
> unicode at unicode.org> wrote:
>
>> Does Unicode include a character that does nothing at all? I'm talking
>> about something that can be used for padding data without affecting
>> interpretation of other characters, including combining chars and
>> ligatures. I.e. a character that could hypothetically be inserted between a
>> latin E and a combining acute and still produce É. The historical
>> description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what
>> I want. It only has one slight disadvantage: it doesn't work. All software
>> I've tried displays it as an unknown character and it definitely breaks up
>> combinations. And U+0000 NULL seems even worse.
>>
>> I can imagine the answer is that this thing I'm looking for isn't a
>> character at all and so should be the business of "a higher-level protocol"
>> and not what Unicode was made for... but Unicode does include some odd things
>> so I wonder if there is something like that regardless. Can anyone offer
>> any suggestions?
>>
>> Sławomir Osipiuk
>>
From unicode at unicode.org Sat Jun 22 13:18:00 2019
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 22 Jun 2019 12:18:00 -0600
Subject: Unicode "no-op" Character?
Message-ID: <000001d52926$d3acc930$7b065b90$@ewellic.org>

Sławomir Osipiuk wrote:

> Does Unicode include a character that does nothing at all? I'm talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures.

I join Shawn Steele in wondering what your "data padding" use case is for such a character. Most modern protocols don't require string fields to be exactly N characters long, or have their own mechanism for storing the real string length and ignoring any padding characters.

If you just need to fill up space at the end of a line, and need a character that has as little disruptive meaning as possible, I agree that U+FEFF is probably the closest you'll get.

NULL, of course, was intended to serve exactly this purpose, but everyone has decided for themselves what the C0 code points should be used for, and "display a .notdef glyph" is one of the popular choices.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sat Jun 22 16:02:01 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Sat, 22 Jun 2019 17:02:01 -0400
Subject: Unicode "no-op" Character?
In-Reply-To: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
Message-ID: <001201d5293d$bd30bf10$37923d30$@gmail.com>

I see there is no such character, which I pretty much expected after Google didn't help.

The original problem I had was solved long ago but the recent article about watermarking reminded me of it, and my question was mostly out of curiosity. The task wasn't, strictly speaking, about "padding", but about marking - injecting "flag" characters at arbitrary points in a string without affecting the resulting visible text.
I think we ended up using ESC, which is a dumb choice in retrospect, though the whole approach was a bit of a hack anyway and the process it was for isn't being used anymore.

From unicode at unicode.org Sat Jun 22 16:19:14 2019
From: unicode at unicode.org (J Decker via Unicode)
Date: Sat, 22 Jun 2019 14:19:14 -0700
Subject: Unicode "no-op" Character?
In-Reply-To: <001201d5293d$bd30bf10$37923d30$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com>
Message-ID:

On Sat, Jun 22, 2019 at 2:04 PM Sławomir Osipiuk via Unicode <unicode at unicode.org> wrote:

> I see there is no such character, which I pretty much expected after
> Google didn't help.
>
> The original problem I had was solved long ago but the recent article
> about watermarking reminded me of it, and my question was mostly out of
> curiosity. The task wasn't, strictly speaking, about "padding", but about
> marking - injecting "flag" characters at arbitrary points in a string
> without affecting the resulting visible text. I think we ended up using
> ESC, which is a dumb choice in retrospect, though the whole approach was a
> bit of a hack anyway and the process it was for isn't being used anymore.
>

The spec would suggest that there are escape codes like that which can be used: APC (U+009F) and ST, String Terminator (U+009C), which are supposed to enclose a sequence of characters that should not be displayed but may be used to control the application displaying them (assuming it understands them).

https://www.aivosto.com/articles/control-characters.html

156  $9C  ST   String Terminator            234  9/12  ST  ESC \
     Closes a string opened by APC, DCS, OSC, PM or SOS.
159  $9F  APC  Application Program Command  237  9/15  AC  ESC _
     Starts an application program command string. ST will end the command. The interpretation of the command is subject to the program in question.
But it doesn't appear anything actually 'supports' that.

From unicode at unicode.org Sat Jun 22 16:50:49 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Sat, 22 Jun 2019 17:50:49 -0400
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com>
Message-ID: <002601d52944$8ed18b70$ac74a250$@gmail.com>

Indeed. There are plenty of control characters that seem useful, but they really aren't, due to lack of support from common software. Unicode is deliberately silent about most of them, which is fair, but not always convenient.

If faced with the same problem today, I'd probably just go with U+FEFF (really only need a single char, not a whole delimited substring) or a different C0 control (maybe SI/LS0) and clean up the string if it needs to be presented to the user.

I still think an "idle"/"null tag"/"noop" character would be a neat addition to Unicode, but I doubt I can make a convincing enough case for it.

From: J Decker [mailto:d3ck0r at gmail.com]
Sent: Saturday, June 22, 2019 17:19
To: Sławomir Osipiuk
Cc: Unicode Discussion
Subject: Re: Unicode "no-op" Character?

But it doesn't appear anything actually 'supports' that.

From unicode at unicode.org Sat Jun 22 18:16:38 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 23 Jun 2019 00:16:38 +0100
Subject: Unicode "no-op" Character?
In-Reply-To: <002601d52944$8ed18b70$ac74a250$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com>
Message-ID: <20190623001638.262a79ff@JRWUBU2>

On Sat, 22 Jun 2019 17:50:49 -0400
Sławomir Osipiuk via Unicode wrote:

> If faced with the same problem today, I'd
> probably just go with U+FEFF (really only need a single char, not a
> whole delimited substring) or a different C0 control (maybe SI/LS0)
> and clean up the string if it needs to be presented to the user.

You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better U+2060 WJ) and U+200B (ZWSP).

> I still think an "idle"/"null tag"/"noop" character would be a neat
> addition to Unicode, but I doubt I can make a convincing enough case
> for it.

You'd still only be able to insert it between characters, not between code units, unless you were using UTF-32.

Richard.

From unicode at unicode.org Sat Jun 22 18:56:11 2019
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Sat, 22 Jun 2019 23:56:11 +0000
Subject: Unicode "no-op" Character?
In-Reply-To: <20190623001638.262a79ff@JRWUBU2>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2>
Message-ID:

Assuming you were using any of those characters as "markup", how would you know when they were intentionally in the string and not part of your marking system?

-----Original Message-----
From: Unicode On Behalf Of Richard Wordingham via Unicode
Sent: Saturday, June 22, 2019 4:17 PM
To: unicode at unicode.org
Subject: Re: Unicode "no-op" Character?
On Sat, 22 Jun 2019 17:50:49 -0400
Sławomir Osipiuk via Unicode wrote:

> If faced with the same problem today, I'd probably just go with U+FEFF
> (really only need a single char, not a whole delimited substring) or a
> different C0 control (maybe SI/LS0) and clean up the string if it
> needs to be presented to the user.

You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better U+2060 WJ) and U+200B (ZWSP).

> I still think an "idle"/"null tag"/"noop" character would be a neat
> addition to Unicode, but I doubt I can make a convincing enough case
> for it.

You'd still only be able to insert it between characters, not between code units, unless you were using UTF-32.

Richard.

From unicode at unicode.org Sat Jun 22 18:56:50 2019
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Sat, 22 Jun 2019 23:56:50 +0000
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com>
Message-ID:

+ the list. For some reason the list's reply header is confusing.

From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk
Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and the combining diacritic seems peculiar. I'm having a hard time visualizing how that kind of markup could be interesting?

From: Unicode On Behalf Of Slawomir Osipiuk via Unicode
Sent: Saturday, June 22, 2019 2:02 PM
To: unicode at unicode.org
Subject: RE: Unicode "no-op" Character?

I see there is no such character, which I pretty much expected after Google didn't help.

The original problem I had was solved long ago but the recent article about watermarking reminded me of it, and my question was mostly out of curiosity. The task wasn't, strictly speaking, about "padding", but about marking - injecting "flag" characters at arbitrary points in a string without affecting the resulting visible text.
I think we ended up using ESC, which is a dumb choice in retrospect, though the whole approach was a bit of a hack anyway and the process it was for isn't being used anymore.

From unicode at unicode.org Sat Jun 22 19:07:17 2019
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Sun, 23 Jun 2019 02:07:17 +0200
Subject: Aw: Unicode "no-op" Character?
In-Reply-To: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com>
Message-ID:

Combining Grapheme Joiner (U+034F) is probably what you want as it is default ignorable and keeps the acute on top of the E. However it may break languages with di- and trigraphs or complex diacritics.

Best regards

Marius

> Sent: Saturday, 22 June 2019 at 02:14
> From: "Sławomir Osipiuk via Unicode"
> To: unicode at unicode.org
> Subject: Unicode "no-op" Character?
>
> Does Unicode include a character that does nothing at all? I'm talking about
> something that can be used for padding data without affecting interpretation
> of other characters, including combining chars and ligatures. I.e. a
> character that could hypothetically be inserted between a latin E and a
> combining acute and still produce É. The historical description of U+0016
> SYNCHRONOUS IDLE seems like pretty much exactly what I want. It only has one
> slight disadvantage: it doesn't work. All software I've tried displays it as
> an unknown character and it definitely breaks up combinations. And U+0000
> NULL seems even worse.
>
> I can imagine the answer is that this thing I'm looking for isn't a
> character at all and so should be the business of "a higher-level protocol"
> and not what Unicode was made for... but Unicode does include some odd things
> so I wonder if there is something like that regardless. Can anyone offer any
> suggestions?
>
> Sławomir Osipiuk

From unicode at unicode.org Sat Jun 22 19:26:09 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Sat, 22 Jun 2019 20:26:09 -0400
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com>
Message-ID: <004301d5295a$4213ffa0$c63bfee0$@gmail.com>

I assure you, it wasn't very interesting. :-) Headache-y, more like. The diacritic thing was completely inapplicable anyway, as all our text was plain English. I really don't want to get into what the thing was, because it sounds stupider the more I try to explain it. But it got the wheels spinning in my head, and now that I've been reading up a lot about Unicode and older standards like 2022/6429, it got me thinking whether there might already be an elegant solution.

But, as an example I'm making up right now, imagine you want to packetize a large string. The packets are not all equal sized, the sizes are determined by some algorithm. And the packet boundary may occur between a base char and a diacritic. You insert markers into the string at the packet boundaries. You can then store the string, copy it, display it, or pass it to the sending function which will scan the string and know to send the next packet when it reaches the marker. And you can now do all that without the need to pass around extra metadata (like a list of ints of where the packet boundaries are supposed to be) or to re-calculate the boundaries; it's still just a big string. If a different application sees the string, it will know to completely ignore the packet markers; it can even strip them out if it wants to (the canonical equivalent of the noop character is the absence of a character).

As should be obvious, I'm not recommending this as good practice.
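[Editorial sketch, not part of the original message.] The made-up packetizing example above could look roughly like the Python below. Since the thread concludes that Unicode has no real no-op character, the marker here is an assumption: a private-use code point (U+E000), along the lines of the private-use suggestion made earlier in the thread. The names `MARKER`, `mark_packets`, `packets` and `strip_markers` are hypothetical, and the scheme only works between processes that agree on the marker beforehand.

```python
# Sketch only: MARKER is a hypothetical choice (a private-use code point),
# not a character that Unicode defines as a no-op.
MARKER = "\uE000"


def mark_packets(text: str, sizes: list[int]) -> str:
    """Insert MARKER at packet boundaries; sizes are counted in code points."""
    if MARKER in text:
        # A marker already present in the data would be mistaken for ours.
        raise ValueError("input already contains the marker character")
    parts, pos = [], 0
    for size in sizes:
        parts.append(text[pos:pos + size])
        pos += size
    parts.append(text[pos:])  # whatever is left forms the last packet
    return MARKER.join(part for part in parts if part)


def packets(marked: str) -> list[str]:
    """What the sending function would do: split at the markers."""
    return marked.split(MARKER)


def strip_markers(marked: str) -> str:
    """What any other process should do before display or normalization."""
    return marked.replace(MARKER, "")


original = "re\u0301sume\u0301"          # decomposed "résumé", 8 code points
marked = mark_packets(original, [2, 3])  # first boundary splits e from its acute
```

Here `strip_markers(marked)` gives back `original`, while `packets(marked)` yields the three chunks. The `ValueError` check is one possible answer to the question raised in the thread about telling intentional occurrences of the character apart from the marking system.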
From: Shawn Steele [mailto:Shawn.Steele at microsoft.com]
Sent: Saturday, June 22, 2019 19:57
To: Sławomir Osipiuk; unicode at unicode.org
Subject: RE: Unicode "no-op" Character?

+ the list. For some reason the list's reply header is confusing.

From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk
Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and the combining diacritic seems peculiar. I'm having a hard time visualizing how that kind of markup could be interesting?

From: Unicode On Behalf Of Slawomir Osipiuk via Unicode
Sent: Saturday, June 22, 2019 2:02 PM
To: unicode at unicode.org
Subject: RE: Unicode "no-op" Character?

I see there is no such character, which I pretty much expected after Google didn't help.

The original problem I had was solved long ago but the recent article about watermarking reminded me of it, and my question was mostly out of curiosity. The task wasn't, strictly speaking, about "padding", but about marking - injecting "flag" characters at arbitrary points in a string without affecting the resulting visible text. I think we ended up using ESC, which is a dumb choice in retrospect, though the whole approach was a bit of a hack anyway and the process it was for isn't being used anymore.

From unicode at unicode.org Sat Jun 22 19:59:16 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 23 Jun 2019 01:59:16 +0100
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2>
Message-ID: <20190623015916.3cbce1f1@JRWUBU2>

On Sat, 22 Jun 2019 23:56:11 +0000
Shawn Steele via Unicode wrote:

> Assuming you were using any of those characters as "markup", how
> would you know when they were intentionally in the string and not
> part of your marking system?

If they're conveying an invisible message, one would have to strip out original ZWNBSP/WJ/ZWSP that didn't affect line-breaking. The weak point is that that assumes that line-break opportunities are well-defined. For example, they aren't for SE Asian text.

Richard.

From unicode at unicode.org Sat Jun 22 20:10:08 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Sat, 22 Jun 2019 21:10:08 -0400
Subject: Unicode "no-op" Character?
In-Reply-To: <20190623015916.3cbce1f1@JRWUBU2>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2> <20190623015916.3cbce1f1@JRWUBU2>
Message-ID: <004e01d52960$669b4a30$33d1de90$@gmail.com>

That's the key to the no-op idea. The no-op character could not ever be assumed to survive interchange with another process. It'd be canonically equivalent to the absence of character. It could be added or removed at any position by a Unicode-conformant process. A program could wipe all the no-ops from a string it has received, and insert its own for its own purposes. (In fact, it should wipe the old ones so as not to confuse itself.) It's "another process's discardable junk" unless known, internally-only, to be meaningful at a particular stage.

While all the various (non)joiners/ignorables are interesting, none of them have this property.
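[Editorial sketch, not part of the original message.] One way to check the "canonically equivalent to the absence of a character" property for the candidates named in this thread is to compare NFC normalization with and without the candidate between a base letter and a combining mark, using Python's standard `unicodedata` module. This says nothing about rendering, which is a separate question; the names `CANDIDATES` and `equivalent_to_absence` are hypothetical.

```python
import unicodedata

# Candidates suggested at various points in the thread.
CANDIDATES = {
    "ZWSP": "\u200B",    # ZERO WIDTH SPACE
    "WJ": "\u2060",      # WORD JOINER
    "ZWNBSP": "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE (also the BOM)
    "CGJ": "\u034F",     # COMBINING GRAPHEME JOINER
}


def equivalent_to_absence(filler: str) -> bool:
    """True if E + filler + combining acute normalizes like E + combining acute."""
    with_filler = unicodedata.normalize("NFC", "E" + filler + "\u0301")
    without = unicodedata.normalize("NFC", "E\u0301")
    return with_filler == without


results = {name: equivalent_to_absence(ch) for name, ch in CANDIDATES.items()}
```

Every entry in `results` comes out `False`: each candidate sits between the base and the mark and blocks their composition, so a process that normalizes the text sees a different string. That matches the statement above that none of the existing joiners or ignorables are "discardable".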
In fact, that might be the best description: It's not just an "ignorable", it's a "discardable". Unicode doesn't have that, does it?

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham via Unicode
Sent: Saturday, June 22, 2019 20:59
To: unicode at unicode.org
Cc: Shawn Steele
Subject: Re: Unicode "no-op" Character?

If they're conveying an invisible message, one would have to strip out original ZWNBSP/WJ/ZWSP that didn't affect line-breaking. The weak point is that that assumes that line-break opportunities are well-defined. For example, they aren't for SE Asian text.

Richard.

From unicode at unicode.org Sun Jun 23 03:24:50 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 23 Jun 2019 09:24:50 +0100
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com>
Message-ID: <20190623092450.54576fdf@JRWUBU2>

On Sat, 22 Jun 2019 23:56:50 +0000
Shawn Steele via Unicode wrote:

> + the list. For some reason the list's reply header is confusing.
>
> From: Shawn Steele
> Sent: Saturday, June 22, 2019 4:55 PM
> To: Sławomir Osipiuk
> Subject: RE: Unicode "no-op" Character?
>
> The original comment about putting it between the base character and
> the combining diacritic seems peculiar. I'm having a hard time
> visualizing how that kind of markup could be interesting?

There are a number of possible interesting scenarios:

1) Chopping the string into user-perceived characters. For example, the Khmer sequences of COENG plus letter are named sequences. Akin to this is identifying resting places for a simple cursor, e.g. allowing it to be positioned between a base character and a spacing, unreordered subscript. (This last possibility overlaps with rendering.)

2) Chopping the string into collating elements.
(This can require renormalisation, and may raise a rendering issue with HarfBuzz, where renormalisation is required to get marks into a suitable order for shaping. I suspect no-op characters would disrupt this renormalisation; CGJ may legitimately be used to affect rendering this way, even though it is supposed to have no other effect* on rendering.)

3) Chopping the string into default grapheme clusters. That separates a coeng from the following character with which it interacts.

*Is a Unicode-compliant *renderer* allowed to distinguish diaeresis from the umlaut mark?

Richard.

From unicode at unicode.org Sun Jun 23 03:37:04 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 23 Jun 2019 09:37:04 +0100
Subject: Unicode "no-op" Character?
In-Reply-To: <004e01d52960$669b4a30$33d1de90$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2> <20190623015916.3cbce1f1@JRWUBU2> <004e01d52960$669b4a30$33d1de90$@gmail.com>
Message-ID: <20190623093704.5617fa18@JRWUBU2>

On Sat, 22 Jun 2019 21:10:08 -0400
Sławomir Osipiuk via Unicode wrote:

> In fact, that might be the best description: It's not just an
> "ignorable", it's a "discardable". Unicode doesn't have that, does it?

No, though the byte order mark at the start of a file comes close.

Discardables are a security risk, as security filters may find it hard to take them into account.

Richard.

From unicode at unicode.org Sun Jun 23 10:33:59 2019
From: unicode at unicode.org (梁海 Liang Hai via Unicode)
Date: Sun, 23 Jun 2019 17:33:59 +0200
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To: <20190514204925.661f1ff6@JRWUBU2>
References: <20190514030804.17e1b37b@JRWUBU2> <20190514204925.661f1ff6@JRWUBU2>
Message-ID: <83153250-B8F3-4120-BF7A-8C447549E563@gmail.com>

> (1) When can we anticipate that the USE spec will be updated to provide support for subjoined consonants below vowels (as required for TAI THAM) ?

- The exact scope is actually about allowing conjoined consonant forms (either encoded with a stacker, or encoded atomically?) after vowel signs in an encoded cluster.

> ** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37 , transcribed to Central Thai script as ???, (to kiss). Currently, people are writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("???") which violates the "phonetic ordering" but is the current workaround because USE is still broken for TAI THAM.

- I agree with Richard that this is really not a good use case. This word (as long as it is written with the vowel sign Uu either under or after the conjoined consonant sign B) should really be encoded as , according to our best understanding today.
- The "phonetic ordering" principle of Unicode is a frequently misinterpreted one. Note that when there are multiple ways of interpreting the phonetic order of a written structure, we try to stick to the more graphically apparent order, in order to have a stable encoding order.

> An example of the contrast is shown in the attached files luynam.png, with first orthographic syllable , and yukya.png, with the first orthographic syllable .

- Right. I was always wondering to what extent this distinction happens as an orthographically conscious choice.
- Generally I feel, when at least one of the interacting signs (usually a consonant one and a vowel one) has inline advance, it should be safe to take a graphic order approach. The "6th preliminary recommendation" doesn't take the luynam vs yukya case into consideration, mostly because we weren't sure what good attestations there are.
> * Create new SAKOT class SAKOT (Sk) based on UISC = Invisible_Stacker
> * Reduced HALANT class Now only HALANT (H) based on UISC = Virama

- This feels like an undesirable Tham-specific relaxation. Note the artificial distinction between UISC Invisible_Stacker and Virama has nothing to do with whether graphically writing a consonant sign after a vowel sign is attested for a script. (??)
- At least we need to look into USE-applicable (existing and future) scripts encoded with a Virama and see if any of them does need the relaxation.

> * Updated Standard cluster mode
> [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)) [MPre] [MAbv] [MBlw] [MPst] (VPre)* (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B)* (FAbv)* (FBlw)* (FPst)* [FM]

- I'm still trying to think about the possibility of only relaxing the cluster when either/both of has post-base advance?
- The artificial distinction made between < H | Sk > B, SUB, and CM really needs to be resolved together with the relaxation.

> * Updated Halant-terminated cluster
> [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)) < H | Sk >

- So, the intention of allowing Sk at the end is only about allowing the glyph of Sk to be positioned on the preceding character(s), right?

> * New Sakot-terminated cluster
> [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)) [MPre] [MAbv] [MBlw] [MPst] (VPre)* (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B [VS] (CMAbv)* (CMBlw)) Sk

- The "(Sk B [VS] (CMAbv)* (CMBlw)) Sk" part doesn't seem to align with the updated Standard cluster's "(Sk B)*"?
> I trust you'll be reclassifying U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA and U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA into the category SUB so that we can write about bananas forever (?????????????):
>
> /kluai/ 'banana'
> /t?al??t/ 'for ever'
>
> The issues here are that WA in a medial rôle is indistinguishable from a coda ('sakot') consonant and that MEDIAL RA can act as a consonant aspirator.

- The issues here are:
  - Medial consonant sign characters of Tham are not encoded based on a clear phono-orthographical distinction.
  - Tham allows syllable chaining that does not rely on a preceding inline coda letter.
- Consonant sign Medial Ra being a consonant aspirator here is not relevant to its appearance before a non-medial consonant sign here.

Best,
梁海 Liang Hai
https://lianghai.github.io

From unicode at unicode.org Sun Jun 23 19:27:37 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Sun, 23 Jun 2019 20:27:37 -0400
Subject: Unicode "no-op" Character?
In-Reply-To: <20190623093704.5617fa18@JRWUBU2>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2> <20190623015916.3cbce1f1@JRWUBU2> <004e01d52960$669b4a30$33d1de90$@gmail.com> <20190623093704.5617fa18@JRWUBU2>
Message-ID: <002a01d52a23$a0a72c30$e1f58490$@gmail.com>

On the subject of security, I've read through:

https://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters

which says:

"The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X.
In that case, the sensitive sequence of characters is formed, and can lead to a security breach."

Checking a string for a sequence of characters, then passing the string to a different function which (potentially) modifies it, then using it in a context where the security checks matter, just screams bad practice to me. There should be no modification permitted between a security check and security-sensitive use. The string should always be sanitized before being checked for exploits. Any function which modifies the characters in any way (and is not itself security-aware) should implicitly mark the string as unsafe again.

Or am I off base? Security is not really my specialty, but the approach described in the TR stinks horribly to me.

And in my idea, noops would be stripped as part of string sanitization. But the more I consider it, the more I understand such a thing would have had to be built into Unicode at the earliest stages. Basically, it's too late now.

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham via Unicode
Sent: Sunday, June 23, 2019 04:37
To: unicode at unicode.org
Subject: Re: Unicode "no-op" Character?

Discardables are a security risk, as security filters may find it hard to take them into account.

Richard.

From unicode at unicode.org Sun Jun 23 19:30:37 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Sun, 23 Jun 2019 20:30:37 -0400
Subject: Unicode "no-op" Character?
In-Reply-To: <002a01d52a23$a0a72c30$e1f58490$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2> <20190623015916.3cbce1f1@JRWUBU2> <004e01d52960$669b4a30$33d1de90$@gmail.com> <20190623093704.5617fa18@JRWUBU2> <002a01d52a23$a0a72c30$e1f58490$@gmail.com>
Message-ID: <002b01d52a24$0b8718d0$22954a70$@gmail.com>

Ah, sorry.
I meant to say that the string should always be normalized (not "sanitized") before being checked for exploits (i.e. sanitized).

-----Original Message-----
From: Sławomir Osipiuk [mailto:sosipiuk at gmail.com]
Sent: Sunday, June 23, 2019 20:28
To: unicode at unicode.org
Cc: 'Richard Wordingham'
Subject: RE: Unicode "no-op" Character?

The string should always be sanitized before being checked for exploits

From unicode at unicode.org Mon Jun 24 00:39:15 2019
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Mon, 24 Jun 2019 05:39:15 +0000
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2> <20190623015916.3cbce1f1@JRWUBU2> <004e01d52960$669b4a30$33d1de90$@gmail.com>
Message-ID:

But... it's not actually discardable. The hypothetical "packet" architecture (using the term architecture somewhat loosely) needed the information being tunneled in by this character. If it was actually discardable, then the "noop" character wouldn't be required as it would be discarded.

Since the character conveys meaning to some parts of the system, then it's not actually a "noop" and it's not actually "discardable".

What is actually being requested isn't a character that nobody has meaning for, but rather a character that has no PUBLIC meaning.

Which leads us to the key. The desire is for a character that has no public meaning, but has some sort of private meaning. In other words it has a private use. Oddly enough, there is a group of characters intended for private use, in the PUA ;-)

Of course if the PUA characters interfered with the processing of the string, they'd need to be stripped, but you're sort of already in that position by having a private flag in the middle of a string.
-Shawn

-----Original Message-----
From: Unicode On Behalf Of Slawomir Osipiuk via Unicode
Sent: Saturday, June 22, 2019 6:10 PM
To: unicode at unicode.org
Cc: 'Richard Wordingham'
Subject: RE: Unicode "no-op" Character?

That's the key to the no-op idea. The no-op character could not ever be assumed to survive interchange with another process. It'd be canonically equivalent to the absence of character. It could be added or removed at any position by a Unicode-conformant process. A program could wipe all the no-ops from a string it has received, and insert its own for its own purposes. (In fact, it should wipe the old ones so as not to confuse itself.) It's "another process's discardable junk" unless known, internally-only, to be meaningful at a particular stage.

While all the various (non)joiners/ignorables are interesting, none of them have this property.

In fact, that might be the best description: It's not just an "ignorable", it's a "discardable". Unicode doesn't have that, does it?

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham via Unicode
Sent: Saturday, June 22, 2019 20:59
To: unicode at unicode.org
Cc: Shawn Steele
Subject: Re: Unicode "no-op" Character?

If they're conveying an invisible message, one would have to strip out original ZWNBSP/WJ/ZWSP that didn't affect line-breaking. The weak point is that that assumes that line-break opportunities are well-defined. For example, they aren't for SE Asian text.

Richard.

From unicode at unicode.org Mon Jun 24 11:47:58 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Mon, 24 Jun 2019 12:47:58 -0400
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2> <20190623015916.3cbce1f1@JRWUBU2> <004e01d52960$669b4a30$33d1de90$@gmail.com>
Message-ID: <000a01d52aac$94affa40$be0feec0$@gmail.com>

It's discardable outside of the context/process that created it.

For a receiving process there is a difference between "this character has a meaning you don't understand" and "this character had a transitory meaning that has been exhausted".

The first implies that it needs to be preserved and survive round-trip transmission (in fact the Unicode standard requires that). The second implies that it can be discarded.

The first implies that it should be displayed to the user even if only as an "unknown something here". The second implies it should be ignored completely in display.

Noncharacters have a use as internal-only sentinels, but they are difficult for an intermediate process to use if the text it receives already contains them (http://www.unicode.org/faq/private_use.html#nonchar10) and they break up combinations (they have a display effect, even if it's a subtle one).

Private Use Characters are nice but they are still "part of" the text; if they are removed, the text is semantically changed. And they too display as something.

I have to go back to how the SYN control character is defined. ECMA-16/ISO 1745 says "SYN is generally removed at the receiving Terminal Installation." It has a transitory purpose that is exhausted as soon as it is received. I wish Unicode hadn't shied away from either formalizing SYN or providing some kind of equivalent. I know it wasn't part of the scope Unicode set for itself, but I can still dream.
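
A minimal sketch of the point about combinations, assuming Python's standard unicodedata module and using U+E000 as a purely hypothetical private marker (neither appears in the thread). A Private Use character placed between a base letter and a combining mark is itself a starter, so NFC no longer composes the pair; stripping the marker first restores the usual result.

```python
import unicodedata

# Hypothetical private flag: U+E000, the first code point of the BMP Private Use Area.
MARKER = "\uE000"


def strip_markers(text: str, marker: str = MARKER) -> str:
    """Remove every private marker, e.g. before the text leaves the process."""
    return text.replace(marker, "")


plain = "e\u0301"                   # 'e' followed by COMBINING ACUTE ACCENT
flagged = "e" + MARKER + "\u0301"   # same text with a marker wedged in between

composed_plain = unicodedata.normalize("NFC", plain)      # composes to U+00E9
composed_flagged = unicodedata.normalize("NFC", flagged)  # marker blocks composition
restored = unicodedata.normalize("NFC", strip_markers(flagged))
```

The marker also survives normalization untouched, which is the sense in which it stays "part of" the text until some process deliberately removes it.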
-----Original Message-----
From: Shawn Steele [mailto:Shawn.Steele at microsoft.com]
Sent: Monday, June 24, 2019 01:39
To: Sławomir Osipiuk; unicode at unicode.org
Cc: 'Richard Wordingham'
Subject: RE: Unicode "no-op" Character?

But... it's not actually discardable. The hypothetical "packet" architecture (using the term architecture somewhat loosely) needed the information being tunneled in by this character. If it was actually discardable, then the "noop" character wouldn't be required as it would be discarded.

Since the character conveys meaning to some parts of the system, then it's not actually a "noop" and it's not actually "discardable".

What is actually being requested isn't a character that nobody has meaning for, but rather a character that has no PUBLIC meaning.

Which leads us to the key. The desire is for a character that has no public meaning, but has some sort of private meaning. In other words it has a private use. Oddly enough, there is a group of characters intended for private use, in the PUA ;-)

Of course if the PUA characters interfered with the processing of the string, they'd need to be stripped, but you're sort of already in that position by having a private flag in the middle of a string.

-Shawn

From unicode at unicode.org Mon Jun 24 19:31:55 2019
From: unicode at unicode.org (David Starner via Unicode)
Date: Mon, 24 Jun 2019 17:31:55 -0700
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2> <20190623015916.3cbce1f1@JRWUBU2> <004e01d52960$669b4a30$33d1de90$@gmail.com>
Message-ID:

On Sun, Jun 23, 2019 at 10:41 PM Shawn Steele via Unicode wrote:
> Which leads us to the key. The desire is for a character that has no public meaning, but has some sort of private meaning. In other words it has a private use.
> Oddly enough, there is a group of characters intended for private use, in the PUA ;-)

Whose private use? If you have a stream of data that is being packetized for transmission, using a Private Use character is likely to mangle data that is being transmitted at some point. A NUL is likely to be the best option, IMO, since it's unlikely that anyone expects that they can transmit a NUL through an arbitrary channel, unlike a random private use character.

--
Kie ekzistas vivo, ekzistas espero.

From unicode at unicode.org Mon Jun 24 22:46:02 2019
From: unicode at unicode.org (J Decker via Unicode)
Date: Mon, 24 Jun 2019 20:46:02 -0700
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <002601d52944$8ed18b70$ac74a250$@gmail.com> <20190623001638.262a79ff@JRWUBU2> <20190623015916.3cbce1f1@JRWUBU2> <004e01d52960$669b4a30$33d1de90$@gmail.com>
Message-ID:

On Mon, Jun 24, 2019 at 5:35 PM David Starner via Unicode <unicode at unicode.org> wrote:

> On Sun, Jun 23, 2019 at 10:41 PM Shawn Steele via Unicode
> wrote:
>
> IMO, since it's unlikely that anyone expects
> that they can transmit a NUL through an arbitrary channel, unlike a
> random private use character.

You would be wrong. NUL is a valid code point like any other, except in the C standard library and its descendants. And I expect it to be preserved. And for the most part it is (except in emscripten).

>
> --
> Kie ekzistas vivo, ekzistas espero.

From unicode at unicode.org Tue Jun 25 19:27:47 2019
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Tue, 25 Jun 2019 20:27:47 -0400
Subject: New control characters! (was: Re: Unicode "no-op" Character?)
Message-ID: <001501d52bb5$fb86d450$f2947cf0$@gmail.com>

All right. Thanks to everyone who offered suggestions.
I think the final choice will depend on the specific application, if I ever face this puzzle again. If nothing else, this discussion has helped me formulate what exactly it is I'm imagining, which is actually a bit different than what I started with.

So, just to put it out there for the internet to archive (with the likes of the various proposed "unofficial" UTFs I've been reading about), here are my two proposed control characters (why just one when you can have two at twice the price?) Implementors, feel free to jump right on this. :-)

I chose to assign them to 0xE and 0xF because the use of ISO2022-style stateful shifts is expressly not permitted by ISO 10646, so by my reading the existence of those code points inside a UCS stream is a roundabout error. Therefore I'm reclaiming them for something useful.

EP1 - EPHEMERAL PRIVATE SENTINEL 1 (0x0E)

EP1 is executed as a null operation at the presentation layer. The formation of ligatures, the behavior of combining characters, and similar presentation mechanisms, must proceed as if EP1 were not present even when it occurs within sequences that effect such mechanisms. EP1 is intended to be used as a private process-internal sentinel or flag character. EP1 may be added at any position in the character stream. EP1 may be removed from the stream by any receiving process that has not established an agreement for special handling of EP1. EP1 should be removed from the stream prior to any security validation. It must not interfere with the recognition of security-sensitive keywords, sequences, or credentials.

EP2 - EPHEMERAL PRIVATE SENTINEL 2 (0x0F)

EP2 is executed as a null operation at the presentation layer. The formation of ligatures, the behavior of combining characters, and similar presentation mechanisms, must proceed as if EP2 were not present even when it occurs within sequences that effect such mechanisms. EP2 is intended to be used as a private process-internal sentinel or flag character.
EP2 may be added at any position in the character stream. EP2 may be removed from the stream by any receiving process that has not established an agreement for special handling of EP2. EP2 should be removed from the stream prior to any security validation. It must not interfere with the recognition of security-sensitive keywords, sequences, or credentials.

From unicode at unicode.org Sat Jun 29 14:46:57 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 29 Jun 2019 21:46:57 +0200
Subject: Unicode "no-op" Character?
In-Reply-To: <004301d5295a$4213ffa0$c63bfee0$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com>
Message-ID:

If you want to "packetize" arbitrarily long Unicode text, you don't need any new magic character. Just prepend your packet with a base character used as a syntactic delimiter, one that does not combine with what follows in any normalization.

There's a fine character for that: the TAB control. Except that during transmission it may turn into a SPACE that would combine. (The same will happen with "=", which can combine with a combining slash.)

But look at the normalization data (and consider that Unicode guarantees that no new composing pair starting with the same base character will be added): there are a LOT of suitable base characters in Unicode which you can use as a syntactic delimiter.

Some examples (in the ASCII subset) include the hyphen-minus, the apostrophe-quote, the double quotation mark...

So it's easy to split an arbitrarily long text at an arbitrary character position, even in the middle of any cluster or combining sequence.
It does not matter that this character may create a "cluster" with the following character: your "packetized" stream is still not readable text, but only a transport syntax (just like quoted-printable, or Base64).

You can also freely choose the base character at the end of each packet. Newlines are not safe, as lines may be merged, but as in Base64, "=" is fine to terminate each packet, as are two ASCII quotation marks, and in fact all punctuation and symbols from ASCII (you can even use the ASCII letters and digits).

If your packets have variable lengths, you may need to use escaping, or you may prepend the length (in characters or in combining sequences) of your packet before the expected terminator.

All this is used in MIME for attachments in emails, with the two common transport syntaxes: Quoted-Printable, which uses escaping, and Base64, which does not require any length but requires a distinctive terminator (not used to encode the data part of the "packet") for variable-length "packets".

On Sun, 23 Jun 2019 at 02:35, Sławomir Osipiuk via Unicode <unicode at unicode.org> wrote:

> I assure you, it wasn't very interesting. :-) Headache-y, more like. The
> diacritic thing was completely inapplicable anyway, as all our text was
> plain English. I really don't want to get into what the thing was, because
> it sounds stupider the more I try to explain it. But it got the wheels
> spinning in my head, and now that I've been reading up a lot about Unicode
> and older standards like 2022/6429, it got me thinking whether there might
> already be an elegant solution.
>
> But, as an example I'm making up right now, imagine you want to packetize
> a large string. The packets are not all equal sized, the sizes are
> determined by some algorithm. And the packet boundary may occur between a
> base char and a diacritic. You insert markers into the string at the packet
> boundaries.
> You can then store the string, copy it, display it, or pass it
> to the sending function which will scan the string and know to send the
> next packet when it reaches the marker. And you can now do all that without
> the need to pass around extra metadata (like a list of ints of where the
> packet boundaries are supposed to be) or to re-calculate the boundaries;
> it's still just a big string. If a different application sees the string,
> it will know to completely ignore the packet markers; it can even strip
> them out if it wants to (the canonical equivalent of the noop character is
> the absence of a character).
>
> As should be obvious, I'm not recommending this as good practice.
>
> *From:* Shawn Steele [mailto:Shawn.Steele at microsoft.com]
> *Sent:* Saturday, June 22, 2019 19:57
> *To:* Sławomir Osipiuk; unicode at unicode.org
> *Subject:* RE: Unicode "no-op" Character?
>
> + the list. For some reason the list's reply header is confusing.
>
> *From:* Shawn Steele
> *Sent:* Saturday, June 22, 2019 4:55 PM
> *To:* Sławomir Osipiuk
> *Subject:* RE: Unicode "no-op" Character?
>
> The original comment about putting it between the base character and the
> combining diacritic seems peculiar. I'm having a hard time visualizing how
> that kind of markup could be interesting?
>
> *From:* Unicode *On Behalf Of *Slawomir
> Osipiuk via Unicode
> *Sent:* Saturday, June 22, 2019 2:02 PM
> *To:* unicode at unicode.org
> *Subject:* RE: Unicode "no-op" Character?
>
> I see there is no such character, which I pretty much expected after
> Google didn't help.
>
> The original problem I had was solved long ago but the recent article
> about watermarking reminded me of it, and my question was mostly out of
> curiosity. The task wasn't, strictly speaking, about "padding", but about
> marking - injecting "flag" characters at arbitrary points in a string
> without affecting the resulting visible text.
> I think we ended up using
> ESC, which is a dumb choice in retrospect, though the whole approach was a
> bit of a hack anyway and the process it was for isn't being used anymore.
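
The delimiter-and-length framing suggested at the top of this message can be sketched as follows. This is only an illustration under stated assumptions: the function names packetize and reassemble, the choice of the ASCII quotation mark as delimiter, and the decimal length prefix counted in code points are all hypothetical, and the sketch assumes the framed stream is not normalized or otherwise altered in transit.

```python
# Hypothetical delimiter: an ASCII base character that, following the message
# above, is assumed not to compose with any combining mark that follows it.
DELIM = '"'


def packetize(text, sizes):
    """Split text at arbitrary code-point offsets; frame each piece as <length><DELIM><payload>."""
    frames = []
    pos = 0
    for size in sizes:
        chunk = text[pos:pos + size]
        if not chunk:
            break
        frames.append(f"{len(chunk)}{DELIM}{chunk}")
        pos += len(chunk)
    if pos < len(text):  # whatever is left goes into one final packet
        rest = text[pos:]
        frames.append(f"{len(rest)}{DELIM}{rest}")
    return frames


def reassemble(stream):
    """Rebuild the original text from the concatenated frames."""
    pieces = []
    i = 0
    while i < len(stream):
        j = stream.index(DELIM, i)          # end of the decimal length prefix
        n = int(stream[i:j])                # payload length in code points
        pieces.append(stream[j + 1:j + 1 + n])
        i = j + 1 + n
    return "".join(pieces)


text = "cafe\u0301 au lait"                 # the second packet starts with a combining acute
frames = packetize(text, [4, 3])
restored = reassemble("".join(frames))
```

Because the reader consumes exactly the announced number of code points, the payload may itself contain digits or the delimiter without any escaping; that is the length-prefix variant mentioned above rather than the terminator variant.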