From unicode at unicode.org  Tue Jul  2 23:08:59 2019
From: unicode at unicode.org (=?utf-8?Q?S=C5=82awomir_Osipiuk?= via Unicode)
Date: Wed, 3 Jul 2019 00:08:59 -0400
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com>
Message-ID: <000001d53155$0ad28e50$2077aaf0$@gmail.com>

I don't think you understood me at all. I can packetize a string with any character that is guaranteed not to appear in the text. Suggestions of TAB or EQUALS don't even meet that simple criterion; they often appear in text. They require some kind of special escaping mechanism.

But assume my string has a chosen character for indicating packets. Before I send it out, though, I want to show the string to the user. I can't just throw it into a display method: I'd have TABs or EQUALS or UNKNOWN GLYPHs all over the place, visible to the user. I don't want that. So now I have to make a new copy of the string with my special boundary character removed, then display that copy. Or I could keep the original string, from before I added the packet boundaries, but that works only if I predict or assume ahead of time that I will need to display it, which in reality I might not. Either way it means two copies of the string, one of which might be a waste. More code. More processing.

I can do all that. But why? This thread is about a tool for convenience. I don't "need" it, in the sense that a task is insoluble without it. I'm a programmer; I know how to code. I "want" it, because a tool like that would make some tasks much faster and simpler. Your proposed solution doesn't.

From: Philippe Verdy [mailto:verdy_p at wanadoo.fr]
Sent: Saturday, June 29, 2019 15:47
To: Sławomir Osipiuk
Cc: Shawn Steele; unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

If you want to "packetize" arbitrarily long Unicode text, you don't need any new magic character.
Just prepend your packet with a base character used as a syntactic delimiter, one that does not combine with what follows in any normalization.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org  Wed Jul  3 03:49:01 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 3 Jul 2019 10:49:01 +0200
Subject: Unicode "no-op" Character?
In-Reply-To: <000001d53155$0ad28e50$2077aaf0$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com>
Message-ID:

On Wed, 3 Jul 2019 at 06:09, Sławomir Osipiuk wrote:

> I don't think you understood me at all. I can packetize a string with any character that is guaranteed not to appear in the text.

Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in the text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all. Unicode and ISO **require** that any proposed character can be used in text without limitation. Logically it would be rejected, because your character would not be usable at all from the start.

So you have no choice: you must use some transport format for your "packetizing", just like what is used in MIME for emails, in HTTP(S) for streaming, or in internationalized domain names. For your escaping mechanism you have a very large choice already of characters considered special only for your chosen transport syntax.

Your goal shows a chicken-and-egg problem. It is not solvable without creating self-contradictions immediately (and if you attempt to add some restriction to avoid the contradiction, then you'll fall on cases where you can no longer transport your message, and your protocol will become unusable).
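Concretely, the classic byte-stuffing approach lets a transport carry *any* Unicode text without reserving a character: frame packets with STX/ETX and escape the framing characters inside the payload with DLE. A sketch (my own illustration, not code from the thread; the function names are invented):

```python
STX, ETX, DLE = "\x02", "\x03", "\x10"

def frame(text: str) -> str:
    """Wrap text in STX ... ETX, escaping any framing characters in the payload."""
    escaped = []
    for ch in text:
        if ch in (STX, ETX, DLE):
            escaped.append(DLE)  # escape prefix, as in classic serial protocols
        escaped.append(ch)
    return STX + "".join(escaped) + ETX

def unframe(packet: str) -> str:
    """Inverse of frame(); assumes a single well-formed packet."""
    assert packet[0] == STX and packet[-1] == ETX
    body, out, i = packet[1:-1], [], 0
    while i < len(body):
        if body[i] == DLE:
            i += 1  # drop the escape, keep the escaped character
        out.append(body[i])
        i += 1
    return "".join(out)
```

Because every in-band control character is escaped, `unframe(frame(s)) == s` holds even when `s` itself contains STX, ETX, or DLE; no code point has to be banned from the text.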
From unicode at unicode.org  Wed Jul  3 03:55:23 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 3 Jul 2019 10:55:23 +0200
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com>
Message-ID:

Also consider that C0 controls (like STX and ETX) can already be used for packetizing, but immediately comes the need for escaping (DLE has been used for that goal, placed just before the character to preserve in the stream content, notably before DLE itself, or STX and ETX). There's then no need at all for any new character in Unicode. But if your protocol does not allow any form of escaping, then it is broken, as it cannot transport **all** valid Unicode text.

On Wed, 3 Jul 2019 at 10:49, Philippe Verdy wrote:

> On Wed, 3 Jul 2019 at 06:09, Sławomir Osipiuk wrote:
>
>> I don't think you understood me at all. I can packetize a string with any character that is guaranteed not to appear in the text.
>
> Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in the text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all. Unicode and ISO **require** that any proposed character can be used in text without limitation. Logically it would be rejected, because your character would not be usable at all from the start.
>
> So you have no choice: you must use some transport format for your "packetizing", just like what is used in MIME for emails, in HTTP(S) for streaming, or in internationalized domain names.
>
> For your escaping mechanism you have a very large choice already of characters considered special only for your chosen transport syntax.
> Your goal shows a chicken-and-egg problem. It is not solvable without creating self-contradictions immediately (and if you attempt to add some restriction to avoid the contradiction, then you'll fall on cases where you can no longer transport your message, and your protocol will become unusable).

From unicode at unicode.org  Wed Jul  3 04:31:47 2019
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Wed, 3 Jul 2019 11:31:47 +0200
Subject: Aw: Re: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com>
Message-ID:

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org  Wed Jul  3 08:02:47 2019
From: unicode at unicode.org (via Unicode)
Date: Wed, 3 Jul 2019 15:02:47 +0200
Subject: Unihan definitions bug report
Message-ID: <07681CCD-2B36-41B1-BB9F-009F1FDC02B2@ouvaton.org>

I have prepared a Markdown document, available on GitHub, which is a list of corrections of typos that I found using a spellchecker in the kDefinition field of the Unihan_Readings.txt data file. Unfortunately, there are several typos I don't know the best correction for, and I have flagged those with a question mark. I'm hoping to get some feedback through the GitHub repository issues or from this list, and will then submit a bug report to the Unicode office using the Contact Form, unless there is a more appropriate channel I'm not aware of...

Best regards,

--Michel MARIANI

From unicode at unicode.org  Wed Jul  3 10:44:30 2019
From: unicode at unicode.org (=?UTF-8?Q?S=C5=82awomir_Osipiuk?= via Unicode)
Date: Wed, 3 Jul 2019 11:44:30 -0400
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com>
Message-ID: <002701d531b6$349a6440$9dcf2cc0$@gmail.com>

I'm frustrated at how badly you seem to be missing the point. There is nothing impossible nor self-contradictory here. There is only the matter that Unicode requires all scalar values to be preserved during interchange. This is in many ways a good idea, and I don't expect it to change, but something else would be possible if this requirement were explicitly dropped for a well-defined small subset of characters (even just one character). A modern-day SYN.

Let's say it's U+000F. The standard takes my proposal and makes it a discardable, null-displayable character. What does this mean?

U+000F may appear in any text. It has no (external) semantic value. But it may appear. It may appear a lot.

Display routines (which are already dealing with combining, ligaturing, non-/joiners, variations, and initial/medial/final forms) understand that U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move to the next character. Simple.

Security gateways filter it out completely, as a matter of best practice and security-in-depth.

A process, let's call it Process W, adds a bunch of U+000F to a string it received, or built, or a user entered via keyboard. Maybe it's to packetize. Maybe to mark every word that is an anagram of the name of a famous 19th-century painter, or that represents a pizza topping. Maybe something else. This is a versatile character. Process W is done adding U+000F to the string. It stores it in a UTF-8 encoded database field. Encoding isn't a problem. The database is happy.

Now Process X runs. Process X is meant to work with Process W and is well aware of how U+000F is used. It reads the string from the database. It sees U+000F and interprets it.
It chops the string into packets, or does a websearch for each famous painter, or it orders pizza. The private meaning of U+000F is known to both Process X and Process W. There is useful information encoded in-band, within a limited private context.

But now we have Process Y. Process Y doesn't care about packets or painters or pizza. Process Y runs outside of the private context that X and W had. Process Y translates strings into Morse code for transmission. As part of that, it replaces common words with abbreviations. Process Y doesn't interpret U+000F. Why would it? It has no semantic value to Process Y.

Process Y reads the string from the database. Internally, it clears all instances of U+000F from the string. They're just taking up space. They're meaningless to Y. It compiles the Morse code sequence into an audio file.

But now we have Process Z. Process Z wants to take a string and mark every instance of five contiguous Latin consonants. It scrapes the database looking for text strings. It finds the string Process W created and marked. Z has no obligation to W. It's not part of that private context. Process Z clears all instances of U+000F it finds, then inserts its own wherever it finds five-consonant clusters. It stores its results in a UTF-16LE text file. It's allowed to do that.

Nothing impossible happened here. Let's summarize:

Processes W and X established a private meaning for U+000F by agreement and interacted based on that meaning.

Process Y ignored U+000F completely because it assigned no meaning to it.

Process Z assigned a completely new meaning to U+000F. That's permitted because U+000F is special, is guaranteed to have no semantics without private agreement, and doesn't need to be preserved.

There is no need to escape anything. Escaping is used when a character must have more than one meaning (i.e. it is overloaded, as when it is both text and markup). U+000F only gets one meaning in any context.
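In code, the idea looks something like this (a toy sketch only; U+000F as the no-op, the function names, and the every-four-characters packetizing rule are all invented for illustration):

```python
NOOP = "\u000f"  # the hypothetical discardable character from this thread

def packetize(text: str, n: int = 4) -> str:
    """Toy 'Process W': drop a no-op marker after every n characters."""
    return NOOP.join(text[i:i + n] for i in range(0, len(text), n))

def strip_noop(text: str) -> str:
    """What a display routine, gateway, or 'Process Y' would do: drop every no-op."""
    return text.replace(NOOP, "")
```

Process W calls `packetize`, Process X splits on `NOOP` to recover the packets, and any uninvolved process just calls `strip_noop` and sees the original text, e.g. `strip_noop(packetize("hello world")) == "hello world"`.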
In a new context, the meaning gets overridden, not overloaded. That's what makes it special.

I don't expect to see any of this in official Unicode. But I take exception to the idea that I'm suggesting something impossible.

From: Philippe Verdy [mailto:verdy_p at wanadoo.fr]
Sent: Wednesday, July 03, 2019 04:49
To: Sławomir Osipiuk
Cc: unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in the text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all.

From unicode at unicode.org  Wed Jul  3 11:03:04 2019
From: unicode at unicode.org (jenkins via Unicode)
Date: Wed, 03 Jul 2019 10:03:04 -0600
Subject: Unihan definitions bug report
In-Reply-To: <07681CCD-2B36-41B1-BB9F-009F1FDC02B2@ouvaton.org>
References: <07681CCD-2B36-41B1-BB9F-009F1FDC02B2@ouvaton.org>
Message-ID: <4D97E4D0-A8DA-40CB-8F8C-7F5E35E7F9C8@apple.com>

Thank you. I've forwarded this to Unihan experts and hopefully we'll get some feedback to you very soon.

> On Jul 3, 2019, at 7:02 AM, via Unicode wrote:
>
> I have prepared a Markdown document, available on GitHub, which is a list of corrections of typos that I found using a spellchecker in the kDefinition field of the Unihan_Readings.txt data file. Unfortunately, there are several typos I don't know the best correction for, and I have flagged those with a question mark. I'm hoping to get some feedback through the GitHub repository issues or from this list, and will then submit a bug report to the Unicode office using the Contact Form, unless there is a more appropriate channel I'm not aware of...
>
> Best regards,
>
> --Michel MARIANI
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org  Wed Jul  3 12:32:36 2019
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Wed, 3 Jul 2019 19:32:36 +0200
Subject: Unicode "no-op" Character?
In-Reply-To: <002701d531b6$349a6440$9dcf2cc0$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com> <002701d531b6$349a6440$9dcf2cc0$@gmail.com>
Message-ID:

Your goal is not achievable. We can't wave a magic wand and suddenly (or even within decades) have all processes everywhere ignore U+000F in all processing. This thread is pointless and should be terminated.

Mark

On Wed, Jul 3, 2019 at 5:48 PM Sławomir Osipiuk via Unicode <unicode at unicode.org> wrote:

> I'm frustrated at how badly you seem to be missing the point. There is nothing impossible nor self-contradictory here. There is only the matter that Unicode requires all scalar values to be preserved during interchange. This is in many ways a good idea, and I don't expect it to change, but something else would be possible if this requirement were explicitly dropped for a well-defined small subset of characters (even just one character). A modern-day SYN.
>
> Let's say it's U+000F. The standard takes my proposal and makes it a discardable, null-displayable character. What does this mean?
>
> U+000F may appear in any text. It has no (external) semantic value. But it may appear. It may appear a lot.
>
> Display routines (which are already dealing with combining, ligaturing, non-/joiners, variations, and initial/medial/final forms) understand that U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move to the next character. Simple.
>
> Security gateways filter it out completely, as a matter of best practice and security-in-depth.
>
> A process, let's call it Process W, adds a bunch of U+000F to a string it received, or built, or a user entered via keyboard. Maybe it's to packetize. Maybe to mark every word that is an anagram of the name of a famous 19th-century painter, or that represents a pizza topping. Maybe something else. This is a versatile character. Process W is done adding U+000F to the string. It stores it in a UTF-8 encoded database field. Encoding isn't a problem. The database is happy.
>
> Now Process X runs. Process X is meant to work with Process W and is well aware of how U+000F is used. It reads the string from the database. It sees U+000F and interprets it. It chops the string into packets, or does a websearch for each famous painter, or it orders pizza. The private meaning of U+000F is known to both Process X and Process W. There is useful information encoded in-band, within a limited private context.
>
> But now we have Process Y. Process Y doesn't care about packets or painters or pizza. Process Y runs outside of the private context that X and W had. Process Y translates strings into Morse code for transmission. As part of that, it replaces common words with abbreviations. Process Y doesn't interpret U+000F. Why would it? It has no semantic value to Process Y.
>
> Process Y reads the string from the database. Internally, it clears all instances of U+000F from the string. They're just taking up space. They're meaningless to Y. It compiles the Morse code sequence into an audio file.
>
> But now we have Process Z. Process Z wants to take a string and mark every instance of five contiguous Latin consonants. It scrapes the database looking for text strings. It finds the string Process W created and marked. Z has no obligation to W. It's not part of that private context. Process Z clears all instances of U+000F it finds, then inserts its own wherever it finds five-consonant clusters.
> It stores its results in a UTF-16LE text file. It's allowed to do that.
>
> Nothing impossible happened here. Let's summarize:
>
> Processes W and X established a private meaning for U+000F by agreement and interacted based on that meaning.
>
> Process Y ignored U+000F completely because it assigned no meaning to it.
>
> Process Z assigned a completely new meaning to U+000F. That's permitted because U+000F is special, is guaranteed to have no semantics without private agreement, and doesn't need to be preserved.
>
> There is no need to escape anything. Escaping is used when a character must have more than one meaning (i.e. it is overloaded, as when it is both text and markup). U+000F only gets one meaning in any context. In a new context, the meaning gets overridden, not overloaded. That's what makes it special.
>
> I don't expect to see any of this in official Unicode. But I take exception to the idea that I'm suggesting something impossible.
>
> *From:* Philippe Verdy [mailto:verdy_p at wanadoo.fr]
> *Sent:* Wednesday, July 03, 2019 04:49
> *To:* Sławomir Osipiuk
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Unicode "no-op" Character?
>
> Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in the text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all.

From unicode at unicode.org  Wed Jul  3 12:47:16 2019
From: unicode at unicode.org (=?UTF-8?Q?S=C5=82awomir_Osipiuk?= via Unicode)
Date: Wed, 3 Jul 2019 13:47:16 -0400
Subject: Unicode "no-op" Character?
In-Reply-To:
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com> <002701d531b6$349a6440$9dcf2cc0$@gmail.com>
Message-ID: <005a01d531c7$5b4a5a30$11df0e90$@gmail.com>

The fact that this would require a change that is unlikely to occur is something I have stated repeatedly; it is pointless to tell me that. The rest of the thread, after my initial question was answered, was a thought experiment. While I strongly disagree that such posts are "pointless" (reading through the archives of this mailing list, it is those ideas that have fascinated me the most and that I found most engaging and enlightening), I admit I'm new here, so I will defer.

Is my idea unrealistic at this point in time? Yes. I have admitted so. Is my idea impossible, useless, or contradictory? Not at all.

From: Mark Davis ☕️ [mailto:mark at macchiato.com]
Sent: Wednesday, July 03, 2019 13:33
To: Sławomir Osipiuk
Cc: verdy_p; unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

Your goal is not achievable. We can't wave a magic wand and suddenly (or even within decades) have all processes everywhere ignore U+000F in all processing. This thread is pointless and should be terminated.

From unicode at unicode.org  Wed Jul  3 13:06:06 2019
From: unicode at unicode.org (Rebecca Bettencourt via Unicode)
Date: Wed, 3 Jul 2019 11:06:06 -0700
Subject: Unicode "no-op" Character?
In-Reply-To: <002701d531b6$349a6440$9dcf2cc0$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com> <002701d531b6$349a6440$9dcf2cc0$@gmail.com>
Message-ID:

On Wed, Jul 3, 2019 at 8:47 AM Sławomir Osipiuk via Unicode <unicode at unicode.org> wrote:

> Security gateways filter it out completely, as a matter of best practice and security-in-depth.
>
> A process, let's call it Process W, adds a bunch of U+000F to a string it received, or built, or a user entered via keyboard. ...
> It stores it in a UTF-8 encoded database field...

And the database driver filters out the U+000F completely, as a matter of best practice and security-in-depth. You can't say "this character should be ignored everywhere" and "this character should be preserved everywhere" at the same time. That's the contradiction.

From unicode at unicode.org  Wed Jul  3 13:22:55 2019
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Wed, 3 Jul 2019 11:22:55 -0700
Subject: Unicode "no-op" Character?
In-Reply-To: <005a01d531c7$5b4a5a30$11df0e90$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com> <002701d531b6$349a6440$9dcf2cc0$@gmail.com> <005a01d531c7$5b4a5a30$11df0e90$@gmail.com>
Message-ID: <1347a6e7-c0d1-6170-5964-9442026c42c0@sonic.net>

On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote:
>
> Is my idea impossible, useless, or contradictory? Not at all.
>
What you are proposing is in the realm of higher-level protocols. You could develop such a protocol, and then write processes that honored it, or try to convince others to write processes to honor it.
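Such a protocol might look like this in miniature (a toy sketch; the choice of U+E000 from the Private Use Area, the function names, and the word-marking rule are arbitrary illustrations, not anything the standard defines):

```python
MARK = "\ue000"  # U+E000, a PUA code point the two processes agree on privately

def mark_words(text, predicate):
    """'Process W': prefix each word satisfying predicate with the agreed marker."""
    return " ".join((MARK + w) if predicate(w) else w for w in text.split(" "))

def find_marked(text):
    """'Process X': recover the marked words, stripping the marker."""
    return [w[1:] for w in text.split(" ") if w.startswith(MARK)]
```

The marker's meaning exists only between these two consenting processes; any other process that handles the text is under no obligation to preserve or interpret it, which is exactly what makes this a higher-level protocol rather than a property of the encoding.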
You could use PUA characters, or non-characters, or existing control codes -- the implications for use of any of those would be slightly different in practice, but in any case it would be an HLP.

But your idea is not a feasible part of the Unicode Standard. There are no "discardable" characters in Unicode -- *by definition*. The discussion of "ignorable" characters in the standard is nuanced and complicated, because there are some characters which are carefully designed to be transparent to some well-specified processes, but not to others. But no characters in the standard are (or can be) ignorable by *all* processes, nor can a "discardable" character ever be defined as part of the standard.

The fact that there are a myriad of processes implemented (and distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) conversion to/from UTF-16 by integral type conversion is a simple existence proof that U+000F is never, ever, ever, ever going to be defined to be "discardable" in the Unicode Standard.

--Ken

From unicode at unicode.org  Wed Jul  3 16:44:26 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Wed, 3 Jul 2019 17:44:26 -0400
Subject: Unicode "no-op" Character?
In-Reply-To: <002701d531b6$349a6440$9dcf2cc0$@gmail.com>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com> <002701d531b6$349a6440$9dcf2cc0$@gmail.com>
Message-ID: <597c5e47-6b36-db19-7f4e-1e884fe7f109@kli.org>

Um... How could you be sure that Process X would get the no-ops that Process W wrote? After all, it's *discardable*, like you said, and the database programs and libraries aren't in on the secret. The database API functions might well strip it out, because it carries no meaning to them.
Unless you can count on _certain_ programs not discarding it, and then you'd need either specialty libraries or some kind of registry or terminology for "this program does NOT strip no-ops" vs. ones that do... But then they wouldn't be discardable, would they? Not by non-discarding programs. Which would have to have ways to pass them around between themselves.

Moreover, as you say, what about when Process Z (or its companions) comes along and is using THE SAME MECHANISM for something utterly different? How does it know that Process W wasn't writing no-ops for it, but was writing them for Process X? And of course, Z will trash them and insert its own there, and when Process X comes to read it, they won't be there. You'd need to make sure that NOBODY is allowed to touch the string between *pairs* of generators and consumers of no-ops, specifically designated for each other.

Yes, this is about consensual acts between responsible processes W and X, but that's exactly what the PUA is for: being assigned meaning between consenting processes. And they are not discardable by non-consenting processes, precisely because they mean something to someone. If your no-ops carry meaning, they are going to need to be preserved and passed around and not thrown away. If they carry no meaning, why are you dealing with them? Yes, PUA characters are annoying and break up grapheme clusters and stuff. But they're the only way to do what you're trying to do.

~mark

On 7/3/19 11:44 AM, Sławomir Osipiuk via Unicode wrote:
>
> A process, let's call it Process W, adds a bunch of U+000F to a string it received, or built, or a user entered via keyboard. Maybe it's to packetize. Maybe to mark every word that is an anagram of the name of a famous 19th-century painter, or that represents a pizza topping. Maybe something else. This is a versatile character. Process W is done adding U+000F to the string. It stores it in a UTF-8 encoded database field. Encoding isn't a problem.
> The database is happy.
>
> Now Process X runs. Process X is meant to work with Process W and is well aware of how U+000F is used. It reads the string from the database. It sees U+000F and interprets it. It chops the string into packets, or does a websearch for each famous painter, or it orders pizza. The private meaning of U+000F is known to both Process X and Process W. There is useful information encoded in-band, within a limited private context.
>
> But now we have Process Y. Process Y doesn't care about packets or painters or pizza. Process Y runs outside of the private context that X and W had. Process Y translates strings into Morse code for transmission. As part of that, it replaces common words with abbreviations. Process Y doesn't interpret U+000F. Why would it? It has no semantic value to Process Y.
>
> Process Y reads the string from the database. Internally, it clears all instances of U+000F from the string. They're just taking up space. They're meaningless to Y. It compiles the Morse code sequence into an audio file.
>
> But now we have Process Z. Process Z wants to take a string and mark every instance of five contiguous Latin consonants. It scrapes the database looking for text strings. It finds the string Process W created and marked. Z has no obligation to W. It's not part of that private context. Process Z clears all instances of U+000F it finds, then inserts its own wherever it finds five-consonant clusters. It stores its results in a UTF-16LE text file. It's allowed to do that.
>
> Nothing impossible happened here. Let's summarize:
>
> Processes W and X established a private meaning for U+000F by agreement and interacted based on that meaning.
>
> Process Y ignored U+000F completely because it assigned no meaning to it.
>
> Process Z assigned a completely new meaning to U+000F.
> That's permitted because U+000F is special, is guaranteed to have no semantics without private agreement, and doesn't need to be preserved.
>
> There is no need to escape anything. Escaping is used when a character must have more than one meaning (i.e. it is overloaded, as when it is both text and markup). U+000F only gets one meaning in any context. In a new context, the meaning gets overridden, not overloaded. That's what makes it special.
>
> I don't expect to see any of this in official Unicode. But I take exception to the idea that I'm suggesting something impossible.
>
> *From:* Philippe Verdy [mailto:verdy_p at wanadoo.fr]
> *Sent:* Wednesday, July 03, 2019 04:49
> *To:* Sławomir Osipiuk
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Unicode "no-op" Character?
>
> Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in the text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all.

From unicode at unicode.org  Wed Jul  3 16:51:29 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Wed, 3 Jul 2019 17:51:29 -0400
Subject: Unicode "no-op" Character?
In-Reply-To: <20190623092450.54576fdf@JRWUBU2>
References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <20190623092450.54576fdf@JRWUBU2>
Message-ID: <472fbf05-8ad5-c462-bacd-8e0b1bb7c45e@kli.org>

I think the idea being considered at the outset was not so complex as these (and indeed, the point of the character was to avoid making these kinds of decisions). There was a desire, for some reason, to be able to chop up a string into equal-length pieces or something, and some of those divisions might wind up between bases and diacritics or who knows where else.
Rather than have to work out acceptable places to put the characters, the request was for a no-op character that could safely be plopped *anywhere*, even in the middle of combinations like that.

~mark

On 6/23/19 4:24 AM, Richard Wordingham via Unicode wrote:
> On Sat, 22 Jun 2019 23:56:50 +0000 Shawn Steele via Unicode wrote:
>
>> + the list. For some reason the list's reply header is confusing.
>>
>> From: Shawn Steele
>> Sent: Saturday, June 22, 2019 4:55 PM
>> To: Sławomir Osipiuk
>> Subject: RE: Unicode "no-op" Character?
>>
>> The original comment about putting it between the base character and the combining diacritic seems peculiar. I'm having a hard time visualizing how that kind of markup could be interesting?
>
> There are a number of possible interesting scenarios:
>
> 1) Chopping the string into user-perceived characters. For example, the Khmer sequences of COENG plus letter are named sequences. Akin to this is identifying resting places for a simple cursor, e.g. allowing it to be positioned between a base character and a spacing, unreordered subscript. (This last possibility overlaps with rendering.)
>
> 2) Chopping the string into collating elements. (This can require renormalisation, and may raise a rendering issue with HarfBuzz, where renormalisation is required to get marks into a suitable order for shaping. I suspect no-op characters would disrupt this renormalisation; CGJ may legitimately be used to affect rendering this way, even though it is supposed to have no other effect* on rendering.)
>
> 3) Chopping the string into default grapheme clusters. That separates a coeng from the following character with which it interacts.
>
> *Is a Unicode-compliant *renderer* allowed to distinguish diaeresis from the umlaut mark?
>
> Richard.

From unicode at unicode.org  Wed Jul  3 16:54:31 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Wed, 3 Jul 2019 17:54:31 -0400
Subject: Unicode "no-op" Character?
In-Reply-To: <1347a6e7-c0d1-6170-5964-9442026c42c0@sonic.net> References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <004301d5295a$4213ffa0$c63bfee0$@gmail.com> <000001d53155$0ad28e50$2077aaf0$@gmail.com> <002701d531b6$349a6440$9dcf2cc0$@gmail.com> <005a01d531c7$5b4a5a30$11df0e90$@gmail.com> <1347a6e7-c0d1-6170-5964-9442026c42c0@sonic.net> Message-ID: What you're asking for, then, is completely possible and achievable, but not in the Unicode Standard. It's out of scope for Unicode, it sounds like. You've said you realize it won't happen in Unicode, but it still can happen. Go forth and implement it, then: make your higher-level protocol and show its usefulness and get the industry to use and honor it because of how handy it is, and best of luck with that. ~mark On 7/3/19 2:22 PM, Ken Whistler via Unicode wrote: > > > On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote: >> >> Is my idea impossible, useless, or contradictory? Not at all. >> > What you are proposing is in the realm of higher-level protocols. > > You could develop such a protocol, and then write processes that > honored it, or try to convince others to write processes to honor it. > You could use PUA characters, or non-characters, or existing control > codes -- the implications for use of any of those would be slightly > different, in practice, but in any case would be an HLP. > > But your idea is not a feasible part of the Unicode Standard. There > are no "discardable" characters in Unicode -- *by definition*. The > discussion of "ignorable" characters in the standard is nuanced and > complicated, because there are some characters which are carefully > designed to be transparent to some, well-specified processes, but not > to others. But no characters in the standard are (or can be) ignorable > by *all* processes, nor can a "discardable" character ever be defined > as part of the standard.
> > The fact that there are a myriad of processes implemented (and > distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) > conversion to/from UTF-16 by integral type conversion is a simple > existence proof that U+000F is never, ever, ever, ever going to be > defined to be "discardable" in the Unicode Standard. > > --Ken > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 3 18:20:24 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 4 Jul 2019 00:20:24 +0100 Subject: Unicode "no-op" Character? In-Reply-To: <472fbf05-8ad5-c462-bacd-8e0b1bb7c45e@kli.org> References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <20190623092450.54576fdf@JRWUBU2> <472fbf05-8ad5-c462-bacd-8e0b1bb7c45e@kli.org> Message-ID: <20190704002024.4a0600fe@JRWUBU2> On Wed, 3 Jul 2019 17:51:29 -0400 "Mark E. Shoulson via Unicode" wrote: > I think the idea being considered at the outset was not so complex as > these (and indeed, the point of the character was to avoid making > these kinds of decisions). Shawn Steele appeared to be claiming that there was no good, interesting reason for separating base character and combining mark. I was refuting that notion. Natural text boundaries can get very messy - some languages have word boundaries that can be *within* an indecomposable combining mark. Richard. From unicode at unicode.org Wed Jul 3 18:58:37 2019 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Wed, 3 Jul 2019 23:58:37 +0000 Subject: Unicode "no-op" Character? 
In-Reply-To: <20190704002024.4a0600fe@JRWUBU2> References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> <20190623092450.54576fdf@JRWUBU2> <472fbf05-8ad5-c462-bacd-8e0b1bb7c45e@kli.org> <20190704002024.4a0600fe@JRWUBU2> Message-ID: I think you're overstating my concern :) I meant that those things tend to be particular to a certain context and often aren't interesting for interchange. A text editor might find it convenient to place word boundaries in the middle of something another part of the system thinks is a single unit to be rendered. At the same time, a rendering engine might find it interesting that there's an "ff" together and want to mark it to be shown as a ligature, though that text editor wouldn't be keen on that at all. As has been said, these are private mechanisms for things that individual processes find interesting. It's not useful to mark those for interchange, as the text editor's word-breaking marks would interfere with the graphics engine's glyph-breaking marks. Not to mention the transmission buffer size marks originally mentioned, which could be anywhere. The "right" thing to do here is to use an internal higher-level mechanism to keep track of these things however the component needs. That can even be interchanged with another component designed to the same principles, via mechanisms like the PUA. However, those components can't expect their private mechanisms to be useful or harmless to other processes. Even more complicated is that, as pointed out by others, it's pretty much impossible to say "these n codepoints should be ignored and have no meaning" because some process would try to use codepoints 1-3 for some private meaning. Another would use codepoint 1 for their own thing, and there'd be a conflict. As a thought experiment, I think it's certainly decent to ask the question "could such a mechanism be useful?"
It's an intriguing thought and a decent hypothesis that this kind of system could be privately useful to an application. I also think that the conversation has pretty much proven that such a system is mathematically impossible. (You can't have a "private" no-meaning codepoint that won't conflict with other "private" uses in a public space). It might be worth noting that this kind of thing used to be fairly common in early computing. Word processors would inject a "CTRL-I" token to toggle italics on or off. Old printers used to use sequences to define the start of bold or italic or underlined or whatever sequences. Those were private and pseudo-private mechanisms that were used internally and/or documented for others that wanted to interoperate with their systems. (The printer folks would tell the word processors how to make italics happen, then other printer folks would use the same or similar mechanisms for compatibility - except for the dude that didn't get the memo and made their own scheme.) Unicode was explicitly intended *not* to encode any of that kind of markup, and, instead, be "plain text," leaving other interesting metadata to other higher-level protocols. Whether those be word breaking, sentence parsing, formatting, buffer sizing or whatever. -Shawn -----Original Message----- From: Unicode On Behalf Of Richard Wordingham via Unicode Sent: Wednesday, July 3, 2019 4:20 PM To: unicode at unicode.org Subject: Re: Unicode "no-op" Character? On Wed, 3 Jul 2019 17:51:29 -0400 "Mark E. Shoulson via Unicode" wrote: > I think the idea being considered at the outset was not so complex as > these (and indeed, the point of the character was to avoid making > these kinds of decisions). Shawn Steele appeared to be claiming that there was no good, interesting reason for separating base character and combining mark. I was refuting that notion.
Natural text boundaries can get very messy - some languages have word boundaries that can be *within* an indecomposable combining mark. Richard. From unicode at unicode.org Thu Jul 4 14:34:51 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 4 Jul 2019 13:34:51 -0600 Subject: Unicode "no-op" Character? Message-ID: <000601d5329f$8c9e6f80$a5db4e80$@ewellic.org> Shawn Steele wrote: > Even more complicated is that, as pointed out by others, it's pretty > much impossible to say "these n codepoints should be ignored and have > no meaning" because some process would try to use codepoints 1-3 for > some private meaning. Another would use codepoint 1 for their own > thing, and there'd be a conflict. That's pretty much what happened with NUL. It was originally intended (long, long before Unicode) to be ignorable and have no meaning, but then other processes were designed that gave it specific meaning, and that was pretty much that. While the Unix/C "end of string" convention was not the only case in which NUL was hijacked, it is certainly the best-known, and the greatest impediment to any current attempt to use it with its original meaning. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Jul 9 13:50:06 2019 From: unicode at unicode.org (Matt Black via Unicode) Date: Tue, 9 Jul 2019 19:50:06 +0100 Subject: Green ecofriendly emoji In-Reply-To: <927504AF-FA78-40A4-BE06-BB53B1784BB4@ninjatune.net> References: <927504AF-FA78-40A4-BE06-BB53B1784BB4@ninjatune.net> Message-ID: <6A262CB3-2E92-478A-9BC6-8716D936F9A2@ninjatune.net> Hi there I'm new to the Emoji discussion world. I'm a musician on the label Ninja Tune. Also part of Music Declares Climate Emergency, the UK music biz group. I want to propose some green eco-friendly emoji. These could just be green versions of existing emoji. For example, a green thumbs-up. This would be a great start to allow people to respond to all kinds of eco-friendly positive action.
Another few ideas: A green lightning bolt ⚡ for renewable energy. A green sun ☀️ for solar power. A green flexed bicep 💪 for green activism. A green thanks 🙏. A green OK hand 👌. After reading the stuff about how to submit emoji it wasn't clear to me how to suggest just additional colours. I feel that putting a green block next to a normal thumbs-up for example is not as good as an actual green thumbs-up! As this is just a suggestion to add some coloured versions of existing emoji, it is in a slightly different category to normal applications. Some quick searches show that a phrase such as 'green-thumb' is used. (2900 million hits, the other terms also do well) I don't want to underline the cause aspect but being eco friendly is something most people can agree on today. My question is: how should I proceed with this suggestion to add green versions of say 5 emoji? thanks Matt Black, Coldcut/Ninja Tune > On 8 Jul 2019, at 23:45, Matt Black wrote: > > Most of the criteria we meet very well. > https://unicode.org/emoji/proposals.html#selection_factors > Green thumbs up 👍 see below > 👌 OK hand > 🙏 > 💪 > ⚡ Lightning=green energy > > > > > > > Green thumbs up 👍 > Thumbs up or just green thumb! > > > > > an exceptional aptitude for gardening or for growing plants successfully: Houseplants provide much pleasure for the city dweller with a green thumb. lightning=green energy > > > 🙏 > Green-thanks > ☀️ > Solar-power -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PastedGraphic-22.png Type: image/png Size: 99971 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PastedGraphic-24.png Type: image/png Size: 59218 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: PastedGraphic-25.png Type: image/png Size: 60961 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PastedGraphic-26.png Type: image/png Size: 58322 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: PastedGraphic-27.png Type: image/png Size: 44766 bytes Desc: not available URL: From unicode at unicode.org Tue Jul 9 13:59:15 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 9 Jul 2019 20:59:15 +0200 Subject: Numeric group separators and Bidi Message-ID: Is there a narrow space usable as a numeric group separator, and that also has the same bidi property as digits (i.e. neutral outside the span of digits and separators, but inheriting the implied directionality of the previous digit)? I can't find a way to use narrow spaces instead of punctuation signs (dot or comma) in Arabic/Hebrew, for example to present tabular numeric data in a really language-neutral way. In Arabic/Hebrew we need to use punctuation as group separators because spaces don't work (not even the narrow non-breaking space U+202F used in French and recommended in ISO), but then these punctuation separators are interpreted differently (notably between French and English, where the interpretations of dot and comma are swapped). Note that: - the "figure space" is not suitable (as it has the same width as digits and is used as a "filler" in tabular data; but it also does not have the correct bidi behavior, as it does not have the same bidi properties as digits). - the "thin space" is not suitable (it is breakable) - the "narrow non-breaking space" U+202F (used in French and currently in ISO) is not suitable, or maybe I'm wrong and its presence is still neutral between groups of digits, where it inherits the properties of the previous digit, but still does not enforce the bidi direction of the whole span of digits.
Can you tell me if U+202F is really suitable? I made some tests with various text renderers, and some of them "break" the group of digits by reordering these groups, changing completely the rendered value (units become thousands or more, and thousands become units...). But maybe these are bugs in renderers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jul 9 14:24:26 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 09 Jul 2019 22:24:26 +0300 Subject: Numeric group separators and Bidi In-Reply-To: (message from Philippe Verdy via Unicode on Tue, 9 Jul 2019 20:59:15 +0200) References: Message-ID: <83h87v3tat.fsf@gnu.org> > Date: Tue, 9 Jul 2019 20:59:15 +0200 > From: Philippe Verdy via Unicode > > I can't find a way to use narrow spaces instead of punctuation signs (dot or comma) for example in > Arabic/Hebrew, for example to present tabular numeric data in a really language-neutral way. In Arabic/Hebrew > we need to use punctuations as group separators because spaces don't work (not even the narrow > non-breaking space U+202F used in French and recommended in ISO), but then these punctuation > separators are interpreted differently (notably between French and English where the interpretation dot and > comma are swapped) Please show an example and describe how you would like it to look on display. I don't think I understand the use case(s). From unicode at unicode.org Tue Jul 9 15:10:00 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Tue, 9 Jul 2019 22:10:00 +0200 Subject: Numeric group separators and Bidi In-Reply-To: References: Message-ID: Hi Philippe, What do you mean U+202F doesn't work for you? Whereas the logical string "hebrew 123456 hebrew" indeed shows the number incorrectly as "456 123", that's not the case with U+202F instead of the space: the number then shows up as "123 456" as expected.
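Whether a given space keeps a run of digits together here comes down to its bidi class in the Unicode Character Database; a quick check with Python's bundled unicodedata module:

```python
import unicodedata

# Bidi classes of candidate group separators, straight from the UCD.
# CS (Common Number Separator) keeps adjacent digit groups together in
# RTL context; WS (Whitespace) lets the bidi algorithm reorder them.
candidates = {
    "U+0020 SPACE": "\u0020",
    "U+00A0 NO-BREAK SPACE": "\u00a0",
    "U+2007 FIGURE SPACE": "\u2007",
    "U+2009 THIN SPACE": "\u2009",
    "U+202F NARROW NO-BREAK SPACE": "\u202f",
}
for name, ch in candidates.items():
    print(f"{name}: {unicodedata.bidirectional(ch)}")
```

Of the spaces discussed in this thread, only U+00A0 and U+202F report CS; the thin space and figure space report WS, so the bidi algorithm may reorder digit groups around them.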
I think you need to pick a character whose BiDi class is "Common Number Separator", see e.g. https://www.compart.com/en/unicode/bidiclass/CS for a list of such characters including U+00A0 no-break space and U+202F narrow no-break space. This suggests to me that U+202F is a correct choice if you need the look of a narrow space. Another possibility is to embed the number in a LRI...PDI block, as e.g. https://unicode.org/cldr/utility/bidic.jsp does with the "1?3%" fragment of its default example. cheers, egmont On Tue, Jul 9, 2019 at 9:01 PM Philippe Verdy via Unicode wrote: > > Is there a narrow space usable as a numeric group separator, and that also has the same bidi property as digits (i.e. neutral outside the span of digits and separators, but inheriting the implied directionality of the previous digit) ? > > I can't find a way to use narrow spaces instead of punctuation signs (dot or comma) for example in Arabic/Hebrew, for example to present tabular numeric data in a really language-neutral way. In Arabic/Hebrew we need to use punctuations as group separators because spaces don't work (not even the narrow non-breaking space U+202F used in French and recommended in ISO), but then these punctuation separators are interpreted differently (notably between French and English where the interpretation dot and comma are swapped) > > Note that: > - the "figure space" is not suitable (as it has the same width as digits and is used as a "filler" in tabular data; but it also does not have the correct bidi behavior, as it does not have the same bidi properties as digits). > - the "thin space" is not suitable (it is breakable) > - the "narrow non-breaking space" U+202F (used in French and currently in ISO) is not suitable, or may be I'm wrong and its presence is still neutral between groups of digits where it inherits the properties of the previous digit, but still does not enforces the bidi direction of the whole span of digits. 
> > Can you point me if U+202F is really suitable ? I made some tests with various text renderers, and some of them "break" the group of digits by reordering these groups, changing completely the rendered value (units become thousands or more, and thousands become units...). But may be these are bugs in renderers. > From unicode at unicode.org Tue Jul 9 15:43:06 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 9 Jul 2019 22:43:06 +0200 Subject: Numeric group separators and Bidi In-Reply-To: References: Message-ID: Well my first feeling was that U+202F should work all the time, but I found cases where this is not always the case. So this must be bugs in those renderers. And using Bidi controls (LRI/BDI) is absolutely not an option. These controls are only intended to be used in pure plain-text files that have no other ways to specify the embedding, and whose content is entirely static (no generated by templates that return data from unspecified locales to an unspecified locale). As well the option of localizing each item is not possible. That's why I search a locale-neutral solution that is acceptable in all languages, and does not give false interpretation on the actual values of numbers (which can have different scales or precision, and with also optional data, not always present in all items to render but added to the list, for example as annotations that should still be as locale-neutral as possible). So U+202F is supposed to the the solution, but I did not find any way to properly present the decimal separator: it is only unambiguous as a decimal separator (and not a group separator) if there's a group separator present in the number (and this is not always true!) And there I'm stuck with the dot or comma, with no appropriate symbol that would not be confusable (may be the small vertical tick hanging from the baseline could replace both the dot and the comma?). Le mar. 9 juil. 2019 ? 
22:10, Egmont Koblinger a écrit : > Hi Philippe, > > What do you mean U+202F doesn't work for you? > > Whereas the logical string "hebrew 123456 hebrew" indeed shows > the number incorrectly as "456 123", it's not the case with U+202F > instead of space, then the number shows up as "123 456" as expected. > > I think you need to pick a character whose BiDi class is "Common > Number Separator", see e.g. > https://www.compart.com/en/unicode/bidiclass/CS for a list of such > characters including U+00A0 no-break space and U+202F narrow no-break > space. This suggests to me that U+202F is a correct choice if you need > the look of a narrow space. > > Another possibility is to embed the number in a LRI...PDI block, as > e.g. https://unicode.org/cldr/utility/bidic.jsp does with the "1?3%" > fragment of its default example. > > cheers, > egmont > > On Tue, Jul 9, 2019 at 9:01 PM Philippe Verdy via Unicode > wrote: > > > > Is there a narrow space usable as a numeric group separator, and that > also has the same bidi property as digits (i.e. neutral outside the span of > digits and separators, but inheriting the implied directionality of the > previous digit) ? > > > > I can't find a way to use narrow spaces instead of punctuation signs > (dot or comma) for example in Arabic/Hebrew, for example to present tabular > numeric data in a really language-neutral way. In Arabic/Hebrew we need to > use punctuations as group separators because spaces don't work (not even > the narrow non-breaking space U+202F used in French and recommended in > ISO), but then these punctuation separators are interpreted differently > (notably between French and English where the interpretation dot and comma > are swapped) > > > > Note that: > > - the "figure space" is not suitable (as it has the same width as digits > and is used as a "filler" in tabular data; but it also does not have the > correct bidi behavior, as it does not have the same bidi properties as > digits).
> > - the "thin space" is not suitable (it is breakable) > > - the "narrow non-breaking space" U+202F (used in French and currently > in ISO) is not suitable, or may be I'm wrong and its presence is still > neutral between groups of digits where it inherits the properties of the > previous digit, but still does not enforces the bidi direction of the whole > span of digits. > > > > Can you point me if U+202F is really suitable ? I made some tests with > various text renderers, and some of them "break" the group of digits by > reordering these groups, changing completely the rendered value (units > become thousands or more, and thousands become units...). But may be these > are bugs in renderers. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jul 9 16:09:56 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Tue, 9 Jul 2019 23:09:56 +0200 Subject: Numeric group separators and Bidi In-Reply-To: References: Message-ID: On Tue, Jul 9, 2019 at 10:43 PM Philippe Verdy wrote: > > Well my first feeling was that U+202F should work all the time, but I found cases where this is not always the case. So this must be bugs in those renderers. Could you share some concrete examples? From unicode at unicode.org Tue Jul 9 18:19:05 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 10 Jul 2019 01:19:05 +0200 Subject: Fwd: Numeric group separators and Bidi In-Reply-To: References: Message-ID: > Well my first feeling was that U+202F should work all the time, but I > found cases where this is not always the case. So this must be bugs in > those renderers. > I think we can attribute these bugs to the fact that this character is insufficiently known, and not even accessible in most input tools... 
including the Windows "Charmap" where it is not even listed with other spaces or punctuation, except if we display the FULL list of characters supported by a selected font that maps it (many fonts don't map it) and the "Unicode" encoding. Windows Charmap is so outdated (and has many inconsistencies in its proposed grouping; look for example at the groups proposed for Greek: they are complete nonsense, with duplicate subranges and groups made completely arbitrarily, making this basic tool really difficult to use). And besides that, all the input methods proposed in Windows still don't offer it (this is also true on other platforms). So finally there is not enough text to render with it, and renderers are not fixed to render it correctly; developers think there's no emergency and that this bug is minor, so it can stay for years without ever being corrected (just like with the old "Charmap" on Windows), even if such a bug or omission was signaled repeatedly. This finally tends to perpetuate the old bad practices (and this is what happened with ASCII spreading everywhere, even in scopes where it should not have been used at all and certainly not selected as the only viable alternative; the same is seen today with the choice of languages/locales, where everything that is not English is treated as unimportant for users). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 10 02:57:35 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Wed, 10 Jul 2019 09:57:35 +0200 Subject: Numeric group separators and Bidi In-Reply-To: References: Message-ID: On Wed, Jul 10, 2019 at 1:21 AM Philippe Verdy via Unicode wrote: > >> Well my first feeling was that U+202F should work all the time, but I found cases where this is not always the case. So this must be bugs in those renderers. > > I think we can attribute these bugs What bugs?
I asked for an example, you haven't provided one, yet you blame others without even considering that you might be doing or expecting something wrong. So I'm asking again. Please show us an example along the lines of: "I'm using the FooBar software, version 1.2.3, this and that particular field. I enter some data; the hexdump of that data is included here. I expect it to render as 123, instead it renders as 321." I don't find it a nice attitude to blame others without having a thorough understanding of the situation, without having firm reasons to suspect the problem elsewhere rather than in your expectations. And if a renderer is incorrect (which is not impossible, but maybe a bit early to claim), you just have to ditch it and replace it with a correct one. Or, well, maybe your goal is to locate a set of faulty renderers, locate and understand their exact bugs, and find a workaround, i.e. a Unicode representation of numbers in RTL context with narrow spaces which is immune to those bugs? Not sure if any of us here are eager to help with that, I'm not, sorry. Not sure if it's possible at all (if there are really such bugs), probably not, given your further constraints such as not using BiDi control chars. egmont From unicode at unicode.org Sat Jul 13 00:36:15 2019 From: unicode at unicode.org (=?UTF-8?Q?S=C5=82awomir_Osipiuk?= via Unicode) Date: Sat, 13 Jul 2019 01:36:15 -0400 Subject: Unicode "no-op" Character? In-Reply-To: <001201d5293d$bd30bf10$37923d30$@gmail.com> References: <002401d5288f$6919cab0$3b4d6010$@gmail.com> <001201d5293d$bd30bf10$37923d30$@gmail.com> Message-ID: Hello again everyone, Though I initially took the shoo-away, there have been some comments made since then that I feel compelled to rebut. To avoid spamming the list, I've combined my responses into a single message. Before that, I will say, again, for the record: I know this NOOP idea is unlikely to ever happen. Certainly not with the responses I've gotten.
I haven't submitted it, nor even looked into how to. I know it would be rejected. This is a thought experiment, nothing more. If that doesn't interest you, please disregard this message. And again, the hypothetical NOOP is a character whose canonical equivalent is the absence of a character. The logical consequences of that statement apply fully. On Wed, Jul 3, 2019 at 8:00 PM Shawn Steele via Unicode wrote: > > Even more complicated is that, as pointed out by others, it's pretty much impossible to say "these n codepoints should be ignored and have no meaning" because some process would try to use codepoints 1-3 for some private meaning. Another would use codepoint 1 for their own thing, and there'd be a conflict. This is so utterly, completely, and severely missing the point I'm starting to feel like a madman screaming to the heavens, "Why can't they just understand?!" Yes, a different process will have a different private meaning for the codepoint. That is not a bug, it is a feature. A conflict is always resolved by the current process saying, "I'm holding the string now. The old NOOPs are gone, canonically decomposed to nothing. The new ones mean what I want them to mean, as long as I or my buddies hold the string. If you didn't want that, you shouldn't have given the string to me!" This conflict-resolution mechanism is the special sauce. If a process needs a private marker that will be preserved in interchange, there are plenty of PUA characters to use, and even a couple of private control characters. > I also think that the conversation has pretty much proven that such a system is mathematically impossible. (You can't have a "private" no-meaning codepoint that won't conflict with other "private" uses in a public space). No such thing has been proven in the slightest. Any conflict is resolved, in the default case, by normalizing all NOOPs to nothing. On Wed, Jul 3, 2019 at 5:46 PM Mark E. Shoulson via Unicode wrote: > > Um... 
How could you be sure that process X would get the no-ops that process W wrote? After all, it's *discardable*, like you said, and the database programs and libraries aren't in on the secret. Yes, there is a requirement that W and X communicate via some "NOOP-preserving path" (call it a NOOPPP). Such paths would generally be very short and direct, because NOOPs are intended to be ephemeral, not archival! They wouldn't be hard to come by. Memory mappings or pipes. Direct inter-process comms. Anything that operates at the byte level. Even simple persisting mechanisms like file storage or databases can preserve NOOP by doing... nothing. "Discardable" doesn't mean it must be discarded, merely that it can be. Where there are no security implications or other need, strings containing NOOP can simply be passed through and stored as-is. Where any interface, library, or process does not preserve NOOP, it cannot be part of a NOOPPP. Tough luck. > Moreover, as you say, what about when Process Z (or its companions) comes along and is using THE SAME MECHANISM for something utterly different? How does it know that process W wasn't writing no-ops for it, but was writing them for Process X? It is the responsibility of Process Z (and any process that interprets NOOPs non-trivially) to be aware of the context/source of what it's receiving. Prior agreement or advertised contract. On Wed, Jul 3, 2019 at 2:06 PM Rebecca Bettencourt wrote: > > And the database driver filters out the U+000F completely as a matter of best practice and security-in-depth. I'm struggling to see the security implication of "store this string, verbatim, in your regular VARCHAR (or whatever) text field". I can store the string "DROP TABLE [STUDENTS];" in a text field and unless the database is horribly broken it will store that without issue. A database could strip NOOP out of text fields and still claim to be Unicode conformant. But I wonder why it would bother to do that.
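The purge-on-receipt behaviour being debated can be sketched in a few lines of Python, with U+000F standing in as the purely hypothetical private sentinel discussed in this thread (nothing in Unicode actually defines such a discardable character):

```python
SENTINEL = "\u000f"  # hypothetical private "no-op" marker; not an actual Unicode feature

def packetize(text: str, size: int) -> str:
    """Insert the sentinel every `size` code points as a private packet boundary."""
    return SENTINEL.join(text[i:i + size] for i in range(0, len(text), size))

def purge(text: str) -> str:
    """A receiving process treats the sentinel as equivalent to no character at all."""
    return text.replace(SENTINEL, "")

marked = packetize("abcdefgh", 3)
assert marked.count(SENTINEL) == 2  # boundaries after "abc" and "def"
assert purge(marked) == "abcdefgh"  # display copy: the sentinel simply vanishes
```

The sticking point raised by others in the thread is precisely that nothing outside a pair of cooperating processes can be relied on either to preserve the marker or to purge it.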
And even then, you could just store the string in a VARBINARY field or whatever just accepts bytes. > You can't say "this character should be ignored everywhere" and "this character should be preserved everywhere" at the same time. That's the contradiction. I have not said "this character should be preserved everywhere". That statement is completely false. Unfortunately, that means what I said is still not being understood at all. Forgive me for being frustrated. Finally, a general comment: I think people are getting hung up on this idea because they're still thinking in terms of what is being guaranteed, while this is explicitly about an inversion of that concept. Not a guarantee, but a disclaimer. I called it an "ephemeral private sentinel" because that name captures what it is. It's not for archiving or interchange, except for extremely short and direct cases under special conditions. Most objections I've gotten so far arise out of misunderstanding and attempts to force normal character behaviour on it. I can take criticism, but not when it's based on a completely false premise. Define a character that is canonically equivalent to the absence of a character. Make it so a conforming receiving process is able to purge it whenever convenient. That's not hard to implement, especially in relation to other existing requirements. But would it be useful? I claim it would be very useful indeed. Many things that can be done with ordinary characters will not be possible with this one. That's fine. Other things will be possible. This idea isn't really dissimilar to the original intended meanings of SYN or NUL or DEL, or for that matter to Unicode noncharacters. In fact if the standard had enforced purging noncharacters during interchange (instead of vacillating about their illegality before currently recommending they be preserved or at least U+FFFDed) we'd already be 99% of the way to what I suggested.
The ideal opportunity to define this behaviour (for a single code point or a set) was almost three decades ago, but it definitely could have been done, and it would not have been expensive. I just hold onto this idea for that day I get a time machine. From unicode at unicode.org Mon Jul 15 13:03:44 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 15 Jul 2019 11:03:44 -0700 Subject: Numeric group separators and Bidi Message-ID: <20190715110344.665a7a7059d7ee80bb4d670165c8327d.1ffd289290.wbe@email03.godaddy.com> Philippe Verdy wrote: > [... U+202F ...] and not even accessible in most input tools... > including the Windows "Charmap" where it is not even listed with other > spaces or punctuations, except if we display the FULL list of > characters supported by a selected font that maps it (many fonts don't > map it) and the "Unicode" encoding. Windows charmap is so outdated > (and has many inconsistencies in its proposed grouping, look for > example at the groups proposed for Greek, they are complete nonsense, > with duplicate subranges, but groups made completely arbitrarily, > making this basic tool really difficult to use). BabelMap (http://www.babelstone.co.uk/Software/BabelMap.html) is free of charge, is easy to use, runs on all versions of Windows since 2000, provides much better support for almost all Character Map functions than Character Map, and has tons of additional useful features which can be easily ignored if not needed. The only possible reason for a knowledgeable, let alone Unicode-knowledgeable, Windows user to use the built-in Character Map utility instead of BabelMap would be to look up and pick from legacy character sets (which I think is what Philippe is referring to as "proposed groupings").
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jul 17 12:37:53 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 17 Jul 2019 10:37:53 -0700 Subject: Removing accents and diacritics from a word In-Reply-To: References: Message-ID: <94a3e51a-7f44-fb19-1dfa-f9d384a97ad3@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 13:07:14 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Wed, 17 Jul 2019 11:07:14 -0700 Subject: Removing accents and diacritics from a word In-Reply-To: References: <94a3e51a-7f44-fb19-1dfa-f9d384a97ad3@ix.netcom.com> Message-ID: <55085b14-2871-33cb-1c89-174d84826774@ix.netcom.com> On 7/17/2019 11:02 AM, Norbert Lindenberg wrote: > "Misspelling"? Not helpful. Anybody have a serious suggestion? A./ > > >> On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode wrote: >> >> A question has come up in another context: >> >> Is there any linguistic term for describing the process of removing accents and diacritics from a word to create its "base form", e.g. São Tomé to Sao Tome? >> >> The linguistic term "string normalization" appears not that preferable in a computing context. >> >> Any ideas? >> >> A./ >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 13:25:02 2019 From: unicode at unicode.org (=?UTF-8?Q?S=C5=82awomir_Osipiuk?= via Unicode) Date: Wed, 17 Jul 2019 14:25:02 -0400 Subject: Removing accents and diacritics from a word In-Reply-To: <94a3e51a-7f44-fb19-1dfa-f9d384a97ad3@ix.netcom.com> References: <94a3e51a-7f44-fb19-1dfa-f9d384a97ad3@ix.netcom.com> Message-ID: <002401d53ccc$f3c42290$db4c67b0$@gmail.com> "Transliteration"? Maybe more generic than what you're looking for. Used for the process of producing the "machine readable zone" on passports: https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf (see section 6, page 30) "Accent folding"
or "diacritic folding" is used in some places. String folding is "A string transform F, with the property that repeated applications of the same function F produce the same output: F(F(S)) = F(S) for all input strings S". Accent folding is a special case of that. https://unicode.org/reports/tr23/#StringFunctionClassificationDefinitions https://alistapart.com/article/accent-folding-for-auto-complete/ From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag via Unicode Sent: Wednesday, July 17, 2019 13:38 To: Unicode Mailing List Subject: Removing accents and diacritics from a word A question has come up in another context: Is there any linguistic term for describing the process of removing accents and diacritics from a word to create its "base form", e.g. São Tomé to Sao Tome? The linguistic term "string normalization" appears not that preferable in a computing context. Any ideas? A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 13:37:38 2019 From: unicode at unicode.org (Tex via Unicode) Date: Wed, 17 Jul 2019 11:37:38 -0700 Subject: Removing accents and diacritics from a word In-Reply-To: <55085b14-2871-33cb-1c89-174d84826774@ix.netcom.com> References: <94a3e51a-7f44-fb19-1dfa-f9d384a97ad3@ix.netcom.com> <55085b14-2871-33cb-1c89-174d84826774@ix.netcom.com> Message-ID: <007101d53cce$b5f835d0$21e8a170$@xencraft.com> Asmus, are you including the case where an accented character maps to two unaccented characters? e.g. Å to AA or Æ to AE From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (c) via Unicode Sent: Wednesday, July 17, 2019 11:07 AM To: Norbert Lindenberg Cc: Unicode Mailing List Subject: Re: Removing accents and diacritics from a word On 7/17/2019 11:02 AM, Norbert Lindenberg wrote: "Misspelling"? Not helpful. Anybody have a serious suggestion?
A./ On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode wrote: A question has come up in another context: Is there any linguistic term for describing the process of removing accents and diacritics from a word to create its "base form", e.g. São Tomé to Sao Tome? The linguistic term "string normalization" appears not that preferable in a computing context. Any ideas? A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 17:55:42 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 18 Jul 2019 00:55:42 +0200 Subject: ISO 15924 : missing indication of support for Syriac variants Message-ID: The ISO 15924/RA reference page contains indication of support in Unicode for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant: 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1 2014-11-15 ... 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec bopomofo (alias pour han + bopomofo) 1.1 2016-01-19 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han 1.1 2009-02-23 501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée) 1.1 2004-05-29 502 *Hant* Han (Traditional variant) idéogrammes han (variante traditionnelle) 1.1 2004-05-29 ... 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1 2004-05-01 215 *Latn* Latin latin Latin 1.1 2004-05-01 ... There are other entries for aliases or mixed script also for Japanese and Korean.
But for Syriac variants this is missing and this is the only script for which this occurs: 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo) 2004-05-01 137 Syrj Syriac (Western variant) syriaque (variante occidentale) 2004-05-01 136 Syrn Syriac (Eastern variant) syriaque (variante orientale) 2004-05-01 Why is there no Unicode version given for these 3 variants? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 18:07:58 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 18 Jul 2019 01:07:58 +0200 Subject: ISO 15924 : missing indication of support for Syriac variants In-Reply-To: References: Message-ID: Note also that there are variants registered with Unicode versions (Age) for symbols, even if they don't have any assigned Unicode alias, but this is not a problem. 994 Zinh Code for inherited script codet pour écriture héritée Inherited 2009-02-23 995 *Zmth * Mathematical notation notation mathématique 3.2 2007-11-26 993 *Zsye * Symbols (Emoji variant) symboles (variante émoji) 6.0 2015-12-16 996 *Zsym * Symbols symboles 1.1 2007-11-26 The Unicode version is an important information which allows determining that texts created in a given language (or notation), and written in these scripts, can be written using the UCS. Weren't the 3 variants of Syriac unified in Unicode (even if they may be distinguished in ISO 15924, for example to allow selecting a suitable but preferred sets of fonts, like this is commonly used for Chinese Mandarin, Arabic, Japanese, Korean or Latin)? Le jeu. 18 juil. 2019 à 00:55, Philippe Verdy a écrit : > The ISO 15924/RA reference page contains indication of support in Unicode > for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant:
> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01 > 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1 > 2014-11-15 > ... > 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec bopomofo > (alias pour han + bopomofo) 1.1 2016-01-19 > > 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han 1.1 > 2009-02-23 > > 501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée) > 1.1 2004-05-29 > 502 *Hant* Han (Traditional variant) idéogrammes han (variante > traditionnelle) 1.1 2004-05-29 > ... > 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01 > 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1 2004-05-01 > 215 *Latn* Latin latin Latin 1.1 2004-05-01 > ... > There are other entries for aliases or mixed script also for Japanese and > Korean. > > But for Syriac variants this is missing and this is the only script for > which this occurs: > 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01 > 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo) > 2004-05-01 > 137 Syrj Syriac (Western variant) syriaque (variante occidentale) > 2004-05-01 > 136 Syrn Syriac (Eastern variant) syriaque (variante orientale) 2004-05-01 > Why is there no Unicode version given for these 3 variants? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 18:16:52 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 18 Jul 2019 01:16:52 +0200 Subject: ISO 15924 : missing indication of support for Syriac variants In-Reply-To: References: Message-ID: Sorry I misread (with an automated tool) an old dataset where these "3.0" versions were indicated in an incorrect form Le jeu. 18 juil. 2019 à 01:07, Philippe Verdy a écrit : > Note also that there are variants registered with Unicode versions (Age) > for symbols, even if they don't have any assigned Unicode alias, but this > is not a problem.
> 994 Zinh Code for inherited script codet pour écriture héritée Inherited > 2009-02-23 > 995 *Zmth * Mathematical > notation notation mathématique 3.2 2007-11-26 > 993 *Zsye * Symbols > (Emoji variant) symboles (variante émoji) 6.0 2015-12-16 > 996 *Zsym > * > Symbols symboles 1.1 2007-11-26 > The Unicode version is an important information which allows determining > that texts created in a given language (or notation), and written in these > scripts, can be written using the UCS. > > Weren't the 3 variants of Syriac unified in Unicode (even if they may be > distinguished in ISO 15924, for example to allow selecting a suitable but > preferred sets of fonts, like this is commonly used for Chinese Mandarin, > Arabic, Japanese, Korean or Latin)? > > > Le jeu. 18 juil. 2019 à 00:55, Philippe Verdy a > écrit : > >> The ISO 15924/RA reference page contains indication of support in Unicode >> for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant: >> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01 >> 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1 >> 2014-11-15 >> ... >> 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec >> bopomofo (alias pour han + bopomofo) 1.1 2016-01-19 >> >> 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han >> 1.1 2009-02-23 >> >> 501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée) >> 1.1 2004-05-29 >> 502 *Hant* Han (Traditional variant) idéogrammes han (variante >> traditionnelle) 1.1 2004-05-29 >> ... >> 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01 >> 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1 >> 2004-05-01 >> 215 *Latn* Latin latin Latin 1.1 2004-05-01 >> ... >> There are other entries for aliases or mixed script also for Japanese and >> Korean.
>> >> But for Syriac variants this is missing and this is the only script for >> which this occurs: >> 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01 >> 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo) >> 2004-05-01 >> 137 Syrj Syriac (Western variant) syriaque (variante occidentale) >> 2004-05-01 >> 136 Syrn Syriac (Eastern variant) syriaque (variante orientale) >> 2004-05-01 >> Why is there no Unicode version given for these 3 variants? >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 18:55:03 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Wed, 17 Jul 2019 16:55:03 -0700 Subject: Removing accents and diacritics from a word In-Reply-To: <007101d53cce$b5f835d0$21e8a170$@xencraft.com> References: <94a3e51a-7f44-fb19-1dfa-f9d384a97ad3@ix.netcom.com> <55085b14-2871-33cb-1c89-174d84826774@ix.netcom.com> <007101d53cce$b5f835d0$21e8a170$@xencraft.com> Message-ID: On 7/17/2019 11:37 AM, Tex wrote: > > Asmus, are you including the case where an accented character maps to > two unaccented characters? > > e.g. Å to AA or Æ to AE > If that's covered by the same term; but it's not simple "typewriter/telegraph" fallback. > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag (c) via Unicode > *Sent:* Wednesday, July 17, 2019 11:07 AM > *To:* Norbert Lindenberg > *Cc:* Unicode Mailing List > *Subject:* Re: Removing accents and diacritics from a word > > On 7/17/2019 11:02 AM, Norbert Lindenberg wrote: > > "Misspelling"? > > Not helpful. Anybody have a serious suggestion? > > A./ > > On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode wrote: > > A question has come up in another context: > > Is there any linguistic term for describing the process of removing accents and diacritics from a word to create its "base form", e.g. São Tomé to Sao Tome?
> > The linguistic term "string normalization" appears not that preferable in a computing context. > > Any ideas? > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 18:54:52 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 18 Jul 2019 01:54:52 +0200 Subject: ISO 15924 : missing indication of support for Syriac variants In-Reply-To: References: Message-ID: But my concern is in fact valid as well for Egyptian Hieratic (considered in Chapter 14 to be "unified" with the Hieroglyphs, and being a cursive variant, currently not supported in any font because of the very complex set of ligatures this would require, and that may not even work properly with the existing markup notations used with Hieroglyphs). But if the "Manuel de codage" for Egyptian Hieroglyphs (describing a markup notation) contains extensions to represent the Hieratic variants with the unified Hieroglyphs, then the Unicode version (age) used for Hieroglyphs should also be assigned to Hieratic. In fact the ligatures system for the "cursive" Egyptian Hieratic is so complex (and may also have its own variants showing its progression from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic should no longer be considered "unified" with Hieroglyphs, and its existing ISO 15924 code is then not represented at all in Unicode. For now ISO 15924 still does not consider Egyptian Hieratic to be "unified" with Egyptian Hieroglyphs; this is not indicated in its descriptive names given in English or French with a suffix like "(cursive variant of Egyptian Hieroglyphs)", and it has no "Unicode Age" version given, as if it was still not encoded at all by Unicode, and then Chapter 14 of the standard (in its section about Hieroglyphs where Hieratic is cited once) is probably misleading, waiting for further studies. And I'm unable to find any non-proprietary (interoperable?) 
attempt to encode Hieratic, the only attempts being with Hieroglyphs. Le jeu. 18 juil. 2019 à 01:16, Philippe Verdy a écrit : > Sorry I misread (with an automated tool) an old dataset where these "3.0" > versions were indicated in an incorrect form > > Le jeu. 18 juil. 2019 à 01:07, Philippe Verdy a > écrit : > >> Note also that there are variants registered with Unicode versions (Age) >> for symbols, even if they don't have any assigned Unicode alias, but this >> is not a problem. >> 994 Zinh Code for inherited script codet pour écriture héritée Inherited >> 2009-02-23 >> 995 *Zmth * Mathematical >> notation notation mathématique 3.2 2007-11-26 >> 993 *Zsye * Symbols >> (Emoji variant) symboles (variante émoji) 6.0 2015-12-16 >> 996 *Zsym >> * >> Symbols symboles 1.1 2007-11-26 >> The Unicode version is an important information which allows determining >> that texts created in a given language (or notation), and written in these >> scripts, can be written using the UCS. >> >> Weren't the 3 variants of Syriac unified in Unicode (even if they may be >> distinguished in ISO 15924, for example to allow selecting a suitable but >> preferred sets of fonts, like this is commonly used for Chinese Mandarin, >> Arabic, Japanese, Korean or Latin)? >> >> >> Le jeu. 18 juil. 2019 à 00:55, Philippe Verdy a >> écrit : >> >>> The ISO 15924/RA reference page contains indication of support in >>> Unicode for variants of various scripts such as Aran, Latf, Latg, Hanb, >>> Hans, Hant: >>> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01 >>> 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1 >>> 2014-11-15 >>> ...
>>> 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec >>> bopomofo (alias pour han + bopomofo) 1.1 2016-01-19 >>> >>> 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han >>> 1.1 2009-02-23 >>> >>> 501 *Hans* Han (Simplified variant) idéogrammes han (variante >>> simplifiée) 1.1 2004-05-29 >>> 502 *Hant* Han (Traditional variant) idéogrammes han (variante >>> traditionnelle) 1.1 2004-05-29 >>> ... >>> 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 >>> 2004-05-01 >>> 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1 >>> 2004-05-01 >>> 215 *Latn* Latin latin Latin 1.1 2004-05-01 >>> ... >>> There are other entries for aliases or mixed script also for Japanese >>> and Korean. >>> >>> But for Syriac variants this is missing and this is the only script for >>> which this occurs: >>> 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01 >>> 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo) >>> 2004-05-01 >>> 137 Syrj Syriac (Western variant) syriaque (variante occidentale) >>> 2004-05-01 >>> 136 Syrn Syriac (Eastern variant) syriaque (variante orientale) >>> 2004-05-01 >>> Why is there no Unicode version given for these 3 variants? >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jul 17 19:05:58 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Wed, 17 Jul 2019 17:05:58 -0700 Subject: Removing accents and diacritics from a word In-Reply-To: <002401d53ccc$f3c42290$db4c67b0$@gmail.com> References: <94a3e51a-7f44-fb19-1dfa-f9d384a97ad3@ix.netcom.com> <002401d53ccc$f3c42290$db4c67b0$@gmail.com> Message-ID: On 7/17/2019 11:25 AM, Sławomir Osipiuk wrote: > > "Transliteration"? > > Maybe more generic than what you're looking for. Used for the process > of producing the "machine readable zone"
on passports: > > https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf (see > section 6, page 30) > > "Accent folding" or "diacritic folding" is used in some places. String > folding is "A string transform F, with the property that repeated > applications of the same function F produce the same output: F(F(S)) = > F(S) for all input strings S". Accent folding is a special case of that. > > https://unicode.org/reports/tr23/#StringFunctionClassificationDefinitions > > https://alistapart.com/article/accent-folding-for-auto-complete/ > Diacritic folding. Thanks. Just didn't think of the operation as folding the way it came up, but that's what it is. A./ > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag via Unicode > *Sent:* Wednesday, July 17, 2019 13:38 > *To:* Unicode Mailing List > *Subject:* Removing accents and diacritics from a word > > A question has come up in another context: > > Is there any linguistic term for describing the process of removing > accents and diacritics from a word to create its "base form", e.g. São > Tomé to Sao Tome? > > The linguistic term "string normalization" appears not that preferable > in a computing context. > > Any ideas? > > A./ > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Wed Jul 17 20:03:29 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 18 Jul 2019 02:03:29 +0100 Subject: ISO 15924 : missing indication of support for Syriac variants In-Reply-To: References: Message-ID: <20190718020329.4d9ba6d1@JRWUBU2> On Thu, 18 Jul 2019 01:54:52 +0200 Philippe Verdy via Unicode wrote: > In fact the ligatures system for the "cursive" Egyptian Hieratic is so > complex (and may also have its own variants showing its progression > from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic > should no longer be considered "unified" with Hieroglyphs, and its > existing ISO 15924 code is then not represented at all in Unicode. Writing hieroglyphic text as plain text has only been supported since Unicode 12.0, so it may take a little while to explore workable encoding conventions. A significant issue is that the hieratic script is right to left but Unicode only standardises the encoding of left-to-right transcriptions. I don't recall the difference between retrograde v. normal text being declared a style difference. For comparison, we still have no guidance on how to encode sexagesimal Mesopotamian cuneiform numbers, e.g. '610' v. '20' written using the U graphic element. Richard. From unicode at unicode.org Wed Jul 17 21:31:58 2019 From: unicode at unicode.org (=?UTF-8?B?WWlmw6FuIFfDoW5n?= via Unicode) Date: Thu, 18 Jul 2019 11:31:58 +0900 Subject: Unicode's got a new logo? Message-ID: Hi there, I cannot help but notice the new home.unicode.org site embraces a new logo, blue base color with a humanist type, rather than the traditional one, red and geometric. Does anybody know if it means that Unicode wants to renew its logo or that they serve for different purposes? Which should I cite as the official logo? I think I've read the description and the blog post but couldn't find an explanation. Thank you. 
From unicode at unicode.org Wed Jul 17 23:01:30 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 17 Jul 2019 21:01:30 -0700 Subject: ISO 15924 : missing indication of support for Syriac variants In-Reply-To: <20190718020329.4d9ba6d1@JRWUBU2> References: <20190718020329.4d9ba6d1@JRWUBU2> Message-ID: <9a70fa52-e298-6af6-6bf5-d54c0caf70f5@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jul 18 11:10:20 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 18 Jul 2019 09:10:20 -0700 Subject: ISO 15924 : missing indication of support for Syriac variants In-Reply-To: References: Message-ID: <6ea6f3c4-e809-c258-58a8-14e05937f055@sonic.net> On 7/17/2019 4:54 PM, Philippe Verdy via Unicode wrote: > then the Unicode version (age) used for Hieroglyphs should also be > assigned to Hieratic. It is already. > > In fact the ligatures system for the "cursive" Egyptian Hieratic is so > complex (and may also have its own variants showing its progression > from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic > should no longer be considered "unified" with Hieroglyphs, and its > existing ISO 15924 code is then not represented at all in Unicode. It *is* considered unified with Egyptian hieroglyphs, until such time as anyone would make a serious case that the Unicode Standard (and students of the Egyptian hieroglyphs, in both their classic, monumental forms and in hieratic) would be better served by a disunification. Note that *many* cursive forms of scripts are not easily "supported" by out-of-the-box plain text implementations, for obvious reasons. And in the case of Egyptian hieroglyphs, it would probably be a good strategy to first get some experience in implementations/fonts supporting the Unicode 12.0 controls for hieroglyphs, before worrying too much about what does or doesn't work to represent hieratic texts adequately. (Demotic is clearly a different case.) 
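[Editorial sketch] Ken's suggestion — get implementation experience with the Unicode 12.0 hieroglyph controls before tackling hieratic — can be made concrete with a simple range check. The ranges below are the Egyptian Hieroglyphs block (U+13000–U+1342F, Unicode 5.2) and the nine Egyptian Hieroglyph Format Controls added in 12.0 (U+13430 VERTICAL JOINER through U+13438 END SEGMENT); this is a classifier sketch only, not a quadrat-rendering implementation.

```python
def is_egyptian_hieroglyph(cp: int) -> bool:
    # Egyptian Hieroglyphs block, U+13000..U+1342F (encoded in Unicode 5.2)
    return 0x13000 <= cp <= 0x1342F

def is_hieroglyph_format_control(cp: int) -> bool:
    # Egyptian Hieroglyph Format Controls added in Unicode 12.0:
    # U+13430 VERTICAL JOINER .. U+13438 END SEGMENT
    return 0x13430 <= cp <= 0x13438

# U+13000 (A001), U+13430 (vertical joiner), U+13001 (A002)
text = "\U00013000\U00013430\U00013001"
controls = [hex(ord(c)) for c in text if is_hieroglyph_format_control(ord(c))]
```

A renderer that does not yet support the 12.0 controls could use such a check to decide when to fall back to linear display rather than quadrat composition.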
> > For now ISO 15924 still does not consider Egyptian Hieratic to be > "unified" with Egyptian Hieroglyphs; this is not indicated in its > descriptive names given in English or French with a suffix like > "(cursive variant of Egyptian Hieroglyphs)", *and it has no "Unicode > Age" version given, as if it was still not encoded at all by Unicode*, That latter part of that statement (highlighted) is false, as is easily determined by simple inspection of the Egyh entry on: https://www.unicode.org/iso15924/iso15924-codes.html --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jul 18 13:50:03 2019 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Thu, 18 Jul 2019 20:50:03 +0200 Subject: Unicode's got a new logo? In-Reply-To: References: Message-ID: <20190718185003.JLYdu%steffen@sdaoden.eu> Yifán Wáng via Unicode wrote in : |I cannot help but notice the new home.unicode.org site embraces a new |logo, blue base color with a humanist type, rather than the |traditional one, red and geometric. Does anybody know if it means that |Unicode wants to renew its logo or that they serve for different |purposes? Which should I cite as the official logo? I think I've read |the description and the blog post but couldn't find an explanation. I also decided to enter /L2 directly from now on. I am happy that you give me the opportunity to finally send a mail regarding this topic. (Excuses to the designers from Adobe.)
--steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Thu Jul 18 14:06:45 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 18 Jul 2019 20:06:45 +0100 Subject: ISO 15924 : missing indication of support for Syriac variants In-Reply-To: <9a70fa52-e298-6af6-6bf5-d54c0caf70f5@ix.netcom.com> References: <20190718020329.4d9ba6d1@JRWUBU2> <9a70fa52-e298-6af6-6bf5-d54c0caf70f5@ix.netcom.com> Message-ID: <20190718200645.1a34aba4@JRWUBU2> On Wed, 17 Jul 2019 21:01:30 -0700 Asmus Freytag via Unicode wrote: > On 7/17/2019 6:03 PM, Richard Wordingham via Unicode wrote: >> A significant issue is that the hieratic script is right to left but >> Unicode only standardises the encoding of left-to-right >> transcriptions. I don't recall the difference between retrograde v. >> normal text being declared a style difference. > Use directional overrides. Those have been in the standard forever. How do they help distinguish normal right-to-left text and right-to-left retrograde text? As I understand it, the implementer has to guess which way characters in an ancient script face when the direction of the text is overridden. Unicode used to define the orientation, but that got withdrawn a few years ago. Richard. From unicode at unicode.org Thu Jul 18 14:12:23 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 18 Jul 2019 12:12:23 -0700 Subject: Access to the Unicode technical site (was: Re: Unicode's got a new logo?) In-Reply-To: <20190718185003.JLYdu%steffen@sdaoden.eu> References: <20190718185003.JLYdu%steffen@sdaoden.eu> Message-ID: <3d1676bb-f3c1-8a3e-fdc5-1c0bdd74afec@sonic.net> On 7/18/2019 11:50 AM, Steffen Nurpmeso via Unicode wrote: > I also decided to enter /L2 directly from now on. 
For folks wishing to access the UTC document register, Unicode Consortium standards, and so forth, all of those links will be permanently stable. They are not impacted by the rollout of the new home page and its related content. If you need access to the more technical information from the UTC, CLDR-TC, ICU-TC, etc., feel free to bookmark such pages as: https://www.unicode.org/L2/ for the UTC document register. https://www.unicode.org/charts/ for the Unicode code charts index, https://www.unicode.org/versions/latest/ for the latest version of the Unicode Standard, and so forth. All such technical links are stable on the site, and will continue to be stable. For general access to the technical content on the Unicode website, see: https://www.unicode.org/main.html which provides easy link access to all the technical content areas and to the ongoing technical committee work. --Ken From unicode at unicode.org Thu Jul 18 15:38:20 2019 From: unicode at unicode.org (Walter Tross via Unicode) Date: Thu, 18 Jul 2019 22:38:20 +0200 Subject: Removing accents and diacritics from a word In-Reply-To: <0d44efaf-638d-e7f9-572c-88c41559cf99@ix.netcom.com> References: <94a3e51a-7f44-fb19-1dfa-f9d384a97ad3@ix.netcom.com> <002401d53ccc$f3c42290$db4c67b0$@gmail.com> <0d44efaf-638d-e7f9-572c-88c41559cf99@ix.netcom.com> Message-ID: OK, but if I, as a German, were to search for München in a context where I only had ASCII characters available, I would type Muenchen. Il giorno gio 18 lug 2019 alle ore 22:23 Asmus Freytag (c) < asmusf at ix.netcom.com> ha scritto: > On 7/18/2019 1:08 PM, Walter Tross wrote: > > Please remember that diacritics carry information. > > That goes without saying. The context is for a situation like the one > where you might need to allow someone to enter a word without accents (e.g. > because they don't have the right keyboard).
> In Italian, e.g., where the grave or acute accent is almost always at the > end of words, this information is preserved, when transliterating, by > removing the accent and appending an apostrophe, like in però → pero' (pero > would be a different word). E.g., my father-in-law has Nicolo' instead of > Nicolò on his credit card. > In German, ä, ö and ü are transliterated as ae, oe and ue. E.g., the > portal of München (Munich) is https://www.muenchen.de/ > Etc. > > whether to fold the umlauts using the added "e" or just the base letter, > or doing both, would depend on the circumstance. > > This is not about preserving information, but enabling access/search from > an approximation of the full word. > > A./ > > > > Il giorno gio 18 lug 2019 alle ore 02:09 Asmus Freytag (c) via Unicode < > unicode at unicode.org> ha scritto: > >> On 7/17/2019 11:25 AM, Sławomir Osipiuk wrote: >> >> "Transliteration"? >> >> Maybe more generic than what you're looking for. Used for the process of >> producing the "machine readable zone" on passports: >> >> https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf (see >> section 6, page 30) >> >> >> >> "Accent folding" or "diacritic folding" is used in some places. String >> folding is "A string transform F, with the property that repeated >> applications of the same function F produce the same output: F(F(S)) = F(S) >> for all input strings S". Accent folding is a special case of that. >> >> https://unicode.org/reports/tr23/#StringFunctionClassificationDefinitions >> >> https://alistapart.com/article/accent-folding-for-auto-complete/ >> >> Diacritic folding. Thanks. Just didn't think of the operation as folding >> the way it came up, but that's what it is.
>> >> A./ >> >> >> >> >> >> >> *From:* Unicode [mailto:unicode-bounces at unicode.org >> ] *On Behalf Of *Asmus Freytag via Unicode >> *Sent:* Wednesday, July 17, 2019 13:38 >> *To:* Unicode Mailing List >> *Subject:* Removing accents and diacritics from a word >> >> >> >> A question has come up in another context: >> >> Is there any linguistic term for describing the process of removing >> accents and diacritics from a word to create its "base form", e.g. São Tomé >> to Sao Tome? >> >> The linguistic term "string normalization" does not seem preferable in >> a computing context. >> >> Any ideas? >> >> A./ >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jul 19 08:43:35 2019 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Fri, 19 Jul 2019 15:43:35 +0200 Subject: Access to the Unicode technical site (was: Re: Unicode's got a new logo?) In-Reply-To: <3d1676bb-f3c1-8a3e-fdc5-1c0bdd74afec@sonic.net> References: <20190718185003.JLYdu%steffen@sdaoden.eu> <3d1676bb-f3c1-8a3e-fdc5-1c0bdd74afec@sonic.net> Message-ID: <20190719134335.63Ism%steffen@sdaoden.eu> Hello Mr. Ken Whistler. Ken Whistler wrote in <3d1676bb-f3c1-8a3e-fdc5-1c0bdd74afec at sonic.net>: |On 7/18/2019 11:50 AM, Steffen Nurpmeso via Unicode wrote: |> I also decided to enter /L2 directly from now on. | |For folks wishing to access the UTC document register, Unicode |Consortium standards, and so forth, all of those links will be |permanently stable. They are not impacted by the rollout of the new home |page and its related content. | |If you need access to the more technical information from the UTC, |CLDR-TC, ICU-TC, etc., feel free to bookmark such pages as: | |https://www.unicode.org/L2/ | |for the UTC document register. | |https://www.unicode.org/charts/ | |for the Unicode code charts index, | |https://www.unicode.org/versions/latest/ | |for the latest version of the Unicode Standard, and so forth.
All such |technical links are stable on the site, and will continue to be stable. Are these things still linked from the top homepage? Thank you very much for the information. (My gut feeling is that it is tremendous that very highly qualified people care for such vanities.) |For general access to the technical content on the Unicode website, see: | |https://www.unicode.org/main.html | |which provides easy link access to all the technical content areas and |to the ongoing technical committee work. I hopefully will come to truly Unicode the things I do!! (By then programming will hopefully be true fun again. I hope..) A nice weekend I wish, from soon sunny again Germany! --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Fri Jul 19 23:07:13 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 20 Jul 2019 05:07:13 +0100 Subject: Breaking lines at Grapheme Boundaries Message-ID: <20190720050713.44604d56@JRWUBU2> If a renderer claims to support a writing system, should it render the text reasonably if its client breaks lines at extended grapheme cluster boundaries? The writing system itself has no compunction about breaking lines between legacy grapheme clusters, though I've no idea how I should get a mere word-processor to implement some of these line breaks. (The big problem here is that Indic reordering would be required around the line break.) Richard.
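The legacy-versus-extended distinction in the message above can be sketched in code. What follows is a rough approximation of *legacy* grapheme clusters only, using Python's standard-library unicodedata module: it attaches characters of General_Category Mn, Mc and Me to the preceding base, and deliberately ignores the extra rules (Hangul jamo, ZWJ sequences, prepended characters, regional indicators) that full UAX #29 extended-cluster segmentation requires. The function name is mine, not from any library.

```python
import unicodedata

def legacy_clusters(text):
    """Approximate legacy grapheme clusters: a base character plus
    any following combining marks (categories Mn, Mc, Me).
    NOT full UAX #29 segmentation -- Hangul jamo, ZWJ emoji
    sequences, prepend characters, etc. are not handled."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "e" + U+0301 COMBINING ACUTE ACCENT stays one cluster;
# the following "a" starts a new one.
print(legacy_clusters("e\u0301a"))
```

A line breaker that works on these units would still split reordering sequences in scripts like Tai Tham, which is exactly the gap between what this sketch finds and what a renderer must cope with.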
From unicode at unicode.org Sat Jul 20 04:44:03 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 20 Jul 2019 11:44:03 +0200 Subject: ISO 15924 : missing indication of support for Syriac variants In-Reply-To: <6ea6f3c4-e809-c258-58a8-14e05937f055@sonic.net> References: <6ea6f3c4-e809-c258-58a8-14e05937f055@sonic.net> Message-ID: I had strange browser effects/caching issues: I did not see several "Age" values in that page (possibly because of a broken cache), and even my script did not detect it. I have already fixed that on my side and cleaned my cache to get a proper view of that page. Sorry for the disturbance; I trusted too much what my small semi-automated tool had collected (but I've not detected where it could have failed to parse the content, so I updated my own data manually). ISO 15924 does not have so much data that it cannot be edited by hand. On Thu, Jul 18, 2019 at 18:10, Ken Whistler wrote: > > On 7/17/2019 4:54 PM, Philippe Verdy via Unicode wrote: > > then the Unicode version (age) used for Hieroglyphs should also be > assigned to Hieratic. > > It is already. > > > In fact the ligature system for the "cursive" Egyptian Hieratic is so > complex (and may also have its own variants showing its progression from > Hieroglyphs to Demotic or Old Coptic), that probably Hieratic should no > longer be considered "unified" with Hieroglyphs, and its existing ISO 15924 > code is then not represented at all in Unicode. > > It *is* considered unified with Egyptian hieroglyphs, until such time as > anyone would make a serious case that the Unicode Standard (and students of > the Egyptian hieroglyphs, in both their classic, monumental forms and in > hieratic) would be better served by a disunification. > > Note that *many* cursive forms of scripts are not easily "supported" by > out-of-the-box plain text implementations, for obvious reasons.
And in the > case of Egyptian hieroglyphs, it would probably be a good strategy to first > get some experience in implementations/fonts supporting the Unicode 12.0 > controls for hieroglyphs, before worrying too much about what does or > doesn't work to represent hieratic texts adequately. (Demotic is clearly a > different case.) > > > For now ISO 15924 still does not consider Egyptian Hieratic to be > "unified" with Egyptian Hieroglyphs; this is not indicated in its > descriptive names given in English or French with a suffix like "(cursive > variant of Egyptian Hieroglyphs)", *and it has no "Unicode Age" version > given, as if it was still not encoded at all by Unicode*, > > The latter part of that statement (highlighted) is false, as is easily > determined by simple inspection of the Egyh entry on: > > https://www.unicode.org/iso15924/iso15924-codes.html > > --Ken > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jul 21 18:03:00 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 Jul 2019 00:03:00 +0100 Subject: Displaying Lines of Text as Line-Broken by a Human Message-ID: <20190722000300.70d4fe83@JRWUBU2> I've been transcribing some Pali text written on palm leaf in the Tai Tham script. I'm looking for a way of reflecting the line boundaries in a manuscript in a transcription. The problem is that lines sometimes start or end with an isolated spacing mark. I want my text to be searchable and therefore encoded in Unicode. (I appreciate that there is a trade-off between searchability and showing line boundaries. The unorthodox spelling is also a problem.) How unreasonable is it for a font to render as just the spacing mark? Some rendering systems give the font no way of distinguishing dotted circles in the backing store from dotted circles added by the renderer, so this technique is not Unicode compliant.
An alternative solution is to have a parallel font (or, more neatly, a feature) that renders some base character (or sequence) as a zero-width non-inking character. This, however, would violate that character's identity. I suspect there is no Unicode-compliant solution. Richard. From unicode at unicode.org Sun Jul 21 22:53:19 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 21 Jul 2019 20:53:19 -0700 Subject: Displaying Lines of Text as Line-Broken by a Human In-Reply-To: <20190722000300.70d4fe83@JRWUBU2> References: <20190722000300.70d4fe83@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jul 22 11:16:23 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 22 Jul 2019 18:16:23 +0200 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? Message-ID: According to Ethnologue, the Eastern Magar language (mgp) is written in two scripts: Devanagari and "Akkha". But the "Akkha" script does not seem to have any ISO 15924 code. The Ethnologue currently assigns a private use code (Qabl) for this script. Was the addition delayed due to lack of evidence (even if this language is official in Nepal and India)? Did the editors of Ethnologue submit an addition request for that script (e.g. for the code "Akkh" or "Akha")? Or is it considered unified with another script, which could explain why it is not coded? If this is a variant it could have its own code (like Nastaliq in Arabic). Or maybe this is just a subset of another (Sino-Tibetan) script? -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Mon Jul 22 11:18:57 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 Jul 2019 17:18:57 +0100 Subject: Displaying Lines of Text as Line-Broken by a Human In-Reply-To: References: <20190722000300.70d4fe83@JRWUBU2> Message-ID: <20190722171857.2a247559@JRWUBU2> On Sun, 21 Jul 2019 20:53:19 -0700 Asmus Freytag via Unicode wrote: > There's really no inherent need for many spacing combining marks to > have a base character. At least the ones that do not reorder and that > don't overhang the base character's glyph. We are in agreement here. > As far as I can tell, it's largely a convention that originally > helped identify clusters and other lack of break opportunities. But > now that we have separate properties for segmentation, it's not > strictly necessary to overload the combining property for that > purpose. Which relates to the separate question I asked about breaking at grapheme boundaries. Interestingly, I'm not seeing breaks next to an invisible stacker, but that may be because Pali subscript consonants only slightly increase the width of the cluster. The need for a base makes sense for reordering spacing marks, but it should serve to detect editing errors, not deliberate effects. An unreordered reordering mark plus consonant is visually ambiguous with consonant plus reordering mark. > In your example, why do you need the ZWJ and dotted circle? The user- and application-supplied text would be . > Originally, just applying a combining mark to a NBSP should normally > show the mark by itself. If a font insists on inserting a dotted > circle glyph, that's not required from a conformance perspective - > just something that's seen as helpful (to most users). It's not the font that inserts the dotted circle, it's the rendering engine. That's why the USE set Tai Tham rendering back several years.
Now, there is at least one renderer (HarfBuzz) for which a cunning font can work out whether the renderer has introduced the dotted circle glyph rather than it being in the text to be rendered. I am looking for a general font-level solution to the problem that would even work on Windows 10. The ZWJ seems a reasonable hint that the space should be rendered with zero width. Do you think it is reasonable for to have zero width contribution from the NBSP when the spacing mark has a non-overhanging glyph? It seems to be an unstandardised area, but zero width might be considered to violate the character identity of NBSP. I also have the problem of visually line-final U+1A6E TAI THAM VOWEL SIGN E, which needs to be separated from a preceding consonant in the backing store. It seems to be particularly common before the holes (two per page) for the string that holds the pages together. Perhaps the scribe tried to avoid line-final U+1A6E. There are examples of these issues in Figure 9b of http://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf . The last syllable of _cattāro_ 'four' straddles lines 2 and 3, with its first glyph (corresponding to SIGN E) ending line 2, and starting line 3. The antepenultimate syllable of _sammodamānehi_ (misspelt _samoddamānehi_) 'pleasing' is split between lines 7 and 8, with line 7 ending in MA and line 8 starting in SIGN AA. I am looking for advice on what is the least bad readily achievable solution. I can then adapt that to cope with the messier issue of the non-spacing character U+1A58 TAI THAM SIGN MAI KANG LAI, which acts like Burmese kinzi in the Pali text I am working on. (If one does not know the font well, one should not put a line break next to it unless all other options are exhausted.) Figure 9b also has an example of this issue. The initial consonant of saṅkhepaṃ (misspelt saṅkheppaṃ) 'collection, summary' is on line 9, while the rest of the word, starting , is on line 10.
There is a weird hack that currently helps with LibreOffice - inserting CGJ turns off some parts of Indic shaping in the rest of the run. Or have I missed some new specification of Indic encoding? This helps with visually line-final SIGN E. Richard. From unicode at unicode.org Mon Jul 22 11:43:30 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 22 Jul 2019 09:43:30 -0700 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? In-Reply-To: References: Message-ID: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> See the entry for "Magar Akkha" on: http://linguistics.berkeley.edu/sei/scripts-not-encoded.html Anshuman Pandey did preliminary research on this in 2011. http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf It would be premature to assign an ISO 15924 script code, pending the research to determine whether this script should be separately encoded. --Ken On 7/22/2019 9:16 AM, Philippe Verdy via Unicode wrote: > According to Ethnologue, the Eastern Magar language (mgp) is written in > two scripts: Devanagari and "Akkha". > > But the "Akkha" script does not seem to have any ISO 15924 code. > > The Ethnologue currently assigns a private use code (Qabl) for this > script. > > Was the addition delayed due to lack of evidence (even if this > language is official in Nepal and India)? > > Did the editors of Ethnologue submit an addition request for that > script (e.g. for the code "Akkh" or "Akha")? > > Or is it considered unified with another script that could explain why > it is not coded? If this is a variant it could have its own code > (like Nastaliq in Arabic). Or maybe this is just a subset of another > (Sino-Tibetan) script? > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Mon Jul 22 12:00:30 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 22 Jul 2019 10:00:30 -0700 Subject: New website In-Reply-To: <02A77E81-0A98-4A1C-816D-150650C335A1@gmail.com> References: <20190722163118.6a131a36@sil-mh8> <9aeff531-ee05-82d0-2892-13c5b173ae18@sonic.net> <02A77E81-0A98-4A1C-816D-150650C335A1@gmail.com> Message-ID: Your helpful suggestions will be passed along to the people working on the new site. In the meantime, please note that the link to the "Unicode Technical Site" has been added to the left column of quick links in the page bottom banner, so it is easily available now from any page on the new site. --Ken On 7/22/2019 9:54 AM, Zachary Carpenter wrote: > It seems that many of the concerns expressed here could be resolved > with a menu link to the "Unicode Technical Site" on the left-hand menu bar From unicode at unicode.org Mon Jul 22 12:06:03 2019 From: unicode at unicode.org (Lorna Evans via Unicode) Date: Mon, 22 Jul 2019 13:06:03 -0400 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? In-Reply-To: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> References: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> Message-ID: Also: https://scriptsource.org/scr/Qabl On Mon, Jul 22, 2019, 12:47 PM Ken Whistler via Unicode wrote: > See the entry for "Magar Akkha" on: > > http://linguistics.berkeley.edu/sei/scripts-not-encoded.html > > Anshuman Pandey did preliminary research on this in 2011. > > http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf > > It would be premature to assign an ISO 15924 script code, pending the > research to determine whether this script should be separately encoded. > > --Ken > On 7/22/2019 9:16 AM, Philippe Verdy via Unicode wrote: > > According to Ethnologue, the Eastern Magar language (mgp) is written in two > scripts: Devanagari and "Akkha". > > But the "Akkha" script does not seem to have any ISO 15924 code.
> > The Ethnologue currently assigns a private use code (Qabl) for this script. > > Was the addition delayed due to lack of evidence (even if this language is > official in Nepal and India)? > > Did the editors of Ethnologue submit an addition request for that script > (e.g. for the code "Akkh" or "Akha")? > > Or is it considered unified with another script, which could explain why it > is not coded? If this is a variant it could have its own code (like > Nastaliq in Arabic). Or maybe this is just a subset of another > (Sino-Tibetan) script? > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jul 22 12:33:59 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 22 Jul 2019 19:33:59 +0200 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? In-Reply-To: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> References: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> Message-ID: On Mon, Jul 22, 2019 at 18:43, Ken Whistler wrote: > See the entry for "Magar Akkha" on: > > http://linguistics.berkeley.edu/sei/scripts-not-encoded.html > > Anshuman Pandey did preliminary research on this in 2011. > That's what I said: 8 years ago already. > http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf > > It would be premature to assign an ISO 15924 script code, pending the > research to determine whether this script should be separately encoded. > And before that, does it mean that texts have to use the "Brah" code for early classification if they are tentatively encoded with Brahmi (and tagged as "mgp-Brah", which should limit the impact, because there's no other evidence that "mgp", the modern language, is related directly to the old Brahmi script, when "mgp" itself did not even exist)? -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Mon Jul 22 12:39:07 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 22 Jul 2019 19:39:07 +0200 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? In-Reply-To: References: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> Message-ID: Also we can note that "mgp" (Eastern Magari) is severely endangered according to multiple sources including Ethnologue and the Linguist List. This is still not the case for Western Magari (mostly in Nepal, not in Sikkim, India), where evidence is probably easier to find (and where the encoding of a new script, and its disunification from Brahmi, may then be more easily justified by modern use, and probably unified with the remaining use for Eastern Magari). On Mon, Jul 22, 2019 at 19:33, Philippe Verdy wrote: > > > On Mon, Jul 22, 2019 at 18:43, Ken Whistler > wrote: > >> See the entry for "Magar Akkha" on: >> >> http://linguistics.berkeley.edu/sei/scripts-not-encoded.html >> >> Anshuman Pandey did preliminary research on this in 2011. >> > > That's what I said: 8 years ago already. > > >> http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf >> >> It would be premature to assign an ISO 15924 script code, pending the >> research to determine whether this script should be separately encoded. >> > And before that, does it mean that texts have to use the "Brah" code for > early classification if they are tentatively encoded with Brahmi (and > tagged as "mgp-Brah", which should limit the impact, because there's no > other evidence that "mgp", the modern language, is related directly to the > old Brahmi script, when "mgp" itself did not even exist)? > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Mon Jul 22 18:14:17 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 22 Jul 2019 16:14:17 -0700 Subject: New website In-Reply-To: References: <20190722163118.6a131a36@sil-mh8> <9aeff531-ee05-82d0-2892-13c5b173ae18@sonic.net> <02A77E81-0A98-4A1C-816D-150650C335A1@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jul 22 19:42:37 2019 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Mon, 22 Jul 2019 17:42:37 -0700 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? In-Reply-To: References: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> Message-ID: As I pointed out in L2/11-144, the "Magar Akkha" script is an appropriation of Brahmi, renamed to link it to the primordialist daydreams of an ethno-linguistic community in Nepal. I have never seen actual usage of the script by Magars. If things have changed since 2011, I would very much welcome such information. Otherwise, the so-called "Magar Akkha" is not suitable for encoding. The Brahmi encoding that we have should suffice. All my best, Anshu > On Jul 22, 2019, at 10:06 AM, Lorna Evans via Unicode wrote: > > Also: https://scriptsource.org/scr/Qabl > > >> On Mon, Jul 22, 2019, 12:47 PM Ken Whistler via Unicode wrote: >> See the entry for "Magar Akkha" on: >> >> http://linguistics.berkeley.edu/sei/scripts-not-encoded.html >> >> Anshuman Pandey did preliminary research on this in 2011. >> >> http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf >> >> It would be premature to assign an ISO 15924 script code, pending the research to determine whether this script should be separately encoded. >> >> --Ken >> >>> On 7/22/2019 9:16 AM, Philippe Verdy via Unicode wrote: >>> According to Ethnologue, the Eastern Magar language (mgp) is written in two scripts: Devanagari and "Akkha". >>> >>> But the "Akkha" script does not seem to have any ISO 15924 code.
>>> >>> The Ethnologue currently assigns a private use code (Qabl) for this script. >>> >>> Was the addition delayed due to lack of evidence (even if this language is official in Nepal and India)? >>> >>> Did the editors of Ethnologue submit an addition request for that script (e.g. for the code "Akkh" or "Akha")? >>> >>> Or is it considered unified with another script, which could explain why it is not coded? If this is a variant it could have its own code (like Nastaliq in Arabic). Or maybe this is just a subset of another (Sino-Tibetan) script? >>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jul 23 00:12:32 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 23 Jul 2019 07:12:32 +0200 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? In-Reply-To: References: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> Message-ID: So can I conclude that what the Ethnologue displays (using a private-use ISO 15924 code "Qabl") is wrong? And that translations classified under "mgp-Brah" are fine (while "mgp-Qabl" would be unusable for interchange)? On Tue, Jul 23, 2019 at 02:42, Anshuman Pandey wrote: > As I pointed out in L2/11-144, the "Magar Akkha" script is an > appropriation of Brahmi, renamed to link it to the primordialist daydreams > of an ethno-linguistic community in Nepal. I have never seen actual usage > of the script by Magars. If things have changed since 2011, I would very > much welcome such information. Otherwise, the so-called "Magar Akkha" is > not suitable for encoding. The Brahmi encoding that we have should suffice.
> > All my best, > Anshu > > On Jul 22, 2019, at 10:06 AM, Lorna Evans via Unicode > wrote: > > Also: https://scriptsource.org/scr/Qabl > > > On Mon, Jul 22, 2019, 12:47 PM Ken Whistler via Unicode < > unicode at unicode.org> wrote: > >> See the entry for "Magar Akkha" on: >> >> http://linguistics.berkeley.edu/sei/scripts-not-encoded.html >> >> Anshuman Pandey did preliminary research on this in 2011. >> >> http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf >> >> It would be premature to assign an ISO 15924 script code, pending the >> research to determine whether this script should be separately encoded. >> >> --Ken >> On 7/22/2019 9:16 AM, Philippe Verdy via Unicode wrote: >> >> According to Ethnologue, the Eastern Magar language (mgp) is written in two >> scripts: Devanagari and "Akkha". >> >> But the "Akkha" script does not seem to have any ISO 15924 code. >> >> The Ethnologue currently assigns a private use code (Qabl) for this >> script. >> >> Was the addition delayed due to lack of evidence (even if this language >> is official in Nepal and India)? >> >> Did the editors of Ethnologue submit an addition request for that script >> (e.g. for the code "Akkh" or "Akha")? >> >> Or is it considered unified with another script, which could explain why it >> is not coded? If this is a variant it could have its own code (like >> Nastaliq in Arabic). Or maybe this is just a subset of another >> (Sino-Tibetan) script? >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jul 23 02:26:24 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 23 Jul 2019 08:26:24 +0100 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? In-Reply-To: References: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> Message-ID: <20190723082624.2ee43ae1@JRWUBU2> On Mon, 22 Jul 2019 17:42:37 -0700 Anshuman Pandey via Unicode wrote: > As I pointed out in L2/11-144, the "Magar Akkha"
script is an > appropriation of Brahmi, renamed to link it to the primordialist > daydreams of an ethno-linguistic community in Nepal. I have never > seen actual usage of the script by Magars. If things have changed > since 2011, I would very much welcome such information. Otherwise, > the so-called "Magar Akkha" is not suitable for encoding. The Brahmi > encoding that we have should suffice. How would mere usage qualify it as a separate script? Richard. From unicode at unicode.org Tue Jul 23 12:40:54 2019 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Tue, 23 Jul 2019 10:40:54 -0700 Subject: Akkha script (used by Eastern Magar language) in ISO 15924? In-Reply-To: <20190723082624.2ee43ae1@JRWUBU2> References: <81107f77-2eff-99ba-5846-98b77b9478dd@sonic.net> <20190723082624.2ee43ae1@JRWUBU2> Message-ID: > On Jul 23, 2019, at 12:26 AM, Richard Wordingham via Unicode wrote: > > On Mon, 22 Jul 2019 17:42:37 -0700 > Anshuman Pandey via Unicode wrote: > >> As I pointed out in L2/11-144, the "Magar Akkha" script is an >> appropriation of Brahmi, renamed to link it to the primordialist >> daydreams of an ethno-linguistic community in Nepal. I have never >> seen actual usage of the script by Magars. If things have changed >> since 2011, I would very much welcome such information. Otherwise, >> the so-called "Magar Akkha" is not suitable for encoding. The Brahmi >> encoding that we have should suffice. > > How would mere usage qualify it as a separate script? I apologize for using the wrong conjunction. Instead of "otherwise" I should have written "nevertheless". All my best, Anshu From unicode at unicode.org Wed Jul 24 21:23:39 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 24 Jul 2019 22:23:39 -0400 Subject: SHEQEL and L2/19-291 Message-ID: An HTML attachment was scrubbed...
URL: From unicode at unicode.org Wed Jul 24 21:52:02 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 25 Jul 2019 02:52:02 +0000 Subject: SHEQEL and L2/19-291 In-Reply-To: References: Message-ID: https://en.wikipedia.org/wiki/Israeli_new_shekel "With the issuing of the third series, the Bank of Israel has adopted the standard English spelling of shekel and plural shekels for its currency.[30] Previously, the Bank had formally used the Hebrew transcriptions of sheqel and sheqalim (from שקלים).[31]" BTW, Google flags "sheqel" in its search box as an incorrect spelling. On 2019-07-25 2:23 AM, Mark E. Shoulson via Unicode wrote: > Just looking at document L2/19-291, > https://www.unicode.org/L2/L2019/19291-missing-currency.pdf "Currency signs > missing in Unicode" by Eduardo Marín Silva. And I'm wondering why he feels it > necessary for the Unicode standard to say that a more correct spelling for the > Israeli currency would be "shekel" (and not "sheqel"). What criterion is being > used that makes this "more correct"? I think it's more popular and common, so > maybe that's it. But historically and linguistically, "sheqel" is more > accurate. The middle letter is ק, U+05E7 HEBREW LETTER QOF (which is not "more > correctly" KOF), from the root שקל Sh.Q.L meaning "weight". It's true that > Modern Hebrew does not distinguish K and Q phonetically in speech; maybe that is > what is meant? Still, the "historical" transliteration of QOF with Q is very > widespread, and I believe occurs even on some coins/bills (could be wrong here; > is this what is meant by "more correct"? That "shekel" is what is used > officially on the currency and I am misremembering?) > > > Just wondering about this, since it seems to be stressed in the document.
> > > ~mark > From unicode at unicode.org Thu Jul 25 01:03:05 2019 From: unicode at unicode.org (Simon Montagu via Unicode) Date: Thu, 25 Jul 2019 09:03:05 +0300 Subject: SHEQEL and L2/19-291 In-Reply-To: References: Message-ID: <26870365-69f9-b8e4-9b6b-489fd3753b01@smontagu.org> Unicode uses Q consistently to transcribe U+05E7 in the names of other Hebrew characters, e.g. U+0594 HEBREW ACCENT ZAQEF QATAN, U+05B8 HEBREW POINT QAMATS and several others. The official English name of the currency was "New Sheqel" at the time that U+20AA was encoded in Unicode. I don't think that "shekel" should be described as "more correct"; at most it should be given as an alternative spelling. There has been a trend in Israel in recent years to use K instead of Q, because Q is considered "confusing for foreigners", though it escapes me how blurring the distinction between two different characters (QOF and KAF) makes things *less* confusing. The Academy of the Hebrew Language has followed this trend to a limited extent, by introducing "simplified transcription rules for signs and maps" in which U+05E7 is to be transcribed by K, but in the official rules for "precise transcription" Q is still used. See https://hebrew-academy.org.il/%D7%9B%D7%9C%D7%9C%D7%99-%D7%94%D7%AA%D7%A2%D7%AA%D7%99%D7%A7/ (Hebrew) On 25.7.2019 5:23, Mark E. Shoulson via Unicode wrote: > Just looking at document L2/19-291, > https://www.unicode.org/L2/L2019/19291-missing-currency.pdf "Currency > signs missing in Unicode" by Eduardo Marín Silva. And I'm wondering why > he feels it necessary for the Unicode standard to say that a more > correct spelling for the Israeli currency would be "shekel" (and not > "sheqel"). What criterion is being used that makes this "more > correct"? I think it's more popular and common, so maybe that's it. > But historically and linguistically, "sheqel" is more accurate. The > middle letter is ק, U+05E7 HEBREW LETTER QOF (which is not "more > correctly" KOF), from the root שקל
Sh.Q.L meaning "weight". It's true > that Modern Hebrew does not distinguish K and Q phonetically in speech; > maybe that is what is meant? Still, the "historical" transliteration of > QOF with Q is very widespread, and I believe occurs even on some > coins/bills (could be wrong here; is this what is meant by "more > correct"? That "shekel" is what is used officially on the currency and I > am misremembering?) > > > Just wondering about this, since it seems to be stressed in the document. > > > ~mark >
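The point about the formal character names can be checked directly with Python's standard-library unicodedata module (a quick sketch; the module reflects whatever UCD version ships with the interpreter, but character names are immutable once assigned, so these are stable):

```python
import unicodedata

# The formal Unicode names use Q for qof throughout,
# and the currency sign itself carries the "sheqel" spelling.
for cp in ("\u05E7", "\u20AA", "\u05B8", "\u0594"):
    print(f"U+{ord(cp):04X}  {unicodedata.name(cp)}")
# U+05E7  HEBREW LETTER QOF
# U+20AA  NEW SHEQEL SIGN
# U+05B8  HEBREW POINT QAMATS
# U+0594  HEBREW ACCENT ZAQEF QATAN
```

Because of the name stability policy, any later preference for "shekel" could only appear as an informative alias, never as a change to these names.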