From 747.neutron at gmail.com Wed Dec 1 22:33:54 2021
From: 747.neutron at gmail.com (Wáng Yifán)
Date: Thu, 2 Dec 2021 13:33:54 +0900
Subject: Directionality controls for malicious code
In-Reply-To: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com>
Message-ID:

> If not, and since there are relatively few scripts of RtoL characters,
> is there any legitimate use of BiDi controls outside of script runs of
> those scripts.

I feel in this paragraph you assume that every script is either LTR or RTL, but at least CJK scripts are allowed to be written both LTR and RTL (although defaulted as LTR).

> If not, then could the Bidi control characters be made to have their scx
> property value be all the RtoL scripts, and software such as git could
> warn or forbid text of mixed scripts?

It's rather useful to warn IMO, but prohibition is unrealistic considering that most modern rich text formats employ ASCII characters for format control. For instance, if somebody wants to show an Arabic snippet surrounded by HTML tags inside an otherwise English comment (or vice versa), I bet the primitive bidi algorithm, which doesn't understand that <...> is one consecutive piece of HTML grammar, will mess up the graphical order with 150% probability*, making it hardly readable without bidi controls.

* A character has a chance to be misrendered one more time after the first misrendering.

On Wed, Dec 1, 2021 at 3:42, Karl Williamson via Unicode wrote:
>
> It is possible to make text appear to be other than what it really is by
> using BiDi controls.
>
> Such text may be in the form of computer code, which could allow a
> trojan horse attack by sneaking stuff past human code reviewers.
>
> I have not studied the BiDi algorithm, so this may be naive.
>
> Is there any legitimate use of BiDi controls in text that doesn't have a
> mixture of LtoR and RtoL strings?
>
> If not, and since there are relatively few scripts of RtoL characters,
> is there any legitimate use of BiDi controls outside of script runs of
> those scripts.
>
> If not, then could the Bidi control characters be made to have their scx
> property value be all the RtoL scripts, and software such as git could
> warn or forbid text of mixed scripts?
>
> Or could a new property be created that allowed for machine detection of
> malicious use?
>
> Karl Williamson

From sosipiuk at gmail.com Wed Dec 1 23:27:06 2021
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 2 Dec 2021 00:27:06 -0500
Subject: Directionality controls for malicious code
In-Reply-To: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com>
Message-ID: <005601d7e73d$3f3c0b80$bdb42280$@gmail.com>

The burden of guarding against BiDi misuse should be on the programming languages and/or their compilers. I'm not sure why this hasn't been widely implemented yet. At minimum any BiDi controls within a source file should emit a warning during compilation, with compiler options available to error on any mixture of LTR and RTL text, or to whitelist specific files which are known to contain such a mixture with a valid cause, etc.

There is nothing that can be done at the Unicode level to cater to coding languages that the coding languages can't do themselves via their own specifications and tools. Indeed it is far more appropriate for BiDi warnings and prohibitions to be tailored to the syntax of each language. (E.g. it may be generally "okay" for a line containing only a comment to mix directionality, but not for a line containing both code and comment).

Sławomir Osipiuk

From duerst at it.aoyama.ac.jp Thu Dec 2 00:35:17 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 2 Dec 2021 15:35:17 +0900
Subject: Directionality controls for malicious code
In-Reply-To: <005601d7e73d$3f3c0b80$bdb42280$@gmail.com>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com>
Message-ID:

I think it's correct that it is not possible to fix this in Unicode itself. But that doesn't mean that it should be checked/warned by compilers.

There are many other tools involved, in particular editors. There are probably way less serious editors than programming languages. Editors can clearly show problematic characters, so that users can decide whether they are dangerous or necessary (or both).

Regards, Martin.

On 2021-12-02 14:27, Sławomir Osipiuk via Unicode wrote:
> The burden of guarding against BiDi misuse should be on the programming languages and/or their compilers. I'm not sure why this hasn't been widely implemented yet. At minimum any BiDi controls within a source file should emit a warning during compilation, with compiler options available to error on any mixture of LTR and RTL text, or to whitelist specific files which are known to contain such a mixture with a valid cause, etc.
>
> There is nothing that can be done at the Unicode level to cater to coding languages that the coding languages can't do themselves via their own specifications and tools. Indeed it is far more appropriate for BiDi warnings and prohibitions to be tailored to the syntax of each language. (E.g. it may be generally "okay" for a line containing only a comment to mix directionality, but not for a line containing both code and comment).
>
> Sławomir Osipiuk
>

From doug at ewellic.org Thu Dec 2 01:24:37 2021
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 2 Dec 2021 00:24:37 -0700
Subject: Directionality controls for malicious code
In-Reply-To:
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com>
Message-ID: <001501d7e74d$a9f0d630$fdd28290$@ewellic.org>

Martin J. Dürst wrote:

> There are many other tools involved, in particular editors. There are
> probably way less serious editors than programming languages. Editors
> can clearly show problematic characters, so that users can decide
> whether they are dangerous or necessary (or both).

Given the publicity surrounding the "Trojan Source" paper, I'd be surprised if someone weren't already working on a Visual Studio Code extension that flags bidi controls in the editor window. It might already be available, for all I know.

Going into a panic and writing this into programming language specifications is what doesn't need to happen.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From duerst at it.aoyama.ac.jp Thu Dec 2 01:43:18 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 2 Dec 2021 16:43:18 +0900
Subject: Directionality controls for malicious code
In-Reply-To: <001501d7e74d$a9f0d630$fdd28290$@ewellic.org>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org>
Message-ID: <19c344c5-184e-1e66-8694-620abf4bf3c5@it.aoyama.ac.jp>

Hello Doug, others,

On 2021-12-02 16:24, Doug Ewell via Unicode wrote:
> Martin J. Dürst wrote:
>
>> There are many other tools involved, in particular editors. There are
>> probably way less serious editors than programming languages. Editors
>> can clearly show problematic characters, so that users can decide
>> whether they are dangerous or necessary (or both).
>
> Given the publicity surrounding the "Trojan Source" paper, I'd be surprised if someone weren't already working on a Visual Studio Code extension that flags bidi controls in the editor window. It might already be available, for all I know.

Here you go:
https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters

Regards, Martin.

> Going into a panic and writing this into programming language specifications is what doesn't need to happen.
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>

From eliz at gnu.org Thu Dec 2 02:06:25 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 02 Dec 2021 10:06:25 +0200
Subject: Directionality controls for malicious code
In-Reply-To: <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> (message from Sławomir Osipiuk via Unicode on Thu, 2 Dec 2021 00:27:06 -0500)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com>
Message-ID: <83zgpjph1a.fsf@gnu.org>

> Date: Thu, 2 Dec 2021 00:27:06 -0500
> From: Sławomir Osipiuk via Unicode
>
> The burden of guarding against BiDi misuse should be on the programming languages and/or their compilers. I'm not sure why this hasn't been widely implemented yet. At minimum any BiDi controls within a source file should emit a warning during compilation, with compiler options available to error on any mixture of LTR and RTL text, or to whitelist specific files which are known to contain such a mixture with a valid cause, etc.

Such warnings should not be blindly emitted for bidi controls within comments and strings, since that is human-readable text, where those controls are completely legitimate. At least the naïve warning for any occurrence of these controls should be avoided in those cases, because it is likely to be a false positive, especially when a program is intended to use RTL scripts.
There are many projects that require to compile without any warnings, or treat warnings as errors, and those won't compile with such "draconian" compilers.

Smart discovery of questionable usage of directional controls is possible, and such warnings, even in comments and strings, should employ that. But it is harder to implement, and requires some minimal understanding of UAX#9 and its use of explicit directional controls.

> There is nothing that can be done at the Unicode level to cater to coding languages that the coding languages can't do themselves via their own specifications and tools. Indeed it is far more appropriate for BiDi warnings and prohibitions to be tailored to the syntax of each language. (E.g. it may be generally "okay" for a line containing only a comment to mix directionality, but not for a line containing both code and comment).

Yes. But there's more to it than just syntax. For example, directional controls that push weak or neutral characters one embedding level could be okay inside a comment, but if the embedding level is pushed to higher levels, that is suspicious.

The problem is that compilers which do implement such warnings generally emit them whenever they see such codepoints, disregarding the context and bidi-specific knowledge (because it's much easier), and the result is completely unacceptable for programs that need to communicate in RTL scripts.

From eliz at gnu.org Thu Dec 2 02:13:50 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 02 Dec 2021 10:13:50 +0200
Subject: Directionality controls for malicious code
In-Reply-To: (message from Martin J. Dürst via Unicode on Thu, 2 Dec 2021 15:35:17 +0900)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com>
Message-ID: <83y253pgox.fsf@gnu.org>

> Date: Thu, 2 Dec 2021 15:35:17 +0900
> From: Martin J. Dürst via Unicode
>
> There are many other tools involved, in particular editors.
> There are
> probably way less serious editors than programming languages. Editors
> can clearly show problematic characters, so that users can decide
> whether they are dangerous or necessary (or both).

Just showing the bidi controls to the user will not necessarily allow the user to make that decision. Most users don't have a working understanding of the UBA, even if they do use RTL scripts. The editor should avoid making these controls stand out unless their use in the specific context is highly questionable, and it should provide some clear enough explanation for the users to understand the issue. For example, since most editors provide logical-order cursor movement, suggesting that the user moves the cursor across the problematic text, one character at a time, could go a long way towards the goal of providing the user with pertinent information to make an informed decision.

From eliz at gnu.org Thu Dec 2 02:24:23 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 02 Dec 2021 10:24:23 +0200
Subject: Directionality controls for malicious code
In-Reply-To: <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> (message from Doug Ewell via Unicode on Thu, 2 Dec 2021 00:24:37 -0700)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org>
Message-ID: <83v907pg7c.fsf@gnu.org>

> Date: Thu, 2 Dec 2021 00:24:37 -0700
> From: Doug Ewell via Unicode
>
> Martin J. Dürst wrote:
>
> > There are many other tools involved, in particular editors. There are
> > probably way less serious editors than programming languages. Editors
> > can clearly show problematic characters, so that users can decide
> > whether they are dangerous or necessary (or both).
>
> Given the publicity surrounding the "Trojan Source" paper, I'd be surprised if someone weren't already working on a Visual Studio Code extension that flags bidi controls in the editor window.
> It might already be available, for all I know.

Why are you thinking about the proprietary VS and not about Emacs? ;-)

> Going into a panic and writing this into programming language specifications is what doesn't need to happen.

Blindly showing these controls wherever they are should not happen, either, because most of their uses are not malicious. The tests must be smarter than just looking at the codepoint, they should also look at the surrounding text and examine the effect of those directional controls on that text.

From eliz at gnu.org Thu Dec 2 02:35:35 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 02 Dec 2021 10:35:35 +0200
Subject: Directionality controls for malicious code
In-Reply-To: <19c344c5-184e-1e66-8694-620abf4bf3c5@it.aoyama.ac.jp> (message from Martin J. Dürst via Unicode on Thu, 2 Dec 2021 16:43:18 +0900)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <19c344c5-184e-1e66-8694-620abf4bf3c5@it.aoyama.ac.jp>
Message-ID: <83sfvbpfoo.fsf@gnu.org>

> Date: Thu, 2 Dec 2021 16:43:18 +0900
> From: Martin J. Dürst via Unicode
>
> > Given the publicity surrounding the "Trojan Source" paper, I'd be surprised if someone weren't already working on a Visual Studio Code extension that flags bidi controls in the editor window. It might already be available, for all I know.
>
> Here you go:
> https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters

That's lip service, IMNSHO. It's even against UAX#9, which says those controls should be invisible.
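
For reference, the codepoint-only check being debated in the messages above amounts to roughly the following. This is a minimal Python sketch, not the code of any editor or compiler mentioned in the thread; the function name find_bidi_controls is made up for illustration, and the set covers only the explicit directional formatting characters of UAX #9. It looks at the codepoint alone and ignores context, which is exactly the limitation Eli objects to.

```python
import unicodedata

# Explicit directional formatting characters (UAX #9).
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # LRE RLE PDF LRO RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI RLI FSI PDI
}


def find_bidi_controls(source):
    """Return (line, column, name) for every bidi control found.

    Hypothetical helper: it reports every occurrence and makes no
    attempt to decide whether the use is legitimate.
    """
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits
```

A caller would print the returned positions as warnings; a comment written in Hebrew or Arabic that legitimately uses an isolate is reported just like a malicious override.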

From daniel.buenzli at erratique.ch Thu Dec 2 09:19:12 2021
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 2 Dec 2021 16:19:12 +0100
Subject: Directionality controls for malicious code
In-Reply-To: <83v907pg7c.fsf@gnu.org>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org>
Message-ID:

On 2 December 2021 at 09:24:23, Eli Zaretskii via Unicode (unicode at corp.unicode.org) wrote:

> Blindly showing these controls wherever they are should not happen,
> either, because most of their uses are not malicious. The tests must
> be smarter than just looking at the codepoint, they should also look
> at the surrounding text and examine the effect of those directional
> controls on that text.

I agree with Eli and I think programming language specifications should say something about it.

We need a formal criterion that allows to check that a given span of characters in logical order does not visually overflow those characters that precede or succeed them. This check can then be applied on the content of the various syntactic constructs of your language (e.g. string literals, comments, etc.) and you report a syntax error if there's a visual overflow. This makes sure no text is allowed to visually escape the boundaries it's supposed to be confined to.

I'm not familiar enough with the bidi algorithm but for example it seems that unbounded RLO or RLI in a span should be forbidden unless they are properly balanced with a matching PDI or PDF (if you happen to need that imbalance for your program in your string literals just use your Unicode character escape notation).

But I'm sure the problem is much more complex than that and I'd be curious if people in the know of the algorithm have an idea on how to go about it.

There's also likely quite a few other security contexts where such a check could be useful (e.g.
untrusted user input).

Best,

Daniel

From sosipiuk at gmail.com Thu Dec 2 11:10:40 2021
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 2 Dec 2021 12:10:40 -0500
Subject: Directionality controls for malicious code
In-Reply-To: <83v907pg7c.fsf@gnu.org>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org>
Message-ID:

Replying to several messages here:

On Thu, Dec 2, 2021 at 1:35 AM Martin J. Dürst wrote:
> There are many other tools involved, in particular editors. There are
> probably way less serious editors than programming languages. Editors
> can clearly show problematic characters, so that users can decide
> whether they are dangerous or necessary (or both).

It's better to do this at the language/compiler level because the effects of BiDi "trickery" will vary with language, not with the editor. The editor cannot be relied on to help in this instance, because any contributor may decide that the one-line change he wants to add to a giant project can be done with Notepad. The compiler should know when code, not string contents or comments, is being manipulated with RTL controls.

On Thu, Dec 2, 2021 at 2:27 AM Doug Ewell via Unicode wrote:
> Going into a panic and writing this into programming language specifications is what doesn't need to happen.

No one is advising panic.

On Thu, Dec 2, 2021 at 3:27 AM Eli Zaretskii via Unicode wrote:
> Blindly showing these controls wherever they are should not happen,
> either, because most of their uses are not malicious.

Yes, it should. This is not general prose intended to look nice. It's a programming language demanding precision where a one-character typo can majorly change functionality. The "users" in this case are assumed to be a (relatively) specialist technical audience. Clarity of "what's happening" outweighs other considerations.

> There are many projects that require to compile without any warnings, or treat warnings as errors, and those won't compile with such "draconian" compilers.

Which is why I mentioned that whitelisting, or some method of suppressing the warnings, i.e. an "I know what I'm doing" option, should also be added. But it should not be the default behavior. This is a classic security vs. usability tradeoff, but I think you're overestimating the amount of projects this would actually cause problems for.

On Thu, Dec 2, 2021 at 3:38 AM Eli Zaretskii via Unicode wrote:
> It's even against UAX#9, which says those
> controls should be invisible.

That rule should be ignored when it is counterproductive in a specialist context.

On Thu, Dec 2, 2021 at 10:22 AM Daniel Bünzli via Unicode wrote:
> We need a formal criterion that allows to check that a given span of characters in logical order does not visually overflow those characters that precede or succeed them.

Yes, this is ideal. The problem is that Unicode doesn't "understand" that string-terminating or comment-introducing characters in any given programming language should reset the directionality. That's why the solution must be at the same level that gives meaning to strings and comments (and variables, etc.) i.e. the programming language itself.

> (if you happen to need that imbalance for your program in your string literals just use your Unicode character escape notation).

Yes. It makes perfect sense for control characters to be permitted only as escape sequences. This is already common, if not required, in many cases. I've seen plenty of "\r\n" in strings, and no one complains that it doesn't look good, it's just how it's done.
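
The "escape sequences only" idea from the message above can be illustrated with a small Python sketch. It is an assumption-laden example, not a proposal from the thread: the helper name escape_bidi_controls is hypothetical, and the \uXXXX notation is only one possible escape syntax, since each programming language defines its own.

```python
# Explicit directional formatting characters (UAX #9):
# LRE RLE PDF LRO RLO, then LRI RLI FSI PDI.
BIDI_CONTROLS = "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"


def escape_bidi_controls(text):
    """Rewrite each raw bidi control as a visible \\uXXXX escape.

    Hypothetical helper; the escape syntax is an assumption and would
    have to match the target language's string-literal grammar.
    """
    return "".join(
        "\\u{:04X}".format(ord(ch)) if ch in BIDI_CONTROLS else ch
        for ch in text
    )
```

Applied to the contents of a string literal, this leaves the program's behaviour unchanged in languages that accept such escapes, while removing any reordering effect from the source as displayed.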

From eliz at gnu.org Thu Dec 2 12:31:28 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 02 Dec 2021 20:31:28 +0200
Subject: Directionality controls for malicious code
In-Reply-To: (message from Daniel Bünzli on Thu, 2 Dec 2021 16:19:12 +0100)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org>
Message-ID: <834k7qamf3.fsf@gnu.org>

> Date: Thu, 2 Dec 2021 16:19:12 +0100
> From: Daniel Bünzli
> Cc: unicode at corp.unicode.org
>
> I'm not familiar enough with the bidi algorithm but for example it seems that unbounded RLO or RLI in a span should be forbidden unless they are properly balanced with a matching PDI or PDF

The UBA mandates that all embeddings end at paragraph end, i.e. at a newline. So unterminated embeddings and isolates behave exactly as terminated ones do, and requiring the embeddings and isolates to be properly terminated will only catch sloppy malicious tinkering with these controls, it won't catch the non-sloppy ones.

> But I'm sure the problem is much more complex than that and I'd be curious if people in the know of the algorithm have an idea on how to go about it.

I did have some ideas, and implemented detection of suspicious reordering for Emacs.
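
The balance check discussed in the message above could be sketched in Python as follows. This is not the Emacs implementation Eli mentions; the function name unbalanced_controls is invented for illustration, and the pairing logic is a simplified reading of how PDF and PDI match their initiators. As the message points out, passing this check proves little, because a properly terminated sequence can reorder text just as effectively.

```python
EMBEDDING_OPENERS = "\u202A\u202B\u202D\u202E"  # LRE RLE LRO RLO
ISOLATE_OPENERS = "\u2066\u2067\u2068"          # LRI RLI FSI
PDF, PDI = "\u202C", "\u2069"


def unbalanced_controls(line):
    """Return (unterminated_openers, stray_terminators) for one line.

    Simplified sketch: a PDF closes the innermost embedding or override,
    a PDI closes the innermost isolate and anything opened inside it.
    """
    stack, stray = [], 0
    for ch in line:
        if ch in EMBEDDING_OPENERS:
            stack.append("E")
        elif ch in ISOLATE_OPENERS:
            stack.append("I")
        elif ch == PDF:
            if stack and stack[-1] == "E":
                stack.pop()
            else:
                stray += 1
        elif ch == PDI:
            if "I" in stack:
                # Pop everything up to and including the matching isolate.
                while stack.pop() != "I":
                    pass
            else:
                stray += 1
    return len(stack), stray
```

The check is applied per line because, per the message above, the UBA ends all embeddings at the paragraph end.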

From eliz at gnu.org Thu Dec 2 12:43:12 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 02 Dec 2021 20:43:12 +0200
Subject: Directionality controls for malicious code
In-Reply-To: (message from Sławomir Osipiuk via Unicode on Thu, 2 Dec 2021 12:10:40 -0500)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org>
Message-ID: <8335naalvj.fsf@gnu.org>

> Date: Thu, 2 Dec 2021 12:10:40 -0500
> Cc: Unicode
> From: Sławomir Osipiuk via Unicode
>
> > Blindly showing these controls wherever they are should not happen,
> > either, because most of their uses are not malicious.
>
> Yes, it should. This is not general prose intended to look nice.

Comments and strings in a program _are_ general prose.

> The "users" in this case are assumed to be a (relatively) specialist
> technical audience.

The vast majority of those professionals have no idea about what the UBA does, how the bidi control characters work, and what they are for. So them being specialists doesn't help in this matter.

> > There are many projects that require to compile without any warnings, or treat warnings as errors, and those won't compile with such "draconian" compilers.
>
> Which is why I mentioned that whitelisting, or some method of
> suppressing the warnings, i.e. an "I know what I'm doing" option,
> should also be added.

Many projects frown on such measures, and some outright prohibit them. Try telling your QA person that you want to suppress warnings because they annoy you.

> But it should not be the default behavior.

If it is not the default, chances are it will seldom or never be turned on.

> I think you're overestimating the amount of projects this would
> actually cause problems for.

That depends on the audience for which you are writing programs. In some locales around the world the number of projects for which this could be a problem is very large.

> > It's even against UAX#9, which says those
> > controls should be invisible.
>
> That rule should be ignored when it is counterproductive in a
> specialist context.

You are in a Unicode forum, and you are arguing for ignoring its rules?

> > We need a formal criterion that allows to check that a given span of characters in logical order does not visually overflow those characters that precede or succeed them.
>
> Yes, this is ideal. The problem is that Unicode doesn't "understand"
> that string-terminating or comment-introducing characters in any given
> programming language should reset the directionality. That's why the
> solution must be at the same level that gives meaning to strings and
> comments (and variables, etc.) i.e. the programming language itself.

That's a worthy goal, but I think it isn't easy to achieve. We could instead employ a simpler, language-independent heuristic, based on the bidi context of those control characters. For example, if weak characters of class EN or neutral characters of class ON have their embedding level pushed too high (where "too high" depends on the base paragraph direction), it becomes suspicious and can be flagged.

> Yes. It makes perfect sense for control characters to be permitted
> only as escape sequences.

That could be a solution for strings, but not for comments. And even in strings, using escapes makes the strings much harder to read and proofread.
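
A rough Python approximation of the heuristic suggested in the message above is sketched below. Several things here are assumptions rather than anything stated in the thread: the function name suspicious_neutrals, the default threshold max_depth=1, and above all the use of a plain nesting counter in place of the embedding levels that UAX #9 actually computes (which also depend on the base paragraph direction). It only shows the shape of the idea: flag EN and ON characters that sit deep inside explicit directional controls.

```python
import unicodedata

OPENERS = "\u202A\u202B\u202D\u202E\u2066\u2067\u2068"  # LRE RLE LRO RLO LRI RLI FSI
CLOSERS = "\u202C\u2069"                                # PDF PDI


def suspicious_neutrals(line, max_depth=1):
    """Return (index, char, depth) for EN/ON characters nested too deeply.

    'depth' counts enclosing explicit controls; it is a stand-in for the
    real UBA embedding level, not an implementation of it.
    """
    depth, hits = 0, []
    for index, ch in enumerate(line):
        if ch in OPENERS:
            depth += 1
        elif ch in CLOSERS:
            depth = max(0, depth - 1)
        elif depth > max_depth and unicodedata.bidirectional(ch) in ("EN", "ON"):
            hits.append((index, ch, depth))
    return hits
```

With the default threshold, a digit or brace wrapped in a single isolate is accepted, while the same characters inside an override nested in an isolate are reported.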

From mark at macchiato.com Thu Dec 2 12:51:19 2021
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 2 Dec 2021 10:51:19 -0800
Subject: Directionality controls for malicious code
In-Reply-To: <834k7qamf3.fsf@gnu.org>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org>
Message-ID:

The UBA explicitly carves out room for specialized text handling in https://unicode.org/reports/tr9/#Higher-Level_Protocols. The goal of that is to allow editors to handle bidi ordering in a sensible (and not misleading) fashion in environments such as programming language editing, specifically so that tokens are 'self-contained' and the ordering among tokens is clear.

(There needs, however, to be more and clearer examples and guidance in the UBA, #31, #36, and #39.)

Mark

On Thu, Dec 2, 2021 at 10:33 AM Eli Zaretskii via Unicode <unicode at corp.unicode.org> wrote:

> > Date: Thu, 2 Dec 2021 16:19:12 +0100
> > From: Daniel Bünzli
> > Cc: unicode at corp.unicode.org
> >
> > I'm not familiar enough with the bidi algorithm but for example it seems
> that unbounded RLO or RLI in a span should be forbidden unless they are
> properly balanced with a matching PDI or PDF
>
> The UBA mandates that all embeddings end at paragraph end, i.e. at a
> newline. So unterminated embeddings and isolates behave exactly as
> terminated ones do, and requiring the embeddings and isolates to be
> properly terminated will only catch sloppy malicious tinkering with
> these controls, it won't catch the non-sloppy ones.
>
> > But I'm sure the problem is much more complex than that and I'd be
> curious if people in the know of the algorithm have an idea on how to go
> about it.
>
> I did have some ideas, and implemented detection of suspicious
> reordering for Emacs.

From daniel.buenzli at erratique.ch Thu Dec 2 17:43:13 2021
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 3 Dec 2021 00:43:13 +0100
Subject: Directionality controls for malicious code
In-Reply-To:
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org>
Message-ID:

On 2 December 2021 at 19:51:19, Mark Davis ☕️ via Unicode (unicode at corp.unicode.org) wrote:

> The UBA explicitly carves out room for specialized text handling in
> https://unicode.org/reports/tr9/#Higher-Level_Protocols. The goal of that
> is to allow editors to handle bidi ordering in a sensible (and not
> misleading) fashion in environments such as programming language editing,
> specifically so that tokens are 'self-contained' and the ordering among
> tokens is clear.

I would prefer if that was a property we could check/enforce on spans of the Unicode text itself. In my opinion using a viewer that uses a special UBA is not really a good solution, if not a solution at all (e.g. if you want to check these properties when you embed user generated content to be rendered via a browser).

On 2 December 2021 at 18:10:40, Sławomir Osipiuk via Unicode (unicode at corp.unicode.org) wrote:

> Yes, this is ideal. The problem is that Unicode doesn't "understand"
> that string-terminating or comment-introducing characters
> in any given programming language should reset the directionality.

Indeed directionality reset is precisely what I would like to be able to detect or enforce for arbitrary spans of Unicode text. Basically I think it would be nice to have:

1) An algorithm that given text and a span therein determines if the span visually overflows its own content.

2) An algorithm that given text and a span therein returns a new span of text with the same textual content but with additional bidi control characters that make sure the span is visually contained to its content in the given text.

Formulated differently: how can we make sure arbitrary spans of Unicode text behave, as far as UBA is concerned, as a self-contained paragraph.

Best,

Daniel

From mark at macchiato.com Thu Dec 2 18:31:41 2021
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 2 Dec 2021 16:31:41 -0800
Subject: Directionality controls for malicious code
In-Reply-To:
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org>
Message-ID:

I think those are good suggestions.

Note that that section doesn't necessarily mean that a special UBA algorithm is used; the results could be accomplished by modifying the line before displaying it. It sounds like the text isn't clear about that.

Some things I think are fairly easy to do irrespective of the compiler; for example, I think it would be safe to forbid all unescaped stateful bidi controls in source code. And that eliminates a significant class of potential issues, but not all.

As to your #1 and #2:

#1. An algorithm to guarantee that tokens are self-contained wouldn't be too hard. It would take something like a line plus token boundaries and return which tokens (if any) are broken in display. (For performance reasons you probably wouldn't want to do each token span separately.)

#2. By using bidi isolates, it is pretty easy to mark-up the text so that you get a consistent order of tokens when applying the UBA. Any editing of the result could get pretty surprising for users, however.

Mark

On Thu, Dec 2, 2021 at 3:43 PM Daniel Bünzli wrote:

> On 2 December 2021 at 19:51:19, Mark Davis ☕️
> via Unicode (
> unicode at corp.unicode.org) wrote:
>
> > The UBA explicitly carves out room for specialized text handling in
> > https://unicode.org/reports/tr9/#Higher-Level_Protocols. The goal of
> that
> > is to allow editors to handle bidi ordering in a sensible (and not
> > misleading) fashion in environments such as programming language editing,
> > specifically so that tokens are 'self-contained' and the ordering among
> > tokens is clear.
>
> I would prefer if that was a property we could check/enforce on spans of
> the Unicode text itself. In my opinion using a viewer that uses a special
> UBA is not really a good solution, if not a solution at all (e.g. if you
> want to check these properties when you embed user generated content to be
> rendered via a browser).
>
> On 2 December 2021 at 18:10:40, Sławomir Osipiuk via Unicode (
> unicode at corp.unicode.org) wrote:
>
> > Yes, this is ideal. The problem is that Unicode doesn't "understand"
> > that string-terminating or comment-introducing characters
> > in any given programming language should reset the directionality.
>
> Indeed directionality reset is precisely what I would like to be able to
> detect or enforce for arbitrary spans of Unicode text. Basically I think it
> would be nice to have:
>
> 1) An algorithm that given text and a span therein determines if the span
> visually overflows its own content.
>
> 2) An algorithm that given text and a span therein returns a new span of
> text with the same textual content but with additional bidi control
> characters that make sure the span is visually contained to its content in
> the given text.
>
> Formulated differently: how can we make sure arbitrary spans of Unicode
> text behave, as far as UBA is concerned, as a self-contained paragraph.
>
> Best,
>
> Daniel
>
From eliz at gnu.org Fri Dec 3 01:22:00 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Fri, 03 Dec 2021 09:22:00 +0200
Subject: Directionality controls for malicious code
In-Reply-To: (message from Daniel Bünzli via Unicode on Fri, 3 Dec 2021 00:43:13 +0100)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org>
Message-ID: <83ilw6886f.fsf@gnu.org>

> Date: Fri, 3 Dec 2021 00:43:13 +0100
> Cc: unicode at corp.unicode.org
> From: Daniel Bünzli via Unicode
>
> > Yes, this is ideal. The problem is that Unicode doesn't "understand"
> > that string-terminating or comment-introducing characters
> > in any given programming language should reset the directionality.
>
> Indeed directionality reset is precisely what I would like to be able to detect or enforce for arbitrary spans of Unicode text.

I don't see how it would help. For example, if you examine the examples provided in that paper, you will see that the directional format controls were inserted inside comments, but in a way that made parts of the comments look like part of the code.

> Basically I think it would be nice to have:
>
> 1) An algorithm that given text and a span therein determines if the span visually overflows its own content.

What do you mean by "visually overflows"?

> 2) An algorithm that given text and a span therein returns a new span of text with the same textual content but with additional bidi control characters that make sure the span is visually contained to its content in the given text.

This is not clear, either.

> Formulated differently: how can we make sure arbitrary spans of Unicode text behave, as far as UBA is concerned, as a self-contained paragraph.

"Self-contained" in what sense?
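[Editorial sketch, not part of any message in the thread.] The suggestion made earlier in the thread, to forbid all unescaped stateful bidi controls in source code, needs no implementation of the UBA at all. A minimal Python sketch follows, assuming that "stateful" means the embedding, override and isolate controls together with their terminators, and that the plain marks (LRM, RLM, ALM) are out of scope. The function name `find_bidi_controls` is made up for illustration.

```python
import unicodedata

# Stateful bidi formatting characters: embeddings, overrides, isolates and
# their terminators. LRM/RLM/ALM are deliberately left out (an assumption of
# this sketch), since they are plain marks that open no scope.
STATEFUL_BIDI_CONTROLS = frozenset(
    "\u202A\u202B\u202C\u202D\u202E"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)


def find_bidi_controls(source: str):
    """Return (line, column, name) for each stateful bidi control in source.

    Lines and columns are 1-based; columns count code points, not bytes.
    """
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in STATEFUL_BIDI_CONTROLS:
                hits.append((line_no, col_no, unicodedata.name(ch)))
    return hits
```

A tool could report each hit as a warning or an error. An escaped form such as the six ASCII characters `\u202E` in a string literal is not reported, because only the literal control characters are matched.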
From daniel.buenzli at erratique.ch Fri Dec 3 03:00:21 2021
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 3 Dec 2021 10:00:21 +0100
Subject: Directionality controls for malicious code
In-Reply-To: <83ilw6886f.fsf@gnu.org>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org>
Message-ID:

On 3 December 2021 at 08:22:00, Eli Zaretskii via Unicode (unicode at corp.unicode.org) wrote:

> I don't see how it would help. For example, if you examine the
> examples provided in that paper, you will see that the directional
> format controls were inserted inside comments, but in a way that made
> parts of the comments look like part of the code.

Yes. The idea is for the grammar of your language to disallow visual reorderings across certain textual boundaries specific to your language. If you take C multi-line comments /* … */ the idea is that:

1. No text logically between the /* and */ should visually be able to get on the left of /*
2. No text logically between the /* and */ should visually be able to get on the right of */
3. No text logically before the /* should visually be able to get on the right of /*
4. No text logically after the */ should visually be able to get on the left of */

I'd say that a short way of saying that is that the text logically inside the /* and */ should be made to behave as a UBA paragraph, since no reorderings occur across paragraphs. Violations of that property should result in a syntax error or a warning.

So I would like foolproof tools that make it possible to a) detect violations of these constraints and b) enforce them.

For example in the case above, for enforcing them, would it be sufficient to insert an LRI (or RLI, or FSI) after /* and a PDI before */ ?
Would that make sure that the properties 1-4 are satisfied for all contexts and contents of comments ?

I hope the above makes the points of my message clearer.

Best,

Daniel

From daniel.buenzli at erratique.ch Fri Dec 3 03:04:11 2021
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 3 Dec 2021 10:04:11 +0100
Subject: Directionality controls for malicious code
In-Reply-To:
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org>
Message-ID:

On 3 December 2021 at 10:00:21, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:

> For example in the case above, for enforcing them, would it be sufficient to insert an LRI
> (or RLI, or FSI) after /* and a PDI before */ ?

Rather: insert before */ one PDI plus as many PDIs to close potential unclosed LRI/RLI/FSI you may find in the logical text inside /* */.

Daniel

From daniel.buenzli at erratique.ch Fri Dec 3 03:10:30 2021
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 3 Dec 2021 10:10:30 +0100
Subject: Directionality controls for malicious code
In-Reply-To:
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org>
Message-ID:

On 3 December 2021 at 10:04:11, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:

> On 3 December 2021 at 10:00:21, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:
>
> > For example in the case above, for enforcing them, would it be sufficient to insert an LRI
> > (or RLI, or FSI) after /* and a PDI before */ ?
>
> Rather: insert before */ one PDI plus as many PDIs to close potential unclosed LRI/RLI/FSI
> you may find in the logical text inside /* */.
and suppress unbalanced PDIs from the logical text inside /* */

I'm getting to something? :-)

D

From eliz at gnu.org Fri Dec 3 05:43:47 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Fri, 03 Dec 2021 13:43:47 +0200
Subject: Directionality controls for malicious code
In-Reply-To: (message from Daniel Bünzli on Fri, 3 Dec 2021 10:00:21 +0100)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org>
Message-ID: <837dcl9amk.fsf@gnu.org>

> Date: Fri, 3 Dec 2021 10:00:21 +0100
> From: Daniel Bünzli
> Cc: unicode at corp.unicode.org
>
> Yes. The idea is for the grammar of your language to disallow visual reorderings across certain textual boundaries specific to your language.

Text editors usually understand very little of the language grammar, if anything at all.

> If you take C multi-line comments /* … */ the idea is that:
>
> 1. No text logically between the /* and */ should visually be able to get on the left of /*
> 2. No text logically between the /* and */ should visually be able to get on the right of */
> 3. No text logically before the /* should visually be able to get on the right of /*
> 4. No text logically after the */ should visually be able to get on the left of */
>
> I'd say that a short way of saying that is that the text logically inside the /* and */ should be made to behave as a UBA paragraph, since no reorderings occur across paragraphs. Violations of that property should result in a syntax error or a warning.

That's a tough ticket. It requires the editor to perform the kind of processing that is much more complicated than what they do now. Think about nested comments, comment-like text inside strings embedded within the code, etc.
> For example in the case above, for enforcing them, would it be sufficient to insert an LRI (or RLI, or FSI) after /* and a PDI before */ ? Would that make sure that the properties 1-4 are satisfied for all contexts and contents of comments ?

If you have ever programmed an editor, you know that actually inserting something into the text of a file that wasn't there to begin with is a no-no: you are likely to leak those insertions to the outside world. Basically, there are user expectations that if you open a file, go through it without any changes, then save it, it ends up identical to what it was before. If you start inserting characters into the text, you will have a hard time keeping that promise, because it is hard to distinguish between text you insert and the text the user inserts.

From daniel.buenzli at erratique.ch Fri Dec 3 06:50:08 2021
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 3 Dec 2021 13:50:08 +0100
Subject: Directionality controls for malicious code
In-Reply-To: <837dcl9amk.fsf@gnu.org>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org> <837dcl9amk.fsf@gnu.org>
Message-ID:

Eli,

I'm not suggesting to do *any* of what I mentioned in text editors :-) Text editors should just do regular UBA.

These checks are meant to be done by the compiler on the sources they are fed with. Somehow the problem really looks like matching a new kind of parentheses with a few twists. It would just be nice if we had a good description of the exact rules you need to enforce.

Regarding the idea of text insertion to ensure text boundaries are respected I'm more thinking about templating. Suppose you have a template with fields to fill in with untrusted user input and you want to make sure the text remains contained within well-specified boundaries.

Best,
Daniel

From eliz at gnu.org Fri Dec 3 07:04:51 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Fri, 03 Dec 2021 15:04:51 +0200
Subject: Directionality controls for malicious code
In-Reply-To: (message from Daniel Bünzli on Fri, 3 Dec 2021 13:50:08 +0100)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org> <837dcl9amk.fsf@gnu.org>
Message-ID: <83zgph7sb0.fsf@gnu.org>

> Date: Fri, 3 Dec 2021 13:50:08 +0100
> From: Daniel Bünzli
> Cc: unicode at corp.unicode.org
>
> I'm not suggesting to do *any* of what I mentioned in text editors :-) Text editors should just do regular UBA.

Editors have a better chance to catch the user's attention. Besides, what about the cases when you get a program compiled by someone else and review its sources to see if you can trust it?

> These checks are meant to be done by the compiler on the sources they are fed with. Somehow the problem really looks like matching a new kind of parentheses with a few twists. It would just be nice if we had a good description of the exact rules you need to enforce.

Compilers are much less likely to like your ideas, because your ideas require an implementation of the UBA, which compilers have not had to do until now.

> Regarding the idea of text insertion to ensure text boundaries are respected I'm more thinking about templating. Suppose you have a template with fields to fill in with untrusted user input and you want to make sure the text remains contained within well-specified boundaries.

I'm not sure how this will help. Program code can be written even in a very simple text editor without any templates. How can you tell users to use templates?
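[Editorial sketch, not part of any message in the thread.] The enforcement rule proposed for templating (wrap the untrusted field in an isolate, close any isolates it leaves open, and suppress unbalanced PDIs) can be sketched in Python as below. The function name `isolate_span` and the default choice of FSI are assumptions made for illustration. Whether the rule is sufficient to guarantee properties 1-4 is the open question in the thread, and this sketch does not settle it. On one reading of UAX #9 (rule X6a), the closing PDI also terminates embeddings and overrides left open inside the isolate, which is why only isolate controls are counted here.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"
ISOLATE_INITIATORS = frozenset((LRI, RLI, FSI))


def isolate_span(content: str, initiator: str = FSI) -> str:
    """Wrap content in a bidi isolate, balancing isolate controls inside it.

    Hypothetical helper: drops PDIs that have no matching initiator inside
    content, appends one PDI per initiator left open, then closes the wrapper.
    """
    if initiator not in ISOLATE_INITIATORS:
        raise ValueError("initiator must be LRI, RLI or FSI")
    kept = []
    depth = 0  # isolate initiators opened inside content and not yet closed
    for ch in content:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth == 0:
                continue  # unbalanced PDI: suppress it
            depth -= 1
        kept.append(ch)
    # Close whatever is still open, then close the wrapping isolate itself.
    return initiator + "".join(kept) + PDI * depth + PDI
```

One limitation to keep in mind: in the UBA a paragraph separator ends every open isolate, so content that spans several lines would presumably need to be wrapped line by line rather than as a whole.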
From daniel.buenzli at erratique.ch Fri Dec 3 07:29:56 2021
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 3 Dec 2021 14:29:56 +0100
Subject: Directionality controls for malicious code
In-Reply-To:
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org> <837dcl9amk.fsf@gnu.org>
Message-ID:

On 3 December 2021 at 14:04:51, Eli Zaretskii (eliz at gnu.org) wrote:

> Editors have a better chance to catch the user's attention.

A compiler error catches your attention very quickly :-)

> Besides, what about the cases when you get a program compiled by
> someone else and review its sources to see if you can trust it?

Personally I don't think that's a super interesting use case. If you want to gain trust in a binary by reading its source code you need the ability to build it and something like [1]. Otherwise you are not gaining any trust over the binary at all.

> I'm not sure how this will help. Program code can be written even in
> a very simple text editor without any templates. How can you tell
> users to use templates?

In this case I'm not talking about computer programs. I'm talking about user input in general e.g. any website with user generated content.

Best,
Daniel

[1]: https://reproducible-builds.org/

From eliz at gnu.org Fri Dec 3 07:37:34 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Fri, 03 Dec 2021 15:37:34 +0200
Subject: Directionality controls for malicious code
In-Reply-To: (message from Daniel Bünzli on Fri, 3 Dec 2021 14:29:56 +0100)
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org> <837dcl9amk.fsf@gnu.org>
Message-ID: <83v9057qsh.fsf@gnu.org>

> Date: Fri, 3 Dec 2021 14:29:56 +0100
> From: Daniel Bünzli
> Cc: unicode at corp.unicode.org
>
> On 3 December 2021 at 14:04:51, Eli Zaretskii (eliz at gnu.org) wrote:
>
> > Editors have a better chance to catch the user's attention.
>
> A compiler error catches your attention very quickly :-)

We've gone a full circle now: flagging those as errors, when signal-to-noise ratio is low, is not a good idea. Users will turn the errors/warnings off, and that's the end of it.

> > Besides, what about the cases when you get a program compiled by
> > someone else and review its sources to see if you can trust it?
>
> Personally I don't think that's a super interesting use case.

It happens in my experience almost every day. YMMV, of course.
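[Editorial sketch, not part of any message in the thread.] The compiler-side check discussed in this exchange was described earlier in the thread as "matching a new kind of parentheses". A minimal Python sketch of that view follows: given the text of one comment or string literal, it reports the bidi controls that are not balanced inside that span. The function name `unbalanced_bidi_controls` and the strict-nesting policy are assumptions. The sketch does not implement the UBA, so it is only a conservative syntactic approximation of "does not visually overflow"; strong right-to-left letters can still reorder with their neighbours without any control character being present.

```python
# Which terminator each opening control expects (strict nesting is an
# assumption of this sketch, stricter than what the UBA itself tolerates).
OPENERS = {
    "\u202A": "PDF", "\u202B": "PDF",  # LRE, RLE
    "\u202D": "PDF", "\u202E": "PDF",  # LRO, RLO
    "\u2066": "PDI", "\u2067": "PDI", "\u2068": "PDI",  # LRI, RLI, FSI
}
CLOSERS = {"\u202C": "PDF", "\u2069": "PDI"}


def unbalanced_bidi_controls(span: str):
    """Return sorted offsets of bidi controls not balanced within span."""
    stack = []     # (offset, expected closer kind) of still-open controls
    problems = []  # offsets of terminators with nothing matching to close
    for offset, ch in enumerate(span):
        if ch in OPENERS:
            stack.append((offset, OPENERS[ch]))
        elif ch in CLOSERS:
            if stack and stack[-1][1] == CLOSERS[ch]:
                stack.pop()
            else:
                problems.append(offset)
    # Anything still open at the end of the span leaks past its boundary.
    problems.extend(offset for offset, _ in stack)
    return sorted(problems)
```

A compiler could raise a diagnostic only when the returned list is non-empty, which leaves comments and strings whose controls are properly paired unflagged.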
From daniel.buenzli at erratique.ch Fri Dec 3 07:55:06 2021
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 3 Dec 2021 14:55:06 +0100
Subject: Directionality controls for malicious code
In-Reply-To: <83v9057qsh.fsf@gnu.org>
References: <794ff17b-df41-1a16-39b7-8a173eae5fd1@khwilliamson.com> <005601d7e73d$3f3c0b80$bdb42280$@gmail.com> <001501d7e74d$a9f0d630$fdd28290$@ewellic.org> <83v907pg7c.fsf@gnu.org> <834k7qamf3.fsf@gnu.org> <83ilw6886f.fsf@gnu.org> <837dcl9amk.fsf@gnu.org> <83v9057qsh.fsf@gnu.org>
Message-ID:

On 3 December 2021 at 14:37:34, Eli Zaretskii (eliz at gnu.org) wrote:

> We've gone a full circle now: flagging those as errors, when
> signal-to-noise ratio is low, is not a good idea. Users will turn the
> errors/warnings off, and that's the end of it.

The idea is to *only* flag the illegitimate uses. Defined as: those cases when text visually overflows the syntactic boundaries established by your programming language (e.g. string literal or comment delimiters).

This is a high signal-to-noise ratio and you can even make it a hard syntax error.

I'm not saying nothing should be done in editors to help with this as well. But I don't really understand your objections, these are complementary approaches.

Best,

Daniel

From public at khwilliamson.com Mon Dec 6 10:15:49 2021
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 6 Dec 2021 09:15:49 -0700
Subject: \p{Numeric_Value=-1/2}
Message-ID:

U+0F33 TIBETAN DIGIT HALF ZERO has a numeric value of -0.5. (I believe the existence of this character in the wild is apocryphal however.) There is no rule against other code points becoming encoded with a negative value.

However, UTS 18 says the hyphen-minus sign is supposed to be ignored within \p{} constructs, leaving no way to legally specify negative values.

I suspect that UTS 18 should be clarified to indicate that the hyphen minus at the beginning of a number should not be ignored, even with loose matching.
But then what to do about two in a row?

From haberg-1 at telia.com Mon Dec 6 12:12:06 2021
From: haberg-1 at telia.com (Hans Åberg)
Date: Mon, 6 Dec 2021 19:12:06 +0100
Subject: Persian musical accidentals koron and sori
Message-ID: <276469FF-83C0-440E-A6F6-373E207413DD@telia.com>

In view of your proposal for the Persian musical accidentals koron and sori [1], there is a proposed glyph design intended for LilyPond [2-4]. Comments on the design appreciated.

1. https://corp.unicode.org/~roozbeh/sori-koron.pdf
2. https://gitlab.com/lilypond/lilypond/-/merge_requests/1047
3. https://gitlab.com/lilypond/lilypond/uploads/b40d4d586ddb994f91be7c650d808833/iranian.pdf
4. https://lists.gnu.org/archive/html/lilypond-user/2021-12/msg00072.html

From jameskass at code2001.com Thu Dec 9 14:24:57 2021
From: jameskass at code2001.com (James Kass)
Date: Thu, 9 Dec 2021 20:24:57 +0000
Subject: Khitan Small Script chart glyphs
Message-ID: <9c272d8a-1d75-b27b-5cb9-2cf455e7f4d0@code2001.com>

Looking over the 14.0 charts for Khitan Small Script, it appears that the glyphs used for U+18BDE and U+18CCA are identical. Is there a distinction I'm missing, or is one of those glyphs incorrect?

From 4mm4adbfrm4 at tonton-pixel.com Thu Dec 9 14:51:53 2021
From: 4mm4adbfrm4 at tonton-pixel.com (Michel Mariani)
Date: Thu, 9 Dec 2021 21:51:53 +0100
Subject: Khitan Small Script chart glyphs
In-Reply-To: <9c272d8a-1d75-b27b-5cb9-2cf455e7f4d0@code2001.com>
References: <9c272d8a-1d75-b27b-5cb9-2cf455e7f4d0@code2001.com>
Message-ID:

> On 9 Dec 2021 at 21:24, James Kass via Unicode wrote:
>
> Looking over the 14.0 charts for Khitan Small Script, it appears that the glyphs used for U+18BDE and U+18CCA are identical. Is there a distinction I'm missing, or is one of those glyphs incorrect?

Yes, this has been recently documented in the Updates and Errata document:

> The code charts for Khitan Small Script in Unicode version 14.0 show the glyphs for U+18BDE and U+18CCA as identical.
However, the characters are not duplicates. The glyph for U+18CCA will be modified in a future version of the standard.

From dwanders at sonic.net Thu Dec 9 14:58:09 2021
From: dwanders at sonic.net (dwanders at sonic.net)
Date: Thu, 9 Dec 2021 12:58:09 -0800
Subject: Khitan Small Script chart glyphs
In-Reply-To: <9c272d8a-1d75-b27b-5cb9-2cf455e7f4d0@code2001.com>
References: <9c272d8a-1d75-b27b-5cb9-2cf455e7f4d0@code2001.com>
Message-ID: <017201d7ed3f$793afaf0$6bb0f0d0$@sonic.net>

Dear James,

Good catch! This was noted in L2/21-182 and discussed by the Script Ad Hoc (see the recommendations in L2/21-174 on page 20). As noted by Michel Mariani, an erratum notice was posted, based on UTC consensus 169-C18.

Debbie

-----Original Message-----
From: Unicode On Behalf Of James Kass via Unicode
Sent: Thursday, December 9, 2021 12:25 PM
To: Unicode Public List
Subject: Khitan Small Script chart glyphs

Looking over the 14.0 charts for Khitan Small Script, it appears that the glyphs used for U+18BDE and U+18CCA are identical. Is there a distinction I'm missing, or is one of those glyphs incorrect?

From wjgo_10009 at btinternet.com Thu Dec 16 15:42:28 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 16 Dec 2021 21:42:28 +0000 (GMT)
Subject: Teletext control codes
Message-ID: <43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com>

In the document

Proposal to add further characters from legacy computers and teletext to the UCS

https://www.unicode.org/L2/L2021/21235-terminals-supplement.pdf

at the end of page 4 is the following.

> Control characters from microcomputer platforms and teletext were also
> determined to be out of scope for the UCS.
These characters were
> located in what would today be considered the C0 control range
> (0x00–0x1F) or the C1 control range (0x7F–0x9F). Processes that need
> to interchange these codes should simply interchange the binary C0 or
> C1 value, extended to the UCS code space but without further mapping.
> Emulators should treat these control codes as appropriate for the
> targeted environment.

In relation to teletext control codes, my opinion is that they need to be encoded separately from the C0 control range. This would ensure that in interchange none of the teletext control codes is ever misinterpreted as having the basic C0 character meaning.

There was no ambiguity possible in a teletext system. In a viewdata system where the display was almost similar, there would have been ambiguity and so the control characters for viewdata systems were encoded not in the C0 set so as to avoid clashing between the teletext character set as used for viewdata and the basic control characters.

I remember that the signaling system used for control characters in viewdata was explained in an article by Mr S Fedida in an issue of Wireless World, before September 1977. It may have been one article in a four article sequence, spread over four issues of the magazine.

https://en.wikipedia.org/wiki/Samuel_Fedida

One possibility is to encode the teletext control characters as a block of 32 code points in plane 14, without closing up the unused points. These characters in plane 14 would be displayable characters and thus not control characters in non-teletext-emulating systems, each displayed as a glyph specified in The Unicode Standard as two small capital letters arranged one above the other, but not overlapping. For example A above G for Alphanumerics Green.

I opine that it would be good for the proposal to be extended to include encoding of the teletext control characters please.

Could we discuss this please?
William Overington Thursday 16 December 2021 From beckiergb at gmail.com Thu Dec 16 16:19:05 2021 From: beckiergb at gmail.com (Rebecca Bettencourt) Date: Thu, 16 Dec 2021 14:19:05 -0800 Subject: Teletext control codes In-Reply-To: <43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com> References: <43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com> Message-ID: On Thu, Dec 16, 2021 at 2:15 PM William_J_G Overington via Unicode < unicode at corp.unicode.org> wrote: > Could we discuss this please? > No. -- Rebecca Bettencourt -------------- next part -------------- An HTML attachment was scrubbed... URL: From harjitmoe at outlook.com Thu Dec 16 18:06:36 2021 From: harjitmoe at outlook.com (Harriet Riddle) Date: Fri, 17 Dec 2021 00:06:36 +0000 Subject: Teletext control codes In-Reply-To: <43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com> References: <43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com> Message-ID: William_J_G Overington via Unicode wrote: > In relation to teletext control codes, my opinion is that they need to > be encoded separately from the C0 control range. This would ensure > that in interchange that none of the teletext control codes is ever > misinterpreted as having the basic C0 character meaning. What is the "basic C0 character meaning"? In fact, the method of declaring when alternative C0 and C1 sets are in use is already /explicitly covered/ by ISO 10646 in section 13.4, which I excerpt as follows (I've added some explanations in hard brackets): For other C0 or C1 sets, the final octet F shall be obtained from the International Register of Coded Character Sets. The identifier sequences for these sets shall be * ESC 02/01 F /[i.e. 0x1B 0x21 then a byte from 0x40?7E (or 0x30?3F for a private use C0 set)]/ identifies a C0 set * ESC 02/02 F /[i.e. 
0x1B 0x22 then a byte from 0x40?7E (or 0x30?3F for a private use C1 set)]/ identifies a C1 set If such an escape sequence appears within a code unit sequence conforming to ISO/IEC 2022 /[this strictly speaking includes e.g. ISO-8859-2, EUC-JP and ISO-2022-JP but not e.g. Windows-1252, Shift_JIS, EBCDIC or any Unicode encoding?although this specific provision is applicable to any ASCII?ish encoding]/, it shall consist only of the sequences of bit combinations as shown above /[i.e. each byte value in the escape sequence shall be emitted as a single byte of the specified binary value without transcoding]/. If such an escape sequence appears within a code unit sequence conforming to this document /[i.e. in a Unicode string]/, it shall be padded in accordance with Clause 12 /[i.e. in UTF-16 or UTF-32, a whole code unit shall be emitted for each byte in the escape sequence, not just the single byte; this has no actual effect on UTF-8]/. (end of excerpt) Here's the International Register; notice thirteen separate C0 sets in section 2.5 here, plus ten C1 sets in section 2.6: https://web.archive.org/web/20190424200034/https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf Now, the teletext control set, as it appears in the International Register, is actually IR-056, which is a C1 (rather than C0) set registration with the escape sequence ESC 0x22 0x40; this is because one of the ITU T.101 Videotex formats uses the Teletext control set as its C1 set and its registration was referenced to ITU T.101 (this also means it has the ECMA-48 CSI instead of a duplicate ESC?it is important to note here that Teletext's use of ESC should not even still appear in Teletext data once it's transcoded to Unicode, since it's used for character set switching, albeit in a manner incompatible with ISO 2022). 
https://web.archive.org/web/20200614215855/https://www.itscj.ipsj.or.jp/iso-ir/056.pdf So the /unambiguous/ way of representing it is to first emit the escape sequence U+001B+0022+0040 at the start of the string (or before any Teletext control codes appear), and /then/ map the Teletext controls (except ESC, which should change transcoder state but not be emitted) to U+0080?9F. This is /already well defined by the relevant standards/, and no work needs to be done on that front. As for adding /support/ for this to e.g. terminal emulators, that's another matter, but it's one which would need to be done for any other solution you might be inclined to propose too, so that's not really saying much. > There was no ambiguity possible in a teletext system. Well, yes, because the higher-level protocols were agreed upon as part of the system; it's only when this is mixed with another system (e.g. ECMA-48) that there needs to be indicators as to which one is in use: ESC 0x22 0x43 for ECMA-48's C1 set, versus ESC 0x22 0x40 for the Teletext controls used in the C1 area (represented either as codepoints or escape sequences). > One possibility is to encode the teletext control characters as a > block of 32 code points in plane 14, without closing up the unused > points. These characters in plane 14 would be displayable characters > and thus not control characters in non-teletext-emulating systems, > each displayed as a glyph specified in The Unicode Standard as two > small capital letters arranged one above the other, but not > overlapping. For example A above G for Alphanumerics Green. The existing control pictures never function as format effectors in their own right, and it would be weird if others started to. > I opine that it would be good for the proposal be extended to include > encoding of the teletext control characters please. Control "character" is a bit of a wooly term. 
The term "control /code/" refers specifically to the category Cc characters (a closed category), which don't have formal names (although they do have formal aliases) and mostly have behaviour defined by higher level protocols rather than by Unicode itself, some of which actually carry instructions for things other than the text renderer (BEL, for example). Category Cf, Zl and Zp characters are format effectors, which are a type of nonprinting character which semantically constitute part of the text itself and affect only how it's displayed by triggering (effecting) a particular format behaviour (line break, RTL override, permitted or forbidden line break, superscript etc). Some Cc characters from particular C0 or C1 sets are format effectors (LF for example), but the Cf/Zl/Zp ones are full-fledged Unicode characters with names and semantics defined by Unicode itself, not a higher level protocol such as ECMA-48 or the aforementioned IR-056. The Teletext controls are control /codes/, in existing systems that incorporate them. > > Could we discuss this please? > > William Overington > > Thursday 16 December 2021 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kent.b.karlsson at bahnhof.se Sun Dec 19 05:34:16 2021 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Sun, 19 Dec 2021 12:34:16 +0100 Subject: Teletext control codes In-Reply-To: References: <43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com> Message-ID: > 17 dec. 2021 kl. 01:06 skrev Harriet Riddle via Unicode : > > William_J_G Overington via Unicode wrote: >> In relation to teletext control codes, my opinion is that they need to be encoded separately from the C0 control range. This would ensure that in interchange that none of the teletext control codes is ever misinterpreted as having the basic C0 character meaning. > > What is the "basic C0 character meaning"? 
> > In fact, the method of declaring when alternative C0 and C1 sets are in use is already explicitly covered by ISO 10646 in section 13.4, which I excerpt as follows (I've added some explanations in hard brackets): > For other C0 or C1 sets, the final octet F shall be obtained from the International Register of Coded Character Sets. The identifier sequences for these sets shall be > > ESC 02/01 F [i.e. 0x1B 0x21 then a byte from 0x40?7E (or 0x30?3F for a private use C0 set)] identifies a C0 set > ESC 02/02 F [i.e. 0x1B 0x22 then a byte from 0x40?7E (or 0x30?3F for a private use C1 set)] identifies a C1 set > If such an escape sequence appears within a code unit sequence conforming to ISO/IEC 2022 [this strictly speaking includes e.g. ISO-8859-2, EUC-JP and ISO-2022-JP but not e.g. Windows-1252, Shift_JIS, EBCDIC or any Unicode encoding?although this specific provision is applicable to any ASCII?ish encoding], it shall consist only of the sequences of bit combinations as shown above [i.e. each byte value in the escape sequence shall be emitted as a single byte of the specified binary value without transcoding]. > > If such an escape sequence appears within a code unit sequence conforming to this document [i.e. in a Unicode string], it shall be padded in accordance with Clause 12 [i.e. in UTF-16 or UTF-32, a whole code unit shall be emitted for each byte in the escape sequence, not just the single byte; this has no actual effect on UTF-8]. > > (end of excerpt) That is a total, utter, complete, one thousand percent non-starter. Not just for Teletext, but for everything. The quoted section really needs to be deleted. It is an absolutely horrible suggestion, and flies in the face of the design of Unicode and current 10646. > >> One possibility is to encode the teletext control characters as a block of 32 code points in plane 14, without closing up the unused points. 
These characters in plane 14 would be displayable characters and thus not control characters in non-teletext-emulating systems, each displayed as a glyph specified in The Unicode Standard as two small capital letters arranged one above the other, but not overlapping. For example A above G for Alphanumerics Green. That is also a non-starter. But William is right on one point: The Teletext (Text-TV) control codes do (already!) have a graphical representation. And that is ? a SPACE. > Control "character" is a bit of a wooly term. The term "control code" refers specifically to the category Cc characters (a closed category), which don't have formal names (although they do have formal aliases) and mostly have behaviour defined by higher level protocols rather than by Unicode itself, some of which actually carry instructions for things other than the text renderer (BEL, for example). Category Cf, Zl and Zp characters are format effectors, which are a type of nonprinting character which semantically constitute part of the text itself and affect only how it's displayed by triggering (effecting) a particular format behaviour (line break, RTL override, permitted or forbidden line break, superscript etc). Some Cc characters from particular C0 or C1 sets are format effectors (LF for example), but the Cf/Zl/Zp ones are full-fledged Unicode characters with names and semantics defined by Unicode itself, not a higher level protocol such as ECMA-48 or the aforementioned IR-056. > > The Teletext controls are control codes, in existing systems that incorporate them I?m not sure what that is supposed to explain. I just got confused, it did not appear to make any sense. However, there are appropriate ways of dealing with (most) Teletext controls in a Unicode context. We discussed that over a year ago. 
Hint: take your pick: HTML (can handle most of Teletext, with special fonts; and that is done for many Teletext pages on a daily basis currently, indeed, minute-by-minute basis, to show them in web pages and smart phone apps) or ECMA-48 (needs some extensions to handle all Teletext controls).

/Kent K

PS
Not sure why I have been listed as one of the authors of https://www.unicode.org/L2/L2021/21235-terminals-supplement.pdf . Yes, I have had comments on an earlier version, and I have made suggestions related to it. But I have had no part in writing that document, I have not even been consulted about it. And as you can see from this email, I strongly oppose some of the things suggested there.

>>
>> Could we discuss this please?
>>
>> William Overington
>>
>> Thursday 16 December 2021

From kent.b.karlsson at bahnhof.se Sun Dec 19 19:10:46 2021
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Mon, 20 Dec 2021 02:10:46 +0100
Subject: Teletext control codes
In-Reply-To: <7c7490cb.2b6df.17dd361c07b.Webtop.96@btinternet.com>
References: <43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com> <7c7490cb.2b6df.17dd361c07b.Webtop.96@btinternet.com>
Message-ID:

> On 19 Dec 2021, at 16:48, William_J_G Overington wrote:
>
> Hi
>
> > That is also a non-starter. But William is right on one point: The Teletext (Text-TV) control codes do (already!) have a graphical representation. And that is... a SPACE.
>
> Um. what exactly are you saying I am right about please?
>
> > And that is... a SPACE.
>
> Not quite. If in Hold graphics mode, a control character might not be displayed as a space.

Yes, that is a detail I glossed over. But whether or not in "hold graphics mode", mapping to a SPACE (plus some kind of indication of color change (usually)) or to "previous G3 char (if any)"
(plus some kind of indication of color change (usually)) is a matter of conversion, not a matter of control code interpretation (after the conversion).

>
> https://www.rawles.org.uk/teletext/hold/
>
> I used the Hold Graphics character near the start of each line of my Colour Check graphic that was on page 786 of Viewdata in September 1977, which was when I saw it. I do not know for how long it stayed on that Viewdata page. I wonder if it can be recovered from an archive somewhere.
>
> Basically the design was as if a large block of red colour graphics at upper left, a large block of green colour graphics at upper right and a large block of colour graphics at lower middle all overlapped, so that colours of red, green, blue, yellow, white, magenta and cyan all appeared.
>
> By Hold Graphics mode being in operation, there was no blank cell when the colour changed from red to yellow or from yellow to green, and so on.
>
> There was a blank area of black of about two or three lines at the top and bottom and black at left and right so that it looked good on the page as a work of art.
>
> The large blocks of colour were each made of teletext graphics that were the graphic equivalent of a lowercase letter 'e', all of this on a black background, so the large colour blocks were not solid but made up of chunks of colour on a black background. Contiguous graphics were used, so some chunks were one size and some chunks another size.
>
> The standards document to which I linked includes a word that I coined, namely telesoftware.
>
> Regarding my encoding in plane 14 idea.
>
> > That is also a non-starter.
>
> Any particular reason for that opinion please?

Because all of the Teletext control codes are totally absurd/bizarre and should not be encoded in any way whatsoever in Unicode/10646. In addition, newer, and standard, versions of Teletext have in the protocol ("out of line"
with the text) ways of specifying that the text is italic, bold, or has one of a number of colors not covered by the original Teletext implementation. It is a quite complicated conversion (to HTML or ECMA-48-extended) that cannot be covered by your approach.

/Kent K

> I suppose that I could incorporate some codes for teletext control codes into The Mariposa System if necessary.
>
> http://www.users.globalnet.co.uk/~ngo/mariposa_novel.htm
>
> http://www.users.globalnet.co.uk/~ngo/
>
> > That is a total, utter, complete, one thousand percent non-starter. Not just for Teletext, but for everything. The quoted section really needs to be deleted. It is an absolutely horrible suggestion, and flies in the face of the design of Unicode and current 10646.
>
> Could you possibly clarify as to which section of what needs to be deleted please?
>
> Best regards,
>
> William Overington
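
For readers following the thread, the padding rule in the ISO/IEC 10646 section 13.4 excerpt quoted earlier can be illustrated with a short sketch. It only shows what the excerpt describes (one whole code unit per escape-sequence byte in UTF-16 and UTF-32, no change in UTF-8); it is not an endorsement of the mechanism and is not code from any poster. The function name `designate_control_set` and the final byte 0x40 are assumptions made for the example; a real final byte would come from the International Register of Coded Character Sets.

```python
ESC = 0x1B  # ESCAPE, the first byte of every designation sequence


def designate_control_set(final_byte: int, c1: bool = False) -> str:
    """Build ESC 02/01 F (C0 set) or ESC 02/02 F (C1 set) as a Unicode string.

    Hypothetical helper for illustration only. The excerpt allows final
    bytes 0x40-0x7E, or 0x30-0x3F for private-use sets.
    """
    if not 0x30 <= final_byte <= 0x7E:
        raise ValueError("final byte must be in the range 0x30..0x7E")
    intermediate = 0x22 if c1 else 0x21  # 02/02 selects C1, 02/01 selects C0
    # Each byte of the escape sequence becomes one code point of equal value.
    return "".join(chr(b) for b in (ESC, intermediate, final_byte))


seq = designate_control_set(0x40)  # 0x40 is an example final byte

# Encoding the same three code points shows the "padding" in each form.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{encoding:10} {seq.encode(encoding).hex(' ')}")

# utf-8      1b 21 40
# utf-16-be  00 1b 00 21 00 40
# utf-32-be  00 00 00 1b 00 00 00 21 00 00 00 40
```

In an ISO/IEC 2022 conforming byte stream the same sequence would be just the three bytes shown on the UTF-8 line, emitted without transcoding.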