From unicode at unicode.org Fri Feb 1 01:16:04 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 01 Feb 2019 09:16:04 +0200
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: <20190131231719.2b545f7f@JRWUBU2> (message from Richard Wordingham via Unicode on Thu, 31 Jan 2019 23:17:19 +0000)
References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2>
Message-ID: <83d0ocyo5n.fsf@gnu.org>

> Date: Thu, 31 Jan 2019 23:17:19 +0000
> From: Richard Wordingham via Unicode
>
> Emacs needs a lot of help - I can't write a generic Tai Tham
> OpenType .flt file :-(

Which is why Emacs is migrating towards HarfBuzz.

From unicode at unicode.org Fri Feb 1 05:02:45 2019
From: unicode at unicode.org (Khaled Hosny via Unicode)
Date: Fri, 1 Feb 2019 13:02:45 +0200
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: <20190131231719.2b545f7f@JRWUBU2>
References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2>
Message-ID: <20190201110245.GA2111@macbook.localdomain>

On Thu, Jan 31, 2019 at 11:17:19PM +0000, Richard Wordingham via Unicode wrote:
> On Thu, 31 Jan 2019 12:46:48 +0100
> Egmont Koblinger wrote:
>
> No. How many cells do CJK ideographs occupy? We've had a strong hint
> that a medial BEH should occupy one cell, while an isolated BEH should
> occupy two.

Monospaced Arabic fonts (there are not that many of them) are designed so that all forms occupy just one cell (most even including the mandatory lam-alef ligatures), unlike CJK fonts.

I can imagine the terminal restricting itself to monospaced fonts, disabling the 'liga' feature just in case, and expecting the font to behave well. Any other magic is likely to fail.

Regards,
Khaled

From unicode at unicode.org Fri Feb 1 05:04:44 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 1 Feb 2019 11:04:44 +0000
Subject: Encoding italic
In-Reply-To: <20190131151813.byeaj4fen5uptpb4@angband.pl>
References: <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <20190131151813.byeaj4fen5uptpb4@angband.pl>
Message-ID: <9f03881a-0505-ea03-ba6e-c8bebe837135@gmail.com>

On 2019-01-31 3:18 PM, Adam Borowski via Unicode wrote:
> They're only from a spammer's point of view.

Spammers need love, too. They're just not entitled to any.

From unicode at unicode.org Fri Feb 1 06:40:48 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 1 Feb 2019 13:40:48 +0100
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: <83tvhozx4l.fsf@gnu.org>
References: <83k1in30kh.fsf@gnu.org> <83va2616nj.fsf@gnu.org> <83tvhozx4l.fsf@gnu.org>
Message-ID:

Hi Eli,

> Arabic presentation forms are more like an exception than a rule, I
> hope you understand this by now. Most languages/scripts don't have
> such forms, and even for Arabic they cover only a part of what needs
> to be done to present correctly shaped text. Complex script shaping
> is much more than just substituting some glyphs with others, it
> requires an intimate knowledge of the font being used and its
> capabilities, and the ability to control how various glyphs of a
> grapheme cluster are placed relative to one another, something that an
> application running on a text terminal cannot do.
>
> So I suggest that you don't consider Arabic presentation forms a
> representative of the direction in which terminal emulators supporting
> such scripts should evolve.

Thanks a lot for this information!

I now understand that presentation forms aren't an ideal approach, and the recommendation should be improved here.

Until it happens, I'm uncertain whether using presentation form characters is a decent low hanging fruit that significantly improves the readability in some situations (e.g. "good enough" in some sense for Arabic), or is a dead end we shouldn't propagate.

I still do not agree however that the entire responsibility can be shifted to the emulator. There are certain important bits of information that are only available to the application, and not the emulator - as with many other aspects, such as reordering, copy-pasting, searching in the data in BiDi-aware text editors using the terminal's explicit mode, which are all pushed to the application because the emulator cannot do them correctly.

I believe we should further study the situation, e.g. see whether ECMA-48's SAPV (8.3.18) parameters 5..8 (to explicitly specify whether to use isolated/initial/medial/final form for each character) are flexible enough to convey all this information, or perhaps a new, more powerful means should be crafted.

At this point I lack sufficient knowledge to fix the design; I'd need to spend a lot of time studying the situation and/or working together with you guys, if you're up for it.

Thanks a lot,
egmont

From unicode at unicode.org Fri Feb 1 06:54:02 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 1 Feb 2019 13:54:02 +0100
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: <83sgx8zx0d.fsf@gnu.org>
References: <20190129175046.x4jkv25mfnun7q7b@angband.pl> <83tvhq166l.fsf@gnu.org> <83sgx8zx0d.fsf@gnu.org>
Message-ID:

Hi Eli,

> So we will some day have one such terminal emulator. That's good, but
> a text-mode application that needs to support bidi cannot rely on its
> users all having access to that single terminal.

No.
A text-mode application that needs to support BiDi must do the BiDi itself and pass visual order to the emulator, and beforehand switch the emulator to explicit mode so that you don't end up with "double BiDi". Once you emit visual order, there's no need for any BiDi control characters.

For this behavior, the only feature you need from a terminal emulator is to have a mode where it doesn't shuffle the characters. Currently every emulator I'm aware of has such a mode, although in some of them you have to tweak the settings to get to this mode (in my firm opinion it's an unacceptable user experience), while in emulators according to my specification there'll be an escape sequence for text-mode apps to automatically switch to this mode.

What BiDi control characters (LRE, LRI, FSI etc.) in implicit mode will give you - if supported - is that you'll be able to execute "cat file", and it'll be displayed correctly, even taking FSI and friends as present in the file into account. Of course this will only work in terminal emulators that support this.

> This is indeed a significant issue, because it means applications
> cannot force the terminal use a certain non-default base paragraph
> direction.

They can, since there's a dedicated escape sequence (SCP) for setting the base paragraph direction.

That being said, not being able to remember FSI at the beginning of a string is indeed a significant issue, we agree on this. We just need to figure out how to alter the emulation behavior to remember them, which I find the next big step to address in the specification.

cheers,
egmont

From unicode at unicode.org Fri Feb 1 07:16:03 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 1 Feb 2019 14:16:03 +0100
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: <83r2cszwvp.fsf@gnu.org>
References: <20190129175046.x4jkv25mfnun7q7b@angband.pl> <83r2cu15z7.fsf@gnu.org> <83r2cszwvp.fsf@gnu.org>
Message-ID:

Hi,

On Thu, Jan 31, 2019 at 4:10 PM Eli Zaretskii wrote:

> The reordering happens before TABs are converted to cursor motion,
> does it not?

No, not at all.

You cannot "mix" handling the input and reordering, since the input is not available as a single step but arrives continuously in a stream.

Consider a heavy BiDi text such as (I'm making up some random gibberish, uppercase being RTL):

    foo BAR FSI BAz quUX 1234 PDI whatEVer

Someone prints it to the terminal, but due to the internals, the terminal doesn't receive this in one single step but in two consecutive ones, broken in the middle. Maybe the app split it in half (e.g. a shell script printed fragments one by one using printf without a trailing newline). Maybe the emitter is a "dd" printing blocks of let's say 4kB and this line happens to cross a boundary. Maybe a transport layer such as ssh split it for whatever reason.

Then would you take the first half of this text, let's say

    foo BAR FSI BAz quU

even with unbalanced BiDi controls, then reorder it, and continue from it? Continue how? How to remember to reorder the second half too, but not the first half once again in order to avoid "double BiDi"?

What to do with explicit cursor movement, would it jump to the visual position? This would break absolutely basic principles, e.g. jumping twice to the same location to overwrite a letter twice in a row may actually end up overwriting two different letters, since everything was potentially rearranged after the first overwrite happened. Any application having any existing preconception about cursor movement would uncontrollably fall apart.

This approach is doomed to fail big time (and was the reason I had to drop ECMA TR/53's DCSM "presentation" mode).

The only reasonable way is if you have two layers. The bottom layer does the emulation almost exactly as it used to do, with no BiDi whatsoever (except for tiny additions, e.g. it tracks BiDi-related properties such as the paragraph direction). The upper layer displays the data, and this upper layer performs BiDi solely for display purposes: using the lower layer's data as input, but not modifying it. This is, by the way, also what current emulators that shuffle the characters around do.

Let's also mention that the lower layer (emulation) should be as fast as possible. E.g. VTE can handle input in the ballpark of 10MB/s. Reordering, that is, running BiDi for display purposes needs to happen much more rarely, maybe 20-60 times per second. It would be a performance killer having to run the BiDi algorithm upon every received chunk of data - in fact, to eliminate any possible behavior difference due to timing difference, it'd need to happen after every printable character received.

There's absolutely no way we could reorder first, and then handle TAB's cursor movement. TAB's cursor movement happens in the lower layer, reordering happens in the upper one.

cheers,
egmont

From unicode at unicode.org Fri Feb 1 07:26:00 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 01 Feb 2019 15:26:00 +0200
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: (message from Egmont Koblinger on Fri, 1 Feb 2019 13:40:48 +0100)
References: <83k1in30kh.fsf@gnu.org> <83va2616nj.fsf@gnu.org> <83tvhozx4l.fsf@gnu.org>
Message-ID: <83lg2zy713.fsf@gnu.org>

> From: Egmont Koblinger
> Date: Fri, 1 Feb 2019 13:40:48 +0100
> Cc: unicode at unicode.org
>
> I now understand that presentation forms aren't an ideal
> approach, and the recommendation should be improved here.
>
> Until it happens, I'm uncertain whether using presentation form
> characters is a decent low hanging fruit that significantly improves
> the readability in some situations (e.g. "good enough" in some sense
> for Arabic), or is a dead end we shouldn't propagate.

IMNSHO, you shouldn't try solving this problem on your own. Instead, use a shaping engine, such as HarfBuzz, to do that for you, since the emulator does know which fonts it uses, and can access their properties. The only problem a terminal emulator does need to solve in this regard is what to do when N codepoints yield M /= N glyphs that the shaping engine tells you to emit, or, more generally, when the width on display after shaping is different from N times the character cell width.

> I still do not agree however that the entire responsibility can be
> shifted to the emulator. There are certain important bits of
> information that are only available to the application, and not the
> emulator - as with many other aspects, such as reordering,
> copy-pasting, searching in the data in BiDi-aware text editors using
> the terminal's explicit mode, which are all pushed to the application
> because the emulator cannot do them correctly.

As soon as you attempt to target applications that move cursor and use cursor addressing, you are in trouble, and should IMO refrain from trying to support such applications. For example, Emacs doesn't even write whole lines to the screen, it compares the internal representation of what's on the screen and what should be there, and only emits the parts that should be modified. (It does that to minimize screen writes, which might be expensive, especially if writing to a remote terminal.) In such cases, the emulator doesn't stand a chance of doing TRT, because the application doesn't provide enough context for it to reorder text correctly.

So I don't think a bidi-aware terminal emulator can support any application more complex than those which write full lines to the terminal, like 'cat', 'sed', 'diff', 'grep', etc.

> I believe we should further study the situation, e.g. see whether
> ECMA-48's SAPV (8.3.18) parameters 5..8 (to explicitly specify whether
> to use isolated/initial/medial/final form for each character) are
> flexible enough to convey all this information, or perhaps a new, more
> powerful means should be crafted.

Once again, I think it's impractical to expect applications to emit these controls. The emulator must do this part of the job.

From unicode at unicode.org Fri Feb 1 07:31:20 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 01 Feb 2019 15:31:20 +0200
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: (message from Egmont Koblinger on Fri, 1 Feb 2019 13:54:02 +0100)
References: <20190129175046.x4jkv25mfnun7q7b@angband.pl> <83tvhq166l.fsf@gnu.org> <83sgx8zx0d.fsf@gnu.org>
Message-ID: <83imy3y6s7.fsf@gnu.org>

> From: Egmont Koblinger
> Date: Fri, 1 Feb 2019 13:54:02 +0100
> Cc: Adam Borowski, unicode at unicode.org
>
> For this behavior, the only feature you need from a terminal emulator
> is to have a mode where it doesn't shuffle the characters. Currently
> every emulator I'm aware of has such a mode, although in some of them
> you have to tweak the settings to get to this mode (in my firm opinion
> it's an unacceptable user experience), while in emulators according to
> my specification there'll be an escape sequence for text-mode apps to
> automatically switch to this mode.

Like I said, as long as not every emulator supports this control, an application will need to detect its support, and that in itself is a complication.

> > This is indeed a significant issue, because it means applications
> > cannot force the terminal use a certain non-default base paragraph
> > direction.
>
> They can, since there's a dedicated escape sequence (SCP) for setting
> the base paragraph direction.

Does this change the base direction globally for the whole screen, or only for the current text? The latter is what's needed.

And again, just detecting whether this is supported is a complication. Emitting LRM or RLM as needed is much easier.

From unicode at unicode.org Fri Feb 1 07:35:35 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 1 Feb 2019 14:35:35 +0100
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: <83pnsczwoi.fsf@gnu.org>
References: <83k1in30kh.fsf@gnu.org> <83o97y15g5.fsf@gnu.org> <83pnsczwoi.fsf@gnu.org>
Message-ID:

Hi,

On Thu, Jan 31, 2019 at 4:14 PM Eli Zaretskii wrote:

> I suggest that you show the result to someone who does read Arabic.

I contacted one guy who is pretty knowledgeable in Arabic scripts, as well as terminal emulation, I sent out an early unpublished version of the proposal to him, but unfortunately he was busy and didn't have the chance to respond.

Let this thread be one where we invite Arabic folks to comment :)

> Small changes can be very unpleasant to the eyes of an Arabic reader.

I can easily imagine that!

I can assure you, seeing ? instead of ? in my native language is extremely unpleasant to my eyes. Depending on the font you're using, you may not even have spotted any difference. But could someone argue for example that seeing an "i" and "w" equally wide is unpleasant to their eyes? Where do we draw the lines of what's an acceptable compromise on a platform that has technical limitations (fixed grid) to begin with? We really need input from Arabic folks to answer this.

I'm also wondering: how unpleasant is it if a letter is cut in half (e.g. overflows at the edge of the text editor), and is shaped not according to the entire word but according to the visible part?
I took it from the CSS specification that the desired behavior is to shape it according to the entire word, but I honestly don't know how acceptable or how unpleasant the other approach is.

> You could do that, but it will require a lot of non-trivial processing
> from the applications. Text-mode applications don't want any complex
> tinkering, they want just to write their text and be done. The more
> overhead you add to that simple task, the less probable it is that
> applications will support such a terminal.

I agree with your overall observation, but I'm not sure how much it applies to this context.

Text-mode applications have to run the BiDi algorithm. The one I picked can also do shaping (well, the pretty limited one, using presentation forms).

Shouldn't any BiDi algorithm also provide methods for shaping that produce some output that can be easily sent to the terminals? Shouldn't we push for them?

As far as I imagine the ideal solution, doing this part of shaping shouldn't be any harder for apps than doing BiDi, basically all they would need to do is hook up to existing API methods. Of course, given the current APIs, it's probably really not this simple.

cheers,
egmont

From unicode at unicode.org Fri Feb 1 07:36:58 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 01 Feb 2019 15:36:58 +0200
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: (message from Egmont Koblinger on Fri, 1 Feb 2019 14:16:03 +0100)
References: <20190129175046.x4jkv25mfnun7q7b@angband.pl> <83r2cu15z7.fsf@gnu.org> <83r2cszwvp.fsf@gnu.org>
Message-ID: <83h8dny6it.fsf@gnu.org>

> From: Egmont Koblinger
> Date: Fri, 1 Feb 2019 14:16:03 +0100
> Cc: Adam Borowski, unicode at unicode.org
>
> There's absolutely no way we could reorder first, and then handle
> TAB's cursor movement. TAB's cursor movement happens in the lower
> layer, reordering happens in the upper one.

But that means you won't ever be able to be in compliance with UAX#9, because TAB has distinct properties that affect the UBA. If you reorder after all TABs have been converted to spaces, you will not be able to implement the support for Segment Separator characters. Am I missing something?

From unicode at unicode.org Fri Feb 1 07:44:21 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 1 Feb 2019 14:44:21 +0100
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: <83mungzw3y.fsf@gnu.org>
References: <83k1in30kh.fsf@gnu.org> <83va2616nj.fsf@gnu.org> <20190130163142.ewknnpfnnl5y4m2c@angband.pl> <83mungzw3y.fsf@gnu.org>
Message-ID:

On Thu, Jan 31, 2019 at 4:26 PM Eli Zaretskii wrote:

> > Yes, I do argue that emacs will need to print a new escape sequence.
> > Which is much-much-much-much-much better than having to tell users to
> > go into the settings of their macOS Terminal / Konsole /
> > gnome-terminal etc. and disable BiDi there, isn't it?
>
> I'm not sure I agree. Most users can disable bidi reordering of the
> terminal once and for all. They don't need it.

What users are we talking about?

Those who don't need BiDi ever at all? Everything is already perfect for them! They shouldn't care about the "enable BiDi" settings of their terminal, either value will result in the same, correct behavior for them.

Or do we talk about users who care about BiDi inside Emacs, but don't care about BiDi when echo'ing, cat'ing...? Do such users exist? Well, even if they do, they're not the only target of my work.

Remember: My proposal aims to address both the Emacs as well as the echo/cat/... use cases. These are substantially different use cases that require the terminal emulator to be in a different mode, and thus automatic switching between the two modes has to be solved.

cheers,
egmont

From unicode at unicode.org Fri Feb 1 07:47:22 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 1 Feb 2019 14:47:22 +0100
Subject: Proposal for BiDi in terminal emulators
In-Reply-To:
References: <83k1in30kh.fsf@gnu.org> <83o97y15g5.fsf@gnu.org>
Message-ID:

Hi Ken,

> [language tag]
> That is a complete non-starter for the Unicode Standard.

Thanks for your input!

(I hope it was clear that I just started throwing in random ideas, as in a brainstorming session. This one is ruled out, then.)

cheers,
egmont

From unicode at unicode.org Fri Feb 1 08:15:53 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 1 Feb 2019 15:15:53 +0100
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: <20190131231719.2b545f7f@JRWUBU2>
References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2>
Message-ID:

Hi Richard,

On Fri, Feb 1, 2019 at 12:19 AM Richard Wordingham via Unicode wrote:

> Cropped why? If the problem is the truncation of lines, one can simply
> store the next character.

Yup, truncation of lines for example. I agree that one could "store the next character". We could extend the terminal emulation protocol where by some means you can specify that column 80 contains a letter X, and even though there's no column 81, an app can still tell the terminal emulator that it should imagine that column 81 contains the letter Y, and perform shaping accordingly.

This will need to be done not just at the end of the terminal, but at any position, and for both directions. Think of e.g. a vertically split tmux. You should be able to tell that column 40 contains X which should be shaped as if column 41 contained Y, and column 41 contains Z which should be shaped as if column 40 contained A.

What I cannot see at all is how this could be done "simply". Could you please elaborate on that? I don't find this simple at all!
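
As an illustration only (not part of the original message), the "shape column 40 as if column 41 contained Y" idea can be sketched in Python by choosing a presentation form from the neighbours the application claims exist, whether or not they are visible. The function `shape` and its `joins_prev` / `joins_next` flags are hypothetical names, and Arabic presentation forms cover only a part of real shaping, as noted earlier in this thread.

```python
import unicodedata

def shape(letter, joins_prev, joins_next):
    """Pick an Arabic presentation form for `letter`.

    joins_prev / joins_next are hypothetical hints saying whether the
    (possibly invisible, merely imagined) logical neighbours connect to
    this letter. Letters lacking a given form fall back to a simpler one.
    """
    base = unicodedata.name(letter)  # e.g. "ARABIC LETTER BEH"
    if joins_prev and joins_next:
        wanted = ["MEDIAL", "FINAL", "ISOLATED"]
    elif joins_prev:
        wanted = ["FINAL", "ISOLATED"]
    elif joins_next:
        wanted = ["INITIAL", "ISOLATED"]
    else:
        wanted = ["ISOLATED"]
    for form in wanted:
        try:
            # Presentation forms are named "<base name> <FORM> FORM".
            return unicodedata.lookup(f"{base} {form} FORM")
        except KeyError:
            continue  # this letter has no such form (e.g. ALEF MEDIAL)
    return letter  # no presentation form at all: leave unchanged

BEH = "\u0628"
# BEH in the last visible column, with an imagined joining neighbour after it:
print(unicodedata.name(shape(BEH, joins_prev=True, joins_next=True)))
```

The point of the sketch is only that the shape of a cell depends on information that may live outside the visible grid, which is why some extra channel from the application to the emulator would be needed.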

> > It's not able to
> > separate different UI elements that happen to be adjacent in the
> > terminal, separated by different background color or such.
>
> ZWJ and ZWNJ can handle that.

Wouldn't it be a semantic misuse of these characters, though? They are supposed to be present in the logical order, and in logical order (that is: the terminal's implicit mode) they can work as desired. Are they okay to be present in visual order (the terminal's explicit mode, what we're discussing now) too? Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined above.

> If a general text manipulating application, e.g. cat, grep or awk, is
> writing to a file, it should not convert normal Arabic characters to
> presentation forms. You are now asking a general application to
> determine whether it is writing to a terminal or not, and alter its
> output if it is writing to a terminal.

No, this is absolutely not what I'm talking about!

There are two vastly different modes of the terminal. For "cat", "grep" etc. the terminal will be in implicit mode. Absolutely no BiDi handling is expected from these apps, the terminal will do BiDi and shaping (perhaps using HarfBuzz; perhaps using presentation form characters as a temporarily low hanging fruit until a better one is implemented - the choice is obviously up to the implementation and not to the specification).

For "emacs" and friends, an explicit mode is required where visual order is passed to the terminal. What we're discussing is how to handle shaping in this mode.

> But it is an issue that needs to be addressed. As a terminal can be
> addressed by cell, an application may need to keep track of what text
> went into each cell. Misery results when the application gets it wrong.

My recommendation doesn't change this principle at all. In the lower (emulation) layer every character still goes into the cell it used to go to, and is addressable using cursor motion escapes and so on exactly as without BiDi.
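
As an illustration only (not part of the original message), the two-layer split can be sketched in Python. The class `Row` and the function `render` are hypothetical names; the reordering is a toy stand-in that reverses runs of uppercase letters, following this thread's "uppercase being RTL" convention, and is not the Unicode Bidirectional Algorithm.

```python
class Row:
    """Lower (emulation) layer: cells are stored and addressed as without BiDi."""

    def __init__(self, width):
        self.cells = [" "] * width
        self.cursor = 0

    def move_to(self, column):
        self.cursor = column

    def put(self, text):
        for ch in text:
            if self.cursor < len(self.cells):  # ignore writes past the edge
                self.cells[self.cursor] = ch
                self.cursor += 1


def render(cells):
    """Upper (display) layer: reorder a copy for display, never the stored cells.

    Toy rule, an assumption for illustration: every maximal run of
    uppercase letters is RTL and is shown reversed.
    """
    out, run = [], []
    for ch in cells:
        if ch.isupper():
            run.append(ch)
        else:
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)


row = Row(7)
row.put("foo BAR")
print(render(row.cells))   # display is reordered, storage is not

# Jumping twice to the same column overwrites the same stored cell twice.
row.move_to(4)
row.put("X")
row.move_to(4)
row.put("Y")
print("".join(row.cells), "|", render(row.cells))
```

Because `render` only reads the stored cells, it can run at display refresh rate while `put` and `move_to` handle the input stream at full speed.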

> How many cells do CJK ideographs occupy? We've had a strong hint
> that a medial BEH should occupy one cell, while an isolated BEH should
> occupy two.

CJK occupy two, but they do regardless of what's around them. That is, they already occupy two cells in the logical buffers, in the emulation layer.

There is absolutely no sane way we can make in terminal emulation a character's logical width (as in number of cells it occupies) depend on its neighboring characters. (And even if we could by some terrible hacks, it would break the principle you just stated as "misery results...", and the principle Eli stated that things should remain reasonably simple, otherwise hardly anyone will bother implementing them.) This is a compromise Arabic folks will have to accept.

When displayed, it's up to terminal emulators to perhaps widen/shrink cells as they want to (they might even totally give up on monospace fonts), but then they'll risk vertical lines not lining up perfectly vertically, content overflowing on the right etc. Konsole does such things.

cheers,
egmont

From unicode at unicode.org Fri Feb 1 08:59:30 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 01 Feb 2019 16:59:30 +0200
Subject: Proposal for BiDi in terminal emulators
In-Reply-To: (message from Egmont Koblinger on Fri, 1 Feb 2019 14:35:35 +0100)
References: <83k1in30kh.fsf@gnu.org> <83o97y15g5.fsf@gnu.org> <83pnsczwoi.fsf@gnu.org>
Message-ID: <83ftt7y2p9.fsf@gnu.org>

> From: Egmont Koblinger
> Date: Fri, 1 Feb 2019 14:35:35 +0100
> Cc: Frédéric Grosshans, unicode at unicode.org
>
> > You could do that, but it will require a lot of non-trivial processing
> > from the applications. Text-mode applications don't want any complex
> > tinkering, they want just to write their text and be done. The more
> > overhead you add to that simple task, the less probable it is that
> > applications will support such a terminal.
>
> I agree with your overall observation, but I'm not sure how much it
> applies to this context.
>
> Text-mode applications have to run the BiDi algorithm. The one I
> picked can also do shaping (well, the pretty limited one, using
> presentation forms).

Reordering and shaping have different requirements. Reordering can be done based only on the codepoints, whereas shaping needs also intimate knowledge of the fonts being used. The former can be done by a text-mode application, the latter cannot, not anywhere close to what readers of the respective scripts would expect.

From unicode at unicode.org Fri Feb 1 10:42:10 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 1 Feb 2019 17:42:10 +0100
Subject: Proposal for BiDi in terminal emulators
In-Reply-To:
References:
Message-ID:

Hi,

I'm trying to respond to every question, but I'm having a hard time keeping up :-)

Thanks a lot for all the precious input about shaping!

Here's my suggestion, for version 0.2 of the recommendation:

- No longer encourage any use of presentation form characters.

- State that it's the terminal emulator's task to perform shaping, both in implicit and explicit modes.

- Leave it for a future enhancement to handle trickier cases in explicit mode, such as shaping of a word that's only partially visible, or preventing shaping when two words happen to touch each other and are visually separated by other means (e.g. background color). Leave it for further research whether we could use ZWJ/ZWNJ here, whether we could use ECMA's SAPV 5-8 & 21-11, or whether we should invent something new (perhaps even telling the terminal emulator what neighboring previous/next characters to imagine there for the purpose of shaping)...

Let me know if you have any remaining problems/concerns/etc.

As for the implementation in VTE: initially I'll still use presentation form characters, solely because that's a low hanging fruit approach (low investment, high gain).
I've already implemented it in about an hour (a bit of further hacks will be necessary to extend it to explicit mode, but still easily doable), whereas switching to HarfBuzz is expected to take weeks of heavy work. We'll tackle that in a subsequent version. And if anyone's happy to help, there's already some bounty for HarfBuzz support :)

Thanks again for the great guidance!

cheers,
egmont

On Tue, Jan 29, 2019 at 1:50 PM Egmont Koblinger wrote:
>
> Hi,
>
> Terminal emulators are a powerful tool used by many people for various
> tasks. Most terminal emulators' bugtracker has a request to add RTL /
> BiDi support. Unicode has supported BiDi for about 20 years now.
> Still, the intersection of these two fields isn't solved. Even some
> Unicode experts have stated over time that no one knows how to do it
> properly.
>
> The only documentation I could find (ECMA TR/53) predates the Unicode
> BiDi algorithm, and as such no surprise that it doesn't follow the
> current state of the art or best practices.
>
> Some terminal emulators decided to run the BiDi algorithm for display
> purposes on its lines (rather than paragraphs, uh), not seeing the big
> picture that such a behavior turns them into a platform on top of
> which it's literally impossible to implement proper BiDi-aware text
> editing (vim, emacs, whatever) experience. In turn, vim, emacs and
> friends stand there clueless, not knowing how to do BiDi in terminals.
>
> With about 5 years of experience in terminal emulator development, and
> some prior BiDi homepage developing experience with the kind mentoring
> of one of the BiDi gurus (Aharon, if you're reading this, hi there!),
> I decided to tackle this issue. I studied and evaluated the
> aforementioned documentation and the behavior of such terminals,
> pointed out the problems, and came up with a draft proposal.
>
> My work isn't complete yet. One of the most important pending issues
> is to figure out how to track BiDi control characters (e.g. which
> character cells they belong to), it is to be addressed in a subsequent
> version. But I sincerely hope I managed to get the basics right and
> clean enough so that work can begin on implementing proper support in
> terminal emulators as well as fullscreen text applications; and as we
> gain experience and feedback, extending the spec to address the
> missing bits too.
>
> You can find this (draft) specification at [1]. Feedback is welcome -
> if it's an actionable one then preferably over there in the project's
> bugtracker.
>
> [1] https://terminal-wg.pages.freedesktop.org/bidi/
>
>
> cheers,
> egmont (GNOME Terminal / VTE co-developer)

From unicode at unicode.org Fri Feb 1 12:19:53 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 1 Feb 2019 19:19:53 +0100
Subject: Encoding italic
In-Reply-To: <69f43412.412.168a368b74a.Webtop.72@btinternet.com>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com>
Message-ID:

The proposal would contradict the goals of variation selectors and would pollute the variation sequences registry (possibly even creating conflicts). And if we admit it for italics, then another VSn will be dedicated to bold, and another for monospace, and finally many would follow for various style modifiers.
Finally we would no longer have enough variation selectors for all requests. And what we would have done is merely try to reproduce another existing styling standard, but very inefficiently, and this use will be "abused" for all usages, creating new implementation constraints and goals that contradict existing styling languages: those would then decide to make these characters incompatible for use in conforming applications. The Unicode encoding would have lost all its interest. I do not support the idea of encoding generic styles (applicable to 100k+ existing characters) using variation selectors. Their goal is only to allow semantic distinctions when two glyphs that were unified may occasionally (not always) have some significance in specific languages. But what you propose would apply to all languages, all scripts, and would definitely reserve some of the few existing VSn for this styling use, blocking further registration of needed distinctions (VSn characters are notably needed for sinographic scripts to properly represent toponyms or person names, or to solve some problems existing with generic character properties in Unicode that cannot be changed because of stability rules). On Thu, 31 Jan 2019 at 16:32, wjgo_10009 at btinternet.com via Unicode <unicode at unicode.org> wrote: > Is the way to try to resolve this for a proposal document to be produced > for using Variation Selector 14 in order to produce italics and for the > proposal document to be submitted to the Unicode Technical Committee? > > If the proposal is allowed to go to the committee rather than being > ruled out of scope, then we can know whether the Unicode Technical > Committee will allow the encoding. > > William Overington > > Thursday 31 January 2019 > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Fri Feb 1 12:57:43 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 1 Feb 2019 18:57:43 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190201110245.GA2111@macbook.localdomain> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190201110245.GA2111@macbook.localdomain> Message-ID: <20190201185743.0a8b7df5@JRWUBU2> On Fri, 1 Feb 2019 13:02:45 +0200 Khaled Hosny via Unicode wrote: > On Thu, Jan 31, 2019 at 11:17:19PM +0000, Richard Wordingham via > Unicode wrote: > > On Thu, 31 Jan 2019 12:46:48 +0100 > > Egmont Koblinger wrote: > > > > No. How many cells do CJK ideographs occupy? We've had a strong > > hint that a medial BEH should occupy one cell, while an isolated > > BEH should occupy two. > > Monospaced Arabic fonts (there are not that many of them) are designed > so that all forms occupy just one cell (most even including the > mandatory lam-alef ligatures), unlike CJK fonts. > > I can imagine the terminal restricting itself to monspaced fonts, > disable ?liga? feature just in case, and expect the font to well > behave. Any other magic is likely to fail. Of course, strictly speaking, a monospaced font cannot support harakat as Egmont has proposed. Richard. From unicode at unicode.org Fri Feb 1 15:30:34 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 1 Feb 2019 21:30:34 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <83o97y15g5.fsf@gnu.org> Message-ID: <20190201213034.4f7c316e@JRWUBU2> On Fri, 1 Feb 2019 14:47:22 +0100 Egmont Koblinger via Unicode wrote: > Hi Ken, > > > [language tag] > > That is a complete non-starter for the Unicode Standard. > > Thanks for your input! > > (I hope it was clear that I just started throwing in random ideas, as > in a brainstorming session. This one is ruled out, then.) 
Language tagging is already available in Unicode, via the tag characters in the deprecated plane. Richard. From unicode at unicode.org Fri Feb 1 16:18:13 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 01 Feb 2019 15:18:13 -0700 Subject: Proposal for BiDi in terminal emulators Message-ID: <20190201151813.665a7a7059d7ee80bb4d670165c8327d.1b680d631f.wbe@email03.godaddy.com> Richard Wordingham wrote: > Language tagging is already available in Unicode, via the tag > characters in the deprecated plane. Plane 14 isn't deprecated -- that isn't a property of planes -- and the tag characters U+E0020 through U+E007E have been un-deprecated for use with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are deprecated. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Feb 1 16:28:09 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Fri, 1 Feb 2019 22:28:09 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190201151813.665a7a7059d7ee80bb4d670165c8327d.1b680d631f.wbe@email03.godaddy.com> References: <20190201151813.665a7a7059d7ee80bb4d670165c8327d.1b680d631f.wbe@email03.godaddy.com> Message-ID: On Fri, 1 Feb 2019 at 22:20, Doug Ewell via Unicode wrote: > > Richard Wordingham wrote: > > > Language tagging is already available in Unicode, via the tag > > characters in the deprecated plane. > > Plane 14 isn't deprecated -- that isn't a property of planes -- and the > tag characters U+E0020 through U+E007E have been un-deprecated for use > with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are > deprecated. Cancel Tag is not deprecated any longer either (http://www.unicode.org/Public/UNIDATA/PropList.txt). 
Andrew From unicode at unicode.org Fri Feb 1 16:29:49 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 1 Feb 2019 14:29:49 -0800 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190201213034.4f7c316e@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <83o97y15g5.fsf@gnu.org> <20190201213034.4f7c316e@JRWUBU2> Message-ID: <1bbf61be-07d8-492e-b58d-da7a0d720a4b@att.net> Richard, On 2/1/2019 1:30 PM, Richard Wordingham via Unicode wrote: > > Language tagging is already available in Unicode, via the tag characters > in the deprecated plane. > Recte: 1. Plane 14 is not a "deprecated plane". 2. The tag characters in the Tag Character block (U+E0000..U+E007F) are not deprecated. (They are used, for example, by UTS #51 to specify emoji tag sequences.) 3. However, the use of U+E0001 LANGUAGE TAG and the mechanism of using tag characters for spelling out language tags are explicitly deprecated by the standard. See: "Deprecated Use for Language Tagging" in Section 23.9 Tag Characters. https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf#G30427 and PropList.txt: E0001 ; Deprecated # Cf LANGUAGE TAG As I stated earlier: language tags should use BCP 47, and belong in the markup level, not in the plain text stream. --Ken From unicode at unicode.org Fri Feb 1 17:38:04 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Sat, 02 Feb 2019 00:38:04 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190201185743.0a8b7df5@JRWUBU2> Message-ID: On 2019-02-01 19:57, "Richard Wordingham via Unicode" wrote: > On Fri, 1 Feb 2019 13:02:45 +0200 > Khaled Hosny via Unicode wrote: > >> On Thu, Jan 31, 2019 at 11:17:19PM +0000, Richard Wordingham via >> Unicode wrote: >>> On Thu, 31 Jan 2019 12:46:48 +0100 >>> Egmont Koblinger wrote: >>> >>> No. How many cells do CJK ideographs occupy? We've had a strong >>> hint that a medial BEH should occupy one cell, while an isolated >>> BEH should occupy two.
>> >> Monospaced Arabic fonts (there are not that many of them) are designed >> so that all forms occupy just one cell (most even including the >> mandatory lam-alef ligatures), unlike CJK fonts. >> >> I can imagine the terminal restricting itself to monspaced fonts, >> disable ?liga? feature just in case, and expect the font to well >> behave. Any other magic is likely to fail. > > Of course, strictly speaking, a monospaced font cannot support harakat > as Egmont has proposed. > > Richard. (harakat: non-spacing vowel mark in Arabic) "Monospaced font" is really a concept with modification. Even for "plain old ASCII" there are two advance widths, not just one: 0 for control characters (and escape/control sequences, neither of which should directly consult the font; even such things as OSC sequences, but the latter are a bad idea to have in any line one might wish to edit (vi/emacs/...) via a terminal emulator window). But terminals (read terminal emulators) can deal with mixed single width and double width characters (which is, IIUC, the motivation for the datafile EastAsianWidth.txt). Likewise non-spacing combining characters should be possible to deal reasonably with. It is a lot more difficult to deal with BiDi in a terminal emulator, also shaping may be hard to do, as well as reordering (or even splitting) combining characters. All sorts of problems arise; feeding the emulator a character (or "short" strings) at a time not allowed to buffer for display (causing reshaping or movement of already displayed characters, edit position movement even within a single line, etc.). Even if solvable for a "GUI" text editor (not via a terminal), they do not seem to be workable in a terminal (emulator) setting. Esp. not if one also wants to support multiline editing (vi/emacs/...) or even single-line editing. 
As long as editing is limited to a single line (such as the system line editor, or an "enhanced functionality" line editor (such as that used for bash; moving in the history sets the edit position at EOL)), even variable-width ("proportional") fonts should not pose a major problem. But for multiline editors (à la vi/emacs) it would not be possible to sync nicely (unless one accepts strange jumps) the visual edit position and the actual edit position in the edit buffer: the program would not have access to the advance width data from the font that the terminal emulator uses, unless one revolutionises what terminal emulators do... (And I don't see a case for doing that.) But both a terminal emulator and multiline editing programs (for terminal emulators) can still have access to EastAsianWidth data as well as to which characters are non-spacing; those are not font dependent. (There might be some glitches if the Unicode versions used do not match (the terminal emulator and the program being run are most often on different systems), but only for characters where these properties have changed, e.g. newly allocated non-spacing marks.) /Kent K PS No, I have not done extensive testing of various terminal emulators on how well they handle the stuff above.
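A minimal sketch of the point above, assuming Python's standard unicodedata module as a stand-in for the EastAsianWidth data and for the list of non-spacing characters: the number of cells is derived from character properties only, never from a font. The names cell_width and string_width are made up for illustration, and a real emulator applies more rules than this (ambiguous width, emoji sequences, and so on).

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Cells occupied by one code point, using Unicode properties only."""
    category = unicodedata.category(ch)
    # Non-spacing/enclosing marks, format characters and controls
    # do not advance the cursor in this simplified model.
    if category in ("Mn", "Me", "Cf", "Cc"):
        return 0
    # Wide (W) and Fullwidth (F) characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(text: str) -> int:
    """Total number of cells for a string, summed per code point."""
    return sum(cell_width(ch) for ch in text)
```

Both the emulator and the application can compute this independently; they only disagree where their Unicode data versions differ, which is the glitch mentioned above.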
From unicode at unicode.org Fri Feb 1 17:41:25 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Sat, 2 Feb 2019 01:41:25 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190201185743.0a8b7df5@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190201110245.GA2111@macbook.localdomain> <20190201185743.0a8b7df5@JRWUBU2> Message-ID: <20190201234125.GB2111@macbook.localdomain> On Fri, Feb 01, 2019 at 06:57:43PM +0000, Richard Wordingham via Unicode wrote: > On Fri, 1 Feb 2019 13:02:45 +0200 > Khaled Hosny via Unicode wrote: > > > On Thu, Jan 31, 2019 at 11:17:19PM +0000, Richard Wordingham via > > Unicode wrote: > > > On Thu, 31 Jan 2019 12:46:48 +0100 > > > Egmont Koblinger wrote: > > > > > > No. How many cells do CJK ideographs occupy? We've had a strong > > > hint that a medial BEH should occupy one cell, while an isolated > > > BEH should occupy two. > > > > Monospaced Arabic fonts (there are not that many of them) are designed > > so that all forms occupy just one cell (most even including the > > mandatory lam-alef ligatures), unlike CJK fonts. > > > > I can imagine the terminal restricting itself to monspaced fonts, > > disable ?liga? feature just in case, and expect the font to well > > behave. Any other magic is likely to fail. > > Of course, strictly speaking, a monospaced font cannot support harakat > as Egmont has proposed. There are two approaches for handling them in monospaced fonts; combining them with base characters as usual, or as spacing characters placed next to their bases. The later approach is a bit unusual, but makes editing heavily voweled text a bit more pleasant. It requires good OpenType support, though, so virtually no terminal supports it. Regards, Khaled -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ALMFixed.png Type: image/png Size: 12872 bytes Desc: not available URL: From unicode at unicode.org Fri Feb 1 20:06:15 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 2 Feb 2019 02:06:15 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <20190201185743.0a8b7df5@JRWUBU2> Message-ID: <20190202020615.03aabbfe@JRWUBU2> On Sat, 02 Feb 2019 00:38:04 +0100 Kent Karlsson via Unicode wrote: > Den 2019-02-01 19:57, skrev "Richard Wordingham via Unicode" > : > "Monospaced font" is really a concept with modification. Even for > "plain old ASCII" there are two advance widths, not just one: 0 for > control characters (and escape/control sequences, neither of which > should directly consult the font; even such things as OSC sequences, > but the latter are a bad idea to have in any line one might wish to > edit (vi/emacs/...) via a terminal emulator window). But terminals > (read terminal emulators) can deal with mixed single width and double > width characters (which is, IIUC, the motivation for the datafile > EastAsianWidth.txt). Likewise non-spacing combining characters should > be possible to deal reasonably with. I remember Michael Everson getting scant sympathy here when he complained that his 'monospaced' font was rejected as such because combining characters had zero width. The rule his font fell foul of invites distinct NFC and NFD forms of the same string to be rendered differently; it does not observe the spirit of canonical equivalence. > It is a lot more difficult to deal with BiDi in a terminal emulator, > also shaping may be hard to do, as well as reordering (or even > splitting) combining characters. All sorts of problems arise;... Which is why Egmont is here looking for comments and advice. Not all terminal emulators can deal with non-spacing combining characters. 
I have recently been having unpleasant experiences with what appears to be Wikimedia's CodeEditor; it expects even non-spacing Thai vowel marks to have an advance width of one cell. The text is rendered in GUI style, i.e. according to the font selected somehow, but the cursor is positioned according to the character count. I haven't yet investigated its treatment of control characters. I think I'm going to have to make a font that works to its assumptions. Richard. From unicode at unicode.org Fri Feb 1 20:10:53 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 2 Feb 2019 02:10:53 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190201151813.665a7a7059d7ee80bb4d670165c8327d.1b680d631f.wbe@email03.godaddy.com> References: <20190201151813.665a7a7059d7ee80bb4d670165c8327d.1b680d631f.wbe@email03.godaddy.com> Message-ID: <20190202021053.5a1e827e@JRWUBU2> On Fri, 01 Feb 2019 15:18:13 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > > Language tagging is already available in Unicode, via the tag > > characters in the deprecated plane. > > Plane 14 isn't deprecated -- that isn't a property of planes -- and > the tag characters U+E0020 through U+E007E have been un-deprecated > for use with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F > CANCEL TAG are deprecated. Unicode may not deprecate the tag characters, but the characters of Plane 14 are widely deplored, despised or abhorred. That is why I think of it as the deprecated plane. Richard.
From unicode at unicode.org Fri Feb 1 22:01:42 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 2 Feb 2019 04:01:42 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> Message-ID: <20190202040142.6647729f@JRWUBU2> On Fri, 1 Feb 2019 15:15:53 +0100 Egmont Koblinger via Unicode wrote: > Hi Richard, > > On Fri, Feb 1, 2019 at 12:19 AM Richard Wordingham via Unicode > wrote: > > > Cropped why? If the problem is the truncation of lines, one can > > simple store the next character. > > Yup, trancation of line for example. > > I agree that one could "store the next character". We could extend the > terminal emulation protocol where by some means you can specify that > column 80 contains a letter X, and even though there's no column 81, > an app can still tell the terminal emulator that it should imagine > that column 81 contans the letter Y, and perform shaping accordingly. > > This will need to be done not just at the end of the terminal, but at > any position, and for both directions. Think of e.g. a vertically > split tmux. You should be able to tell that column 40 contains X which > should be shaped as if column 41 contained Y, and column 41 contains Z > which should be shaped as if column 40 contained A. > > What I canont see at all is how this could be "simply". Could you > please elaborate on that? I don't find this simple at all! > > >> > It's not able to > > > separate different UI elements that happen to be adjacent in the > > > terminal, separated by different background color or such. > > > > ZWJ and ZWNJ can handle that. > > Wouldn't it be a semantical misuse of these characters, though? No. ZWNJ is used before the inanimate plural suffix of Persian, and in at least one language, is used to distinguish one usage from the digit ? (or is it the digit ??). 
> They are supposed to be present in the logical order, and in logical > order (that is: the terminal's implicit mode) they can work as > desired. > > Are they okay to be present in visual order (the terminal's explicit > mode, what we're discussing now) too? Where do you define the order for explicit mode? There may be complications in ensuring that gets stored as the content of a single cell. > > Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined > above. Example, please. > > > If a general text manipulating application, e.g. cat, grep or awk, > > is writing to a file, it should not convert normal Arabic > > characters to presentation forms. You are now asking a general > > application to determine whether it is writing to a terminal or > > not, and alter its output if it is writing to a terminal. > > No, this absolutely not what I'm talking about! > > There are two vastly different modes of the terminal. For "cat", > "grep" etc. the terminal will be in implicit mode. Absolutely no BiDi > handling is expected from these apps, the terminal will do BiDi and > shaping (perhaps using Harfbuzz; perhaps using presentation form > characters as a temporarily low hanging fruit until a better one is > implemented ? the choice is obviously up to the implementation and not > to the specification). > > For "emacs" and friends, an explicit mode is required where visual > order is passed to the terminal. What we're discussing is how to > handle shaping in this mode. (Partitioning grapheme clusters and Indic syllables) > > But it as an issue that needs to be addressed. As a terminal can be > > addressed by cell, an application may need to keep track of what > > text went into each cell. Misery results when the application gets > > it wrong. > > My recommendation doesn't change this principle at all. 
In the lower > (emulation) layer every character still goes into the cell it used to > go to, and is addressable using cursor motion escapes and so on > exactly as without BiDi. At present, VTE positions LTR Indic preceding spacing combining marks after the consonant. I thought your draft scheme corrected this very local bidi issue, which is so local that the bidi algorithm ignores it. > > > > How many cells do CJK ideographs occupy? We've had a strong hint > > that a medial BEH should occupy one cell, while an isolated BEH > > should occupy two. > > CJK occupy two, but they do regardless of what's around them. That is, > they already occupy two cells in the logical buffers, in the emulation > layer. > > There is absolutely no sane way we can make in terminal emulation a > character's logical width (as in number of cells it occupies) depend > on its neighboring characters. (And even if we could by some terrible > hacks, it would break the principle you just said as "misery > results...", and the principle Eli said that things should remain > reasonably simple, otherwise hardly anyone will bother implementing > them.) This is a compromise Arabic folks will have to accept. So ???? _preah_ 'prefix denoting respect for gods, kings, etc.' will be three cells = <(COENG, RA), PO, YUUKALEAPINTU> and cause no confusion? Or will the cells be ? Richard. From unicode at unicode.org Sat Feb 2 05:17:28 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 2 Feb 2019 12:17:28 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <20190201185743.0a8b7df5@JRWUBU2> Message-ID: Hi Kent, On Sat, Feb 2, 2019 at 12:41 AM Kent Karlsson via Unicode wrote: > [...] neither of which > should directly consult the font [...] > But terminals > (read terminal emulators) can deal with mixed single width and double > width characters (which is, IIUC, the motivation for the datafile > EastAsianWidth.txt).
Yup, exactly; and for this reason, no terminal I'm aware of takes the single vs. double width property from the font. The logical behavior, i.e. knowing which logical cell contains what character (or which half of what character, in case of double wide ones) isn't influenced by the font. It's taken from EastAsianWidth (or other means, which we're working on: https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9 , to address e.g. incompatibilities arising from different Unicode version used by the app vs. the terminal, as you pointed out). Also think of cases like when the user modifies the font of the terminal run-time, or a headless terminal emulator, or a screen/tmux attached to multiple terminal emulators of different fonts at once... Adjusting the logical behavior according to the font would definitely be a wrong path to take. > Likewise non-spacing combining characters should > be possible to deal reasonably with. Most terminal emulators handle non-spacing combining marks, it's a piece of cake. (Spacing marks are more problematic.) > All sorts of problems arise; feeding > the emulator a character (or "short" strings) at a time not allowed > to buffer for display (causing reshaping or movement of already > displayed characters, edit position movement even within a single > line, etc.). Emulators need to update their screen to reflect whatever is in the logical buffer, and the contents of the logical buffer mustn't depend on the timing of the incoming data. As a consequence, when the input stream contains a base character + a combining accent, there is a slim chance that the base character without the combining accent makes it to the display for a short time. It's the emulator's job to "fix" it (that is, redraw the glyph with the combining accent) once the accent is received. If an emulator doesn't do it correctly, it's simply a bug in that emulator. 
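To illustrate the timing argument, here is a rough sketch in Python; it is not how any particular emulator is implemented. The Row class and its dirty set are hypothetical, double-width characters are ignored, and only general categories Mn and Me are treated as non-spacing. The point it shows is that the logical buffer ends up identical whether the base character and its accent arrive together or in separate chunks; only the set of cells to redraw is affected.

```python
import unicodedata


class Row:
    """One logical row: each cell holds a base character plus its marks."""

    def __init__(self):
        self.cells = []     # list of strings, one per cell
        self.dirty = set()  # indices of cells the display must redraw

    def feed(self, data: str) -> None:
        for ch in data:
            is_mark = unicodedata.category(ch) in ("Mn", "Me")
            if is_mark and self.cells:
                # Attach the accent to the previous cell, even if that
                # cell was already drawn without it a moment ago.
                self.cells[-1] += ch
                self.dirty.add(len(self.cells) - 1)
            else:
                self.cells.append(ch)
                self.dirty.add(len(self.cells) - 1)


together = Row()
together.feed("e\u0301x")

split = Row()
for chunk in ("e", "\u0301", "x"):
    split.feed(chunk)
```

Here together.cells and split.cells hold the same two cells; an emulator that painted "e" before the accent arrived simply has that cell marked dirty again and redraws it.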
On a side note, we're also working on an extension for atomic updates at https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9 which should significantly further decrease the chance of such intermittent screen updates. cheers, egmont From unicode at unicode.org Sat Feb 2 05:41:35 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 2 Feb 2019 11:41:35 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> Message-ID: <20190202114135.35eb394b@JRWUBU2> On Fri, 1 Feb 2019 15:15:53 +0100 Egmont Koblinger via Unicode wrote: > Hi Richard, > > On Fri, Feb 1, 2019 at 12:19 AM Richard Wordingham via Unicode > wrote: > > > Cropped why? If the problem is the truncation of lines, one can > > simple store the next character. > > Yup, trancation of line for example. > > I agree that one could "store the next character". We could extend the > terminal emulation protocol where by some means you can specify that > column 80 contains a letter X, and even though there's no column 81, > an app can still tell the terminal emulator that it should imagine > that column 81 contans the letter Y, and perform shaping accordingly. > > This will need to be done not just at the end of the terminal, but at > any position, and for both directions. Think of e.g. a vertically > split tmux. You should be able to tell that column 40 contains X which > should be shaped as if column 41 contained Y, and column 41 contains Z > which should be shaped as if column 40 contained A. > > What I canont see at all is how this could be "simply". Could you > please elaborate on that? I don't find this simple at all! I'm not conversant with the details of terminal controls and I haven't used fields. However, where I spoke of lines above, I believe you can simply translate it to fields. 
I don't know how one best handles fields - are they a list, possibly of rows within fields, or are they stored as cell attributes? If one were doing it by cell attributes, and the example above were in row 6, one might store some of the information below if 'Y' and 'A' do not appear in the display. Row 6 column 40: This is end of LTR paragraph, and treat as followed by Y Row 6 column 41: This is end of RTL paragraph, and treat as followed by A If storing attributes of rows within fields, the above information would be stored for the row within the field. If lines are wrapped, then you would probably want to store that fact instead and access the character contents indirectly. Richard. From unicode at unicode.org Sat Feb 2 05:54:16 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 2 Feb 2019 12:54:16 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190202040142.6647729f@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> Message-ID: Hi Richard, > > Are they okay to be present in visual order (the terminal's explicit > > mode, what we're discussing now) too? > > Where do you define the order for explicit mode? In explicit mode, the application (Emacs, Vim, whatever) reorders the characters, and passes visual order (left to right) to the terminal emulator. The terminal emulator preserves this visual order, doesn't reshuffle anything. How to handle ZW(N)J in visual order? What's the desired way? Is it specified anywhere? As far as I know, they specify the relation between two adjacent characters of the logical order, which might not even become adjacent in the visual. Should they always "stick" to the preceding character, for example? The Unicode BiDi algorithm doesn't seem to make a difference between base letters and combining accents for reordering. 
So, given in an RTL text a base letter + a combining accent, the BiDi algorithm gives the visual LTR order of the combining accent first (on the left), followed by the base letter. This order is not okay for terminal emulators. Combining accents have to be reordered in the output of the Unicode BiDi algorithm, so that they come after the base letter even in the visual LTR order. This is e.g. what FriBidi does by default, due to the REORDER_NSM flag. Presumably it doesn't just reorder non-spacing combining accents, but also ZW(N)J and alike symbols too, which already smells pretty problematic, doesn't it? Or is this what you need there, too? > There may be complications in ensuring that > gets stored > as the content of a single cell. How should the terminal emulator know which cell (the previous or the subsequent) do these two s belong to? > > Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined > > above. > > Example, please. Cropped strings, cropped strings that are adjacent to each other, and faulty shaping could kick in there. Two fields on the UI. One in columns 36-40 with cyan background, aiming to show ABCDEF, but due to limited room, can only show ABCDE (let's say it's scrolled horizontally this way). Another in columns 41-45 with yellow background, aiming to show UVWXYZ, but due to limited space only VWXYZ is shown (it's scrolled horizontally like this). What the terminal emulator sees is a continuous text of ABCDEVWXYZ. What the application wants to have is to get E shaped as if there was an F on its right, and get V shaped as if there was an U on its left. Once you address this problem, I'm not sure ZW(N)J are still required/desireable, rather than applying this more generic solution there as well. > At present, VTE positions LTR Indic preceding spacing combining marks > after the consonant. I though your draft scheme corrected this very > local bidi issue, which is so local that the bidi algorithm ignores it. 
Indic spacing combining marks are handled incorrectly by VTE and are being addressed in bug 584160 which I've already linked. This particular issue I don't consider BiDi at all. It's something totally different. The spacing accent can be to the right, somewhat on top of and somewhat to the right, on top of, somewhat to the left and somewhat on top of, or fully on the left. It's not binary left or right. Proper rendering should be done by the font, and not at all by the BiDi of the terminal. The terminal is unaware of how much the base glyph is shifted to the right and the accent to its left. All that the terminal needs to do (and VTE gets it wrong now) is to pass these two to whichever font rendering engine is used, in one single step. > So ???? LETTER RO, U+17C8 KHMER SIGN > _preah_ 'prefix denoting > respect for gods, kings, etc.' will be three cells = <(COENG, > RA), PO, YUUKALEAPINTU> and cause no confusion? Or will the cells be > ? First it's a base character followed by a non-spacing mark. As in most terminal emulators (and now we're absolutely not talking about my BiDi proposal) they are stored in the same cell. The first cell contains (PO, COENG). The next two are a base character followed by a spacing mark. In VTE 584160 I outline two possible approaches, but the one I'm in favor of is that the row's second cell contains RO and the third cell contains YUUKALEAPINTU, and the two are combined properly when the logical contents get displayed. Another possibility which I'm pondering is whether the emulation layer should combine them, that is, have the second cell store the "first half of (RO, YUUKA)" and the third cell store the "second half of (RO, YUUKA)". Does this make any sense? If not, could you please explain what the desired behavior is, and why? Please keep in mind that I know nothing about Khmer in particular. Anyway, here we're talking about something that's totally independent from my BiDi work.
It's also something that should be standardized across terminals, sure, but maybe not right now :) cheers, egmont From unicode at unicode.org Sat Feb 2 05:58:54 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 2 Feb 2019 12:58:54 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190202020615.03aabbfe@JRWUBU2> References: <20190201185743.0a8b7df5@JRWUBU2> <20190202020615.03aabbfe@JRWUBU2> Message-ID: Hi Richard, > Not all terminal emulators can deal with non-spacing combining > characters. Both Hebrew and Arabic seem to use non-spacing combining characters, presumably other Arabic-like scripts too. I forgot to state this explicitly in my docs, but let's just say that handling non-spacing combining accents is a prerequisite for BiDi support. Those emulators that don't handle them should be out of scope for our current discussion. cheers, egmont From unicode at unicode.org Sat Feb 2 06:18:03 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 2 Feb 2019 13:18:03 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190202114135.35eb394b@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202114135.35eb394b@JRWUBU2> Message-ID: Hi Richard, On Sat, Feb 2, 2019 at 12:43 PM Richard Wordingham via Unicode wrote: > I'm not conversant with the details of terminal controls and I haven't > used fields. However, where I spoke of lines above, I believe you can > simply translate it to fields. I don't know how one best handles > fields - are they a list, possibly of rows within fields, or are they > stored as cell attributes? The essential point is that the terminal emulator stores "cells". Pretty much all the data (with very few exceptions) resides in cells. A cell contains a base letter, followed by possibly a few non-spacing marks. A cell has a foreground color, background color, bold, underlined, italic etc. properties.
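As a rough picture of such a cell, a sketch in Python follows. It is not VTE's actual layout, which is packed far more tightly; the field names, the default color numbers and the make_cell helper are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One character cell: a base letter, its marks, and its attributes."""
    base: str = " "          # the base letter
    marks: str = ""          # zero or more non-spacing marks
    fg: int = 7              # foreground color index (assumed palette)
    bg: int = 0              # background color index (assumed palette)
    bold: bool = False
    underlined: bool = False
    italic: bool = False


def make_cell(cluster: str, **attrs) -> Cell:
    """Split a 'base + marks' string into a Cell (hypothetical helper)."""
    return Cell(base=cluster[0], marks=cluster[1:], **attrs)


# Hebrew BET with DAGESH, shown in bold: one cell, one base, one mark.
bet = make_cell("\u05d1\u05bc", bold=True)
```

Everything the emulator knows about a screen position lives in a record like this, which is why any extra per-cell field has a cost multiplied by the whole scrollback.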
How these cells are linked up, in an array or whatever, is mostly irrelevant since it's likely to be different in every implementation. Of course it is possible to extend the per-cell storage to contain a "previous" and a "next" character, as to be used for shaping purposes only. Some questions: Is this enough (e.g. aren't there cases where more than the immediate neighbor are relevant)? Is the next base character enough, or do we also need to know the combining accents that belong to that? And can't we store significantly less information than the actual letter (let's say, 1 out of 13 [randomly made up number] possible ways of shaping)? Terminal emulators potentially store a lot of data (some even support infinite scrolling), and try to handle them in some effective way. That is, they do all sorts of bitpacking and crazy stuff. E.g. some might reject adding new attributes when the per-cell size of the attribute would extend 4 or 8 bytes, both for memory and performance reasons. Another example: VTE has one global pool of all the base character + combining accents combos that it has encountered, and starts assigning single codepoints to them from U+10000000 or so, so that then for each cell the base letter + combining accents still don't require more storage than 4 bytes. The takeaway is: the less data we need to remember per cell, the better, and every bit matters. But to recap, we're now just peeking into a possible future extension of the specs to see if it's viable (I guess it is), which I believe emulators might reasonably decide not to implement, if they think performance is more important than proper shaping in all the special cases. cheers, egmont From unicode at unicode.org Sat Feb 2 07:01:46 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Sat, 02 Feb 2019 14:01:46 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: Message-ID: Den 2019-02-02 12:17, skrev "Egmont Koblinger" : > the font. 
It's taken from EastAsianWidth (or other means, which we're > working on: https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9 Yes, that too: FE0F ? VARIATION SELECTOR-16 = emoji variation selector But the issue you refer to only deals with U+FE0F. There is also U+FE0E: FE0E ? VARIATION SELECTOR-15 = text variation selector which can make a character that is "default emoji" (which are wide) into "text variant", often single-width, for instance: 1F315 FE0E ; text style; # (6.0) FULL MOON SYMBOL --- >> Likewise non-spacing combining characters should >> be possible to deal reasonably with. > > Most terminal emulators handle non-spacing combining marks, it's a > piece of cake. (Spacing marks are more problematic.) Well, I guess you may need to put some (practical) limit to the number of non-spacing marks (like max two above + max one below; overstrikes are an edge case). Otherwise one may need to either increase the line height (bad idea for a terminal emulator I think) or the marks start to visually interfere with text on other lines (even with the hinted limits there may be some interference), also a bad idea for a terminal emulator. So I'm not so sure that non-spacing marks is a piece of cake... (I.e., need to limit them.) /Kent K From unicode at unicode.org Sat Feb 2 07:15:32 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 2 Feb 2019 14:15:32 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190201151813.665a7a7059d7ee80bb4d670165c8327d.1b680d631f.wbe@email03.godaddy.com> References: <20190201151813.665a7a7059d7ee80bb4d670165c8327d.1b680d631f.wbe@email03.godaddy.com> Message-ID: Actually not all U+E0020 through U+E007E are "un-deprecated" for this use. For now emoji flags only use: - U+E0041 through U+E005A (mapping to ASCII letters A through Z used in 2-letter ISO3166-1 codes). These are usable in pairs, without requiring any modifier (and only for ISO3166-1 registered codes). 
- I think that U+0030 through U+E0039 (mapping to ASCII digits 0 through 9) are reserved for ISO3166 extensions, started with only the 3 "countries" added in the United Kingdom ("ENENG", "ENSCO" and "ENWLS"), with possible pending additions for other ISO3166-2, but not mapping any dash separator). These tags are used as modifiers in sequences starting by a leading U+1F3F4 (WAVING BLACK FLAG) emoji. - U+E007F (CANCEL TAG) is already used too for the regional extensions as a mandatory terminator, as seen in the three British countries. It is not used for country flags made of 2-letter emoji codes without any leading flag emoji. And the proposal discussed here to use U+E003C, mapped to the ASCII "<" LOWER THAN as a leading tag sequence for reencoding HTML tags in sequences terminated by U+E003E ">" (and containing HTML element names using lowercase letter tags, possibly digit tags in these names, and "/" for HTML tags terminator, possibly also U+E0020 SPACE TAG for separating HTML attributes, U+003D "=" for attribute values, U+E0022 (') or U+E0027 (") around attribute values, but a problem if the mapped element names or attributes contain non-ASCII characters...) is not standard (it's just an experiment in one font), and would in fact not be compatible with the existing specification for tags. So only E+E0020 through U+E0040, and U+E005B through U+E007E remain deprecated. Le ven. 1 févr. 2019 à 23:26, Doug Ewell via Unicode a écrit : > Richard Wordingham wrote: > > > Language tagging is already available in Unicode, via the tag > > characters in the deprecated plane. > > Plane 14 isn't deprecated -- that isn't a property of planes -- and the > tag characters U+E0020 through U+E007E have been un-deprecated for use > with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are > deprecated. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org
From unicode at unicode.org Sat Feb 2 09:12:47 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 2 Feb 2019 15:12:47 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: Message-ID: <20190202151247.0118dec4@JRWUBU2> On Sat, 02 Feb 2019 14:01:46 +0100 Kent Karlsson via Unicode wrote: > Den 2019-02-02 12:17, skrev "Egmont Koblinger" : > > Most terminal emulators handle non-spacing combining marks, it's a > > piece of cake. (Spacing marks are more problematic.) > Well, I guess you may need to put some (practical) limit to the number > of non-spacing marks (like max two above + max one below; overstrikes > are an edge case). Otherwise one may need to either increase the line > height (bad idea for a terminal emulator I think) or the marks start > to visually interfere with text on other lines (even with the hinted > limits there may be some interference), also a bad idea for a terminal > emulator. So I'm not so sure that non-spacing marks is a piece of > cake... (I.e., need to limit them.) Doesn't Jerusalem in biblical Hebrew sometimes have 3 marks below the lamedh? The depth then is the maximum depth, not the sum of the depths. Tai Lue has 'mai sat 3 lem' - that's three marks above for a combination common enough to have a name. Throw in the repetition mark and that's four marks above if you treat the subscript consonant as a mark (or code it to comply with the USE's erroneous grammar). Richard.
From unicode at unicode.org Sat Feb 2 10:35:13 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 2 Feb 2019 16:35:13 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202114135.35eb394b@JRWUBU2> Message-ID: <20190202163513.0bd17c3e@JRWUBU2> On Sat, 2 Feb 2019 13:18:03 +0100 Egmont Koblinger via Unicode wrote: > Hi Richard, > > On Sat, Feb 2, 2019 at 12:43 PM Richard Wordingham via Unicode > wrote: > > > I'm not conversant with the details of terminal controls and I > > haven't used fields. However, where I spoke of lines above, I > > believe you can simply translate it to fields. I don't know how > > one best handles fields - are they a list, possibly of rows within > > fields, or are they stored as cell attributes? > > The very essential is that the terminal emulator stores "cells". > Pretty much all the data (with very few exceptions) resides in cells. > > A cell contains a base letter, followed by possibly a few non-spacing > marks. A cell has a foreground color, background color, bold, > underlined, italic etc. properties. > > How these cells are linked up, in an array or whatever, is mostly > irrelevant since it's likely to be different in every implementation. > > Of course it is possible to extend the per-cell storage to contain a > "previous" and a "next" character, as to be used for shaping purposes > only. Some questions: Is this enough (e.g. aren't there cases where > more than the immediate neighbor are relevant)? Is the next base > character enough, or do we also need to know the combining accents > that belong to that? And can't we store significantly less information > than the actual letter (let's say, 1 out of 13 [randomly made up > number] possible ways of shaping)? Truncation at the start of the string gives us the clearest nasty. 
If you look at TUS Figure 13-7, you'll find that the final U+182D in ?????? _jarlig_ 'order' and ????? _chirig_ 'soldier' should be different because the former word has a masculine vowel, namely U+1820, and the latter doesn't. When written horizontally, the Mongolian script is left-to-right, i.e. upside down compared to its Aramaic ancestor. What we need to note is the preceding 'gender'-determining vowel. There are analogues of THAI CHARACTER SARA AM in the Tai Tham script - and . In all the examples of the latter I've seen, U+1A74 is placed over the preceding consonant, so if U+1A64 is lost through lack of space, the U+1A74 should still remain. The former is a matter of style. Outside Thailand, the mark above is clearly associated (with one exception) with the U+1A74, so both can safely vanish together. In Thailand, the U+1A74 can be associated with the consonant instead, or hover over the gap between consonant and vowel. The exception is the ligature . That should really only get one cell. The combination ???? 'water, fluid' looks like . There are then some interesting Indic phenomena depending on how one treats subscript consonants. The coding structure is widespread. As a lesser form of this, in Khmer the first consonant and U+17B6 ligate, and the ligation is highly visible on that consonant even if the vowel is covered up. If the display were to chop off the second consonant, all that need be remembered is the following vowel. There is also the repha and analogues. Repha is graphically a superscript mark, but is usually encoded as . Burmese kinzi is similar, but has a 3-character code. They really ought to be associated with the same cell as the immediately following consonant. The good news is that the record of the relevant neighbour can be compressed to a few bits. Richard.
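As a rough illustration of "compressed to a few bits", the sketch below packs a code point together with two neighbour flags into one 32-bit value. The bit layout, the flag names and the helper functions are all assumptions made for this example; no emulator is known to use this layout, and it covers only Arabic-style joining, not the Mongolian vowel-gender or Khmer cases mentioned above.

```python
# Hypothetical layout: bits 0-20 hold the code point (Unicode needs 21 bits),
# bit 21 records "joins to the previous letter", bit 22 "joins to the next letter".
CODEPOINT_BITS = 21
CODEPOINT_MASK = (1 << CODEPOINT_BITS) - 1
JOIN_PREV = 1 << 21
JOIN_NEXT = 1 << 22

# The two flags together select one of the four classic Arabic forms.
FORMS = {
    (False, False): "isolated",
    (False, True): "initial",
    (True, True): "medial",
    (True, False): "final",
}


def pack_cell(codepoint: int, join_prev: bool, join_next: bool) -> int:
    """Pack a code point and its shaping context into one integer."""
    value = codepoint & CODEPOINT_MASK
    if join_prev:
        value |= JOIN_PREV
    if join_next:
        value |= JOIN_NEXT
    return value


def unpack_cell(value: int):
    """Return (codepoint, join_prev, join_next) from a packed value."""
    return (value & CODEPOINT_MASK, bool(value & JOIN_PREV), bool(value & JOIN_NEXT))


def form_name(value: int) -> str:
    """Name the joining form implied by the stored neighbour flags."""
    _, join_prev, join_next = unpack_cell(value)
    return FORMS[(join_prev, join_next)]
```

With this layout a cropped field can still shape its edge letter correctly: the flags say how the hidden neighbour would have joined, without storing the neighbour itself.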
From unicode at unicode.org Sat Feb 2 13:58:06 2019 From: unicode at unicode.org (Benjamin Riefenstahl via Unicode) Date: Sat, 02 Feb 2019 20:58:06 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: (Egmont Koblinger via Unicode's message of "Tue, 29 Jan 2019 13:50:31 +0100") References: Message-ID: <8736p69d4h.fsf@turtle-trading.net> Hi Egmont, hi all, This is an interesting discussion here. If only because I would have thought that there is only minimal interest by the actual target audience in supporting these scripts in a terminal, given the severe limitations of that environment. The most important limitation seems to me that a monospaced font must be used, which does not suit most scripts that do shaping. On the script-level I am familiar with Arabic, Syriac and Mandaic (I don't actually speak any of these languages, so if you want a real expert, I am not that person). Monospaced Arabic struggles and is not very elegant. I have not seen solutions for monospaced Syriac or Mandaic but I have trouble even imagining them. OTOH, that inelegance maybe can be an excuse (or a guide if you prefer) to make the implementation simpler in other respects, because expectations should be lower than for a graphical application. Anyway, as a concrete addition to the discussion, I have a simple Arabic shaping solution for Emacs on the terminal, especially on the Linux console, and this discussion finally made me make it public on Gitlab, see https://gitlab.com/cc_benny/termshape. The Gitlab CD is activated, so (mostly) ready-made Emacs packages can be downloaded as build artifacts. If anybody wants to discuss this implementation, we should probably move that discussion somewhere else, like to the Emacs mailing list (https://lists.gnu.org/mailman/listinfo/emacs-devel).
Some specific technical points from thinking about the problem on my side:

Presentation forms: Termshape uses the Arabic presentation forms available and so it is somewhat limited as mentioned by Eli. Given that we need to keep the implementation simple anyway, I am not sure that significantly more is really needed, at least given what Emacs provides already. Additional character forms could be added, where the Unicode repertoire is not sufficient. This could use PUA characters or other means like terminal control sequences. In both cases a common understanding would be needed between the terminal (or the font used by it) and the application, outside of Unicode.

Ligatures: With most shaping one character is transformed into a character form that still only occupies one cell. A ligature like lam-alif OTOH only occupies one cell for two characters, so for justification etc. the application will have to know that the two characters together have a width of 1 on the screen. This is easier if the application does the selection of ligatures. If you want to do this in the terminal, the application would probably need to have some way to measure the display width of a string, so that it can handle the situation. Be prepared though for the application to make quite a lot of these requests. For my own main use case for Emacs on a terminal, display over SSH, that could become a problem.

Diacritics: The application can know what is a non-spacing character and what is not. So it can know that diacritics do not occupy their own cell and it should be able to ignore whether the terminal supports a specific diacritic or not. If the terminal does not support a diacritic the terminal can either just leave it out or the terminal can mess up the display more or less irreparably. In the first case, the worst is that the user does not see the character, in the second case the application cannot do anything about it with reasonable effort IMO.
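The three points above can be illustrated with the Unicode Character Database alone. The sketch below is not taken from Termshape; the helper names are made up for this example. It relies only on Python's standard `unicodedata` module, whose compatibility decompositions record which base letters an Arabic presentation form stands for, and whose general categories tell non-spacing marks apart.

```python
import unicodedata


def describe_form(ch: str):
    """Return (form_tag, base_letters) for a compatibility character, else None.

    For Arabic presentation forms the tag is one of
    'isolated', 'initial', 'medial' or 'final'.
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp.startswith("<"):
        return None  # no decomposition, or a canonical one
    tag, *codepoints = decomp.split()
    return tag.strip("<>"), "".join(chr(int(cp, 16)) for cp in codepoints)


def is_nonspacing(ch: str) -> bool:
    """True for characters that should not occupy a cell of their own."""
    return unicodedata.category(ch) in ("Mn", "Me")


def logical_length(presentation_text: str) -> int:
    """Count the logical letters behind a string of presentation forms."""
    total = 0
    for ch in presentation_text:
        described = describe_form(ch)
        total += len(described[1]) if described else 1
    return total
```

The lam-alef ligature U+FEFB is a single character (one cell in a monospaced Arabic font) that stands for two logical letters, which is exactly the width bookkeeping problem described under "Ligatures"; the same mapping is what a terminal would need in order to give back faithful text on copy and paste.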
A real problem is a combination of diacritics and ligatures. Any diacritic applies to only one character in the ligature, and between the application and the terminal it is currently not possible to determine which one. This is one area where an implementation in the terminal would clearly have the advantage. But a terminal control sequence could also help. IMO we are talking about a luxury problem here, though. Do we want to set as our first goal showing complete quranic verses in all their glory, or are we satisfied with everyday Arabic like say the website of a modern Arabic newspaper? Thanks for your effort and for starting this discussion, benny
From unicode at unicode.org Sat Feb 2 14:50:59 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sat, 2 Feb 2019 13:50:59 -0700 Subject: Use of tag characters in emoji sequences (was: Re: Proposal for BiDi in terminal emulators) Message-ID: <027d01d4bb39$018c5110$04a4f330$@ewellic.org> Philippe Verdy wrote: > Actually not all U+E0020 through U+E007E are "un-deprecated" for this > use. Characters in Unicode are not "deprecated" for some purposes and not for others. "Deprecated" is a clearly defined property in Unicode. The only reference that matters here is in PropList.txt:

E0000         ; Other_Default_Ignorable_Code_Point # Cn
E0001         ; Deprecated # Cf       LANGUAGE TAG
E0002..E001F  ; Other_Default_Ignorable_Code_Point # Cn  [30] ..
E0020..E007F  ; Other_Grapheme_Extend # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Other_Default_Ignorable_Code_Point # Cn [128] ..

Note carefully that the code point marked "Deprecated" is deprecated, and the others listed here are not. (My earlier post saying that U+E007F was still deprecated was incorrect, as Andrew noted.) > For now emoji flags only use: > - U+E0041 through U+E005A (mapping to ASCII letters A through Z used > in 2-letter ISO3166-1 codes). These are usable in pairs, without > requiring any modifier (and only for ISO3166-1 registered codes).
Section C.1 of UTS #51 says otherwise: tag_base U+1F3F4 BLACK FLAG tag_spec (U+E0030 TAG DIGIT ZERO .. U+E0039 TAG DIGIT NINE, U+E0061 TAG LATIN SMALL LETTER A .. U+E007A TAG LATIN SMALL LETTER Z)+ Emoji flags use lowercase tag letters, not uppercase, and may also use digits. The digits are for CLDR subdivision IDs containing ISO 3166-2 code elements that happen to be numeric, and there are plenty of these. For example, "fr75" is the subdivision ID for Paris. Almost all ISO 3166-2 code elements in France are numeric. > - I think that U+0030 through U+E0039 (mapping to ASCII digits 0 > through 9) are reserved for ISO3166 extensions, started with only the > 3 "countries" added in the United Kingdom ("ENENG", "ENSCO" and > "ENWLS"), with possible pending additions for other ISO3166-2, but not > mapping any dash separator). There is no top-level country "EN", and if there were, I doubt Scotland and Wales would be enthusiastic to be considered part of it. In any case, "gbeng" and "gbsco" and "gbwls" are merely the only subdivision IDs that are designated "RGI," or "recommended for general interchange," in CLDR. Any other subdivision ID can be used in a flag tag sequence, although the lack of RGI designation may cause vendors to think the sequence is "recommended against" and not support it in fonts. As shown above, tag digits are not reserved for "ISO 3166 extensions" (possibly implying ISO 3166-1), but are already usable for ISO 3166-2 code elements. > These tags are used as modifiers in sequences starting by a leading > U+1F3F4 > > (WAVING BLACK FLAG) emoji. This is true. (Note the lowercase tag letters.) > - U+E007F (CANCEL TAG) is already used too for the regional extensions > as a mandatory terminator, as seen in the three British countries. This is true. > It is not used for country flags made of 2-letter emoji codes without > any leading flag emoji. 
This is true, but not particularly relevant, as these use Regional Indicator Symbols and have nothing to do with the "emoji codes" discussed elsewhere. > And the proposal discussed here to use U+E003C, mapped to the ASCII > "<" LOWER THAN LESS-THAN SIGN > as a leading tag sequence for reencoding HTML tags in sequences > terminated by U+E003E ">" (and containing HTML element names using > lowercase letter tags, Only "b", "i", "u", and "s" by definition. > possibly digit tags in these names, No. > and "/" for HTML tags terminator, possibly also U+E0020 SPACE TAG for > separating HTML attributes, U+003D "=" for attribute values, U+E0022 > (') or U+E0027 (") around attribute values, but a problem if the > mapped element names or attributes contain non-ASCII characters...) None of these are part of Andrew's mechanism. It's just b, i, u, and s. > is not standard Neither Andrew nor anyone else claimed it was. > (it's just an experiment in one font), It applies to any TrueType font, because the rendering engine can apply these four styles (in any combination) to any TrueType font. > and would in fact not be compatible with the existing specification > for tags. Good thing nobody claimed they were. > So only E+E0020 through U+E0040, and U+E005B through U+E007E remain > deprecated. Da capo. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat Feb 2 14:54:43 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sat, 2 Feb 2019 13:54:43 -0700 Subject: Proposal for BiDi in terminal emulators Message-ID: <027e01d4bb39$8686acd0$93940670$@ewellic.org> Richard Wordingham wrote: > Unicode may not deprecate the tag characters, but the characters of > Plane 14 are widely deplored, despised or abhorred. That is why I > think of it as the deprecated plane. Think of it as the deplored plane, then, or the despised plane or the abhorred plane or the Plane That Shall Not Be Mentioned. "Deprecated" is a term of art in Unicode. 
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat Feb 2 14:57:01 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 2 Feb 2019 20:57:01 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> Message-ID: <20190202205701.0b0a332d@JRWUBU2> On Sat, 2 Feb 2019 12:54:16 +0100 Egmont Koblinger via Unicode wrote: > Hi Richard, > > > > Are they okay to be present in visual order (the terminal's > > > explicit mode, what we're discussing now) too? > > > > Where do you define the order for explicit mode? > > In explicit mode, the application (Emacs, Vim, whatever) reorders the > characters, and passes visual order (left to right) to the terminal > emulator. The terminal emulator preserves this visual order, doesn't > reshuffle anything. Seriously, you need to give a definition of 'visual order' for this context. Not everyone shares your chiralist view. > How to handle ZW(N)J in visual order? What's the desired way? Is it > specified anywhere? As far as I know, they specify the relation > between two adjacent characters of the logical order, which might not > even become adjacent in the visual. Should they always "stick" to the > preceding character, for example? > The Unicode BiDi algorithm doesn't seem to make a difference between > base letters and combining accents for reordering. So, given in an RTL > text a base letter + a combining accent, the BiDi algorithm gives the > visual LTR order of the combining accent first (on the left), followed > by the base letter. This order is not okay for terminal emulators. > Combining accents have to be reordered in the output of the Unicode > BiDi algorithm, so that they come after the base letter even in the > visual LTR order. This is e.g. what FriBidi does by default, due to > the REORDER_NSM flag. 
> Presumably it doesn't just reorder non-spacing combining accents, but > also ZW(N)J and alike symbols too, which already smells pretty > problematic, doesn't it? Or is this what you need there, too? Even for logically ordered text, the positioning of the joiners is not spelt out. For example, I may have the sequence , and want to specify the ligating behavior of NA. I would chose , but this wouldn't let me choose between it ligating with NA or with TA. What happens when one selects text from the display? I think this may affect the choice of text representation for the cells. For storing an explicit string in unnatural order free of bidi controls, I would start with the equivalent implicit mode string, reverse it, and pass that. I believe the cell contents would then need to be reversed again for rendering. A good test case would be ; the ZWJ ligates the points, not base consonants. > > There may be complications in ensuring that > > gets > > stored as the content of a single cell. > > How should the terminal emulator know which cell (the previous or the > subsequent) do these two s belong to? I think this has to depend on convention. One scheme that might work is, storing the contents in logical order: => ZWJ and ZWJ ZWJ => ZWJ and ZWJ ZWNJ => and ZWJ ZWNJ => and ZWJ ZWNJ ZWJ => ZWJ and It may be better to have left and right conection bits in the cell attributes instead of characters, and restore ZWJ and ZWNJ when the text is cut and pasted from the terminal. Note that storing presentation forms in the terminal would, nowadays, normally cause cut and paste to obtain an unfaithful copy of the original text. > > > Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined > > > above. > > > > Example, please. > > Cropped strings, cropped strings that are adjacent to each other, and > faulty shaping could kick in there. > > Two fields on the UI. 
One in columns 36-40 with cyan background, > aiming to show ABCDEF, but due to limited room, can only show ABCDE > (let's say it's scrolled horizontally this way). Another in columns > 41-45 with yellow background, aiming to show UVWXYZ, but due to > limited space only VWXYZ is shown (it's scrolled horizontally like > this). > > What the terminal emulator sees is a continuous text of ABCDEVWXYZ. > What the application wants to have is to get E shaped as if there was > an F on its right, and get V shaped as if there was an U on its left. Task: So the text it's to show is parts of FEDCBA and ZYXWVU. They are not continuous with any other text in the terminal. The display command will not affect anything but columns 36 to 45. Assumptions: FEDCBA and ZYXWVU are each parts of right-to-left runs. Solution: The implicit mode text would be ZYXWVEDCBA (This assumes that Z, V, E and A could otherwise join with the contents of other cells.) So send left-to-right text: ABCDEVWXYZ > Once you address this problem, I'm not sure ZW(N)J are still > required/desireable, rather than applying this more generic solution > there as well. > > > At present, VTE positions LTR Indic preceding spacing combining > > marks after the consonant. I though your draft scheme corrected > > this very local bidi issue, which is so local that the bidi > > algorithm ignores it. > > Indic spacing combining marks are handled incorrectly by VTE and are > being addressed in bug 584160 which I've already linked. This > particular issue I don't consider BiDi at all. It's something totally > different. The spacing accent can be to the right, somewhat on top of > and somewhat to the right, on top of, somewhat to the left and > somewhat on top of, or fully on the left. It's not binary left or > right. Proper rendering should be done by font, and not at all by the > BiDi of the terminal. The terminal is unaware of how much the base > glyph is shifted to the right and the accent to its left. 
All that the > terminal needs to do (and VTE gets it wrong now) is to pass these two > into whichever font rendering engine in one single step. How many cells do consonant plus combining mark get between them? > > So ???? > KHMER LETTER RO, U+17C8 KHMER SIGN > _preah_ 'prefix denoting > > repect for gods, kings, etc.' will be three cells = > > <(COENG, RA), PO, YUUKALEAPINTU> and cause no confusion? Or will > > the cells be ? > > First it's a base character followed by a non-spacing mark. As in most > terminal emulators (and now we're absolutely not talking about my BiDi > proposal) they are stored in the same cell. The first cell contains > (PO, COENG). > The next two are a base character followed by a spacing mark. In VTE > 584160 I outline two possible approaches, but the one I'm in favor of, > is that the row's second cell contains RO and the third cell contains > YUUKALEAPINTU, which two are combined together properly when the > logical contains get displayed. Another possibility which I'm > pondering about is whether the emulation layer should combine them, > that is, have the second cell store the "first half of (RO, YUUKA)" > and the third cell store the "second half of (RO, YUUKA)". > > Does this make any sense? A visible U+17D2 has no r?le in the Khmer writing system. On computers, it is a warning that the input of a subscript consonant is only half done. There are three units of the writing system in that word - KHMER LETTER PO, KHMER CONSONANT SIGN COENG RO*, and KHMER SIGN YUUKALEAPINTU. *a named sequence > If not, could you please explain what and > why is the desired behavior? Why: ???? is the rendering, What: (a) Cell-by-cell rendering: with dotted circles removed. or (b) Cell-by-cell rendering: with dotted circles removed. A better scheme would be to render the three or two cells together using a (sensu lato) monospaced font and display the result for the cells. 
> Anyway, here we're talking about something that's totally independent > from my BiDi work. It's also something that should be standardized > across terminals, sure, but maybe not right now :) It relates to the insistence that the number of cells assigned to a character shall not depend on its context. With the two-cell solution, LETTER RO gets no cells - it is stored in the cell claimed by LETTER PO. Richard. From unicode at unicode.org Sat Feb 2 15:49:40 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 2 Feb 2019 21:49:40 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <8736p69d4h.fsf@turtle-trading.net> References: <8736p69d4h.fsf@turtle-trading.net> Message-ID: <20190202214940.17a05b8d@JRWUBU2> On Sat, 02 Feb 2019 20:58:06 +0100 Benjamin Riefenstahl via Unicode wrote: > Hi Egmont, hi all, > > > This is a interesting discussion here. If only because I would have > thought that there is only minimal interest by the actual target > audience in supporting these scripts in a terminal, given the severe > limitations of that environment. Eli will probably tell me I'm behind the times, but there are a few places where a Gnome-terminal is better than an Emacs GUI window. One is colour highlighting of text found by grep. Another is that screen overwriting doesn't work in an Emacs window. My main interest in this, though, is in improving the general run of Indic terminal cell editors. If we can get Gnome-terminal working for Kharoshthi, things should improve for LTR Indic. Even working on the false assumption that Indic scripts are like Devanagari would be an improvement, despite my comments about Khmer. > Presentation forms: Termshape uses the Arabic presentation forms > available and so it is somewhat limited as mentioned by Eli. Given > that we need to keep the implementation simple anyway, I am not sure > that significantly more is really needed, at least given what Emacs > provides already. 
Additional character forms could be added, where > the Unicode repertoire is not sufficient. This could use PUA > characters or other means like terminal control sequences. In both > cases a common understanding would be needed between the terminal (or > the font used by it) and the application, outside of Unicode. You do not need PUA. For U+0756 ARABIC LETTER BEH WITH SMALL V, we can form: Initial form: 200C 0756 200D Medial form: 200D 0756 200D Final form: 200D 0756 200C Isolated form: 200C 0756 200C The tricky bit is to get the terminal to accept them as cell contents. > A real problem is a combination of diacritics and ligatures. Any > diacritic applies to only one character in the ligature, and between > the application and the terminal it is currently not possible to > determine which one. This is one area where an implementation in the > terminal would clearly have the advantage. But a terminal control > sequence could also help. IMO we are talking about a luxury problem > here, though. Do we want to set as our first goal showing complete > quranic verses in all their glory, or are we satisfied with everyday > Arabic like say the website of a modern Arabic newspaper? Just get Kharoshthi working :-) Some of the Arabic 'mark-up' characters might be tricky. Richard. From unicode at unicode.org Sat Feb 2 16:02:10 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 2 Feb 2019 23:02:10 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190202205701.0b0a332d@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> Message-ID: Hi Richard, On Sat, Feb 2, 2019 at 9:57 PM Richard Wordingham wrote: > Seriously, you need to give a definition of 'visual order' for this > context. Not everyone shares your chiralist view. 
When I look at the Unicode BiDi algorithm, or go to an online demo at https://unicode.org/cldr/utility/bidic.jsp, or look at the FriBidi API etc., their very basic functionality is that I pass the logical order (as the string is expected to be stored in text files etc.), and the result of the algorithm is the visual order. On top of this, I make the clarification that combining marks need to be reordered to be sent out to the terminal emulator _after_ their base letter, because that's how terminal emulators work. The BiDi problem area can only be reasonably addressed in the display layer, by leaving the emulation layer pretty much unchanged. I find it unreasonable to introduce a new mode where the combining accents are sent to the terminal emulator _before_ their base letter. (On an offtopic note, I wish that was the only mode in Unicode, it would simplify a couple of things in the handling of streams. But this ship has sailed decades ago.) This reordering for the combining accents to come after (that is: to the right) of the base letter in the LTR visual order is what e.g. FriBidi does by default, due to the REORDER_NSM flag being set by default. Essentially, the "explicit mode" that my specification introduces is the exact same behavior that most terminal emulators do now, and the one that e.g. Emacs requires. They lay out the codepoints they receive, from left to right. Nothing is going to change there. What I add is another mode (the technically less problematic "implicit" mode where the terminal displays the contents just as any BiDi-aware graphical text editor, browser etc. would do) for the sake of "cat"-like simple utilities, while being unsuitable for Emacs and friends. My work also specifies how/when exactly to toggle back and forth between these two modes. What else do I need to further specify in the concept of "visual order"? > A visible U+17D2 has no rôle in the Khmer writing system.
On > computers, it is a warning that the input of a subscript consonant is > only half done. There are three units of the writing system in that > word - KHMER LETTER PO, KHMER CONSONANT SIGN COENG RO*, and KHMER SIGN > YUUKALEAPINTU. > [and I could quote a whole lot more] Richard, you are obviously magnitudes more savvy in shaping and stuff than me, and I can't quickly pick up your knowledge to properly answer to all the issues you mentioned. What you probably still haven't realized is that I aimed to address a much lower level issue than the ones you keep bringing up. Currently, no matter what terminal emulator you pick, as soon as you start doing BiDi (vim, emacs, cat, echo...), you end up with words being written backwards. I mean, maybe they show up correctly with emacs, but they show up incorrectly with vim and cat. Then you switch to a different emulator, or toggle a setting, and suddenly vim and cat will be okay, and emacs won't. This is bad. This is the low level issue I'm trying to address, to make sure that letters of words are always shown in the correct order. There's no way you could do shaping underneath this level, it makes no sense to talk about shaping, zero-width (non)joining, special Khmer symbols and whatnot on reversed words, right? The order of the letters need to be fixed first, which is what I'm doing, and then all the bells and whistles needed for shaping might come on top of this. Right now I'm doing this BiDi work all voluntarily. As much as I'd love to solve all the problems of the world, I don't have capacity for that. As for shaping, chances are that I'm not going to get there, unless someone offers a decent paid job :P. What I'm looking for right now is feedback on whether the low-level BiDi work makes sense, and whether it really creates proper grounds for building shaping etc. on top of it one day. Hope this clarifies a lot. And again, thanks for all your precious input, but we've heavily diverged from the scope of my work. 
cheers, egmont From unicode at unicode.org Sat Feb 2 16:15:02 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 2 Feb 2019 23:15:02 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190202214940.17a05b8d@JRWUBU2> References: <8736p69d4h.fsf@turtle-trading.net> <20190202214940.17a05b8d@JRWUBU2> Message-ID: Hi Richard, > My main interest in this, though, is in improving the general run of > Indic terminal cell editors. If we can get Gnome-terminal working for > Kharoshthi, things should improve for LTR Indic. Even working on the > false assumption that Indic scripts are like Devanagari would be an > improvement, despite my comments about Khmer. So, as for concrete bugs, there's the aforementioned VTE bug 584160. You might want to give the pending patches a try, or (to keep the relevant discussion in one place) comment over there about your desired priorities etc. We've also set up a "Terminal WG" on freedesktop (https://gitlab.freedesktop.org/terminal-wg), a place intended for specifications. If you/we feel like certain bits around Devanagari/Khmer/etc. handling need a proper specification before we could jump to the implementation, probably that would be the best platform to discuss that. Reason being that I don't know when I'd be able to address them, if ever, but there are multiple terminal emulator developers waiting there for such challenges. Also, IMHO a bugtracker is a better forum than a mailing list if parties can't all immediately work on the problem :) I'm definitely aiming to fix the basic Devanagari rendering (that is: spacing marks), for this autumn's VTE release. Maybe even for this spring's. I probably won't do more (like Virama), they'll have to wait for the HarfBuzz port.
cheers, egmont From unicode at unicode.org Sat Feb 2 17:34:54 2019 From: unicode at unicode.org (Benjamin Riefenstahl via Unicode) Date: Sun, 03 Feb 2019 00:34:54 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190202214940.17a05b8d@JRWUBU2> (Richard Wordingham via Unicode's message of "Sat, 2 Feb 2019 21:49:40 +0000") References: <8736p69d4h.fsf@turtle-trading.net> <20190202214940.17a05b8d@JRWUBU2> Message-ID: <87mund9335.fsf@turtle-trading.net> Hi Richard, > Benjamin Riefenstahl wrote: >> the severe limitations of that environment. Richard Wordingham writes: > Eli will probably tell me I'm behind the times, but there are a few > places where a Gnome-terminal is better than an Emacs GUI window. One > is colour highlighting of text found by grep. Another is that screen > overwriting doesn't work in an Emacs window. I have not followed all of this thread, but is that on-topic? Anyway I did not mean to talk about Emacs GUI windows, they are a completely different animal from terminal windows in my mind. Where Emacs GUI windows lack features in their interaction with other programs, people who care about that are implementing those features. There is no theory or research necessary, beyond understanding the existing codebase. >> Additional character forms could be added, where the Unicode >> repertoire is not sufficient. This could use PUA characters > You do not need PUA. For U+0756 ARABIC LETTER BEH WITH SMALL V, we > can form: > > Initial form: 200C 0756 200D > Medial form: 200D 0756 200D > Final form: 200D 0756 200C > Isolated form: 200C 0756 200C > > The tricky bit is to get the terminal to accept them as cell contents. If you want to implement in the terminal that it should interpret these sequences, you can just as well implement shaping as a whole, i.e. interpret any sequence that needs shaping. There is no reason for control characters here, I think.
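[Editorial illustration, not part of the original message.] The four sequences quoted above follow a simple pattern: U+200D ZERO WIDTH JOINER on a side where the letter should join, U+200C ZERO WIDTH NON-JOINER where it should not. A minimal Python sketch of that pattern follows; the function name positional_form is hypothetical, and whether any terminal accepts such a sequence as the content of a single cell is exactly the open question raised in the thread.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER: no joining on this side
ZWJ = "\u200D"   # ZERO WIDTH JOINER: request joining on this side

# (character placed before the letter, character placed after it), in logical order
WRAPPERS = {
    "initial": (ZWNJ, ZWJ),
    "medial": (ZWJ, ZWJ),
    "final": (ZWJ, ZWNJ),
    "isolated": (ZWNJ, ZWNJ),
}

def positional_form(letter: str, form: str) -> str:
    """Wrap a joining letter so that a shaping engine selects the given form."""
    before, after = WRAPPERS[form]
    return before + letter + after

BEH_WITH_SMALL_V = "\u0756"
for form in WRAPPERS:
    seq = positional_form(BEH_WITH_SMALL_V, form)
    # Print the code points in the same notation as the list in the message
    print(f"{form:9}", " ".join(f"{ord(c):04X}" for c in seq))
```

The sketch only builds the strings; it says nothing about how many cells the result should occupy.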
I was looking at it from the standpoint of what works now, sending presentation forms to the terminal, and what then could be a simple means to extend that mechanism to support more shaping variants. PUA characters could work without changes in the terminal emulators themselves. You would only need the font that supports those PUA characters, which is easy if you start from a TrueType font that already supports that script and thus presumably already has that glyph. From my POV that is a very simple technique. benny From unicode at unicode.org Sat Feb 2 19:01:18 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Sun, 03 Feb 2019 02:01:18 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190202151247.0118dec4@JRWUBU2> Message-ID: Den 2019-02-02 16:12, skrev "Richard Wordingham via Unicode" : > On Sat, 02 Feb 2019 14:01:46 +0100 > Kent Karlsson via Unicode wrote: > >> Well, I guess you may need to put some (practical) limit to the number >> of non-spacing marks (like max two above + max one below; overstrikes >> are an edge case). Otherwise one may need to either increase the line >> height (bad idea for a terminal emulator I think) or the marks start >> to visually interfere with text on other lines (even with the hinted >> limits there may be some interference), also a bad idea for a terminal >> emulator. So I'm not so sure that non-spacing marks is a piece of >> cake... (I.e., need to limit them.) > > Doesn't Jerusalem in biblical Hebrew sometime have 3 marks below the > lamedh? The depth then is the maximum depth, not the sum of the > depths. Do you want to view/edit such texts on a terminal emulator? (Rather than a GUI window.) > Tai Lue has 'mai sat 3 lem' - that's three marks above for a > combination common enough to have a name. Throw in the repetition mark > and that's four marks above if you treat the subscript consonant as a > mark (or code it to comply with the USE's erroneous grammar). I don't question that as such.
But again, do you want to view/edit such texts on a **terminal emulator**? It is just that such things are likely to graphically overflow the "cell" boundaries, unless the cells are disproportionately high (i.e. double or so line spacing). Doesn't really sound like a terminal emulator... I do not think terminal emulators should be used for ALL kinds of text. /Kent K From unicode at unicode.org Sat Feb 2 19:30:26 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 3 Feb 2019 01:30:26 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> Message-ID: <20190203013026.5f12605e@JRWUBU2> On Sat, 2 Feb 2019 23:02:10 +0100 Egmont Koblinger via Unicode wrote: > Hi Richard, > > On Sat, Feb 2, 2019 at 9:57 PM Richard Wordingham > wrote: > > > Seriously, you need to give a definition of 'visual order' for this > > context. Not everyone shares your chiralist view. > > When I look at the Unicode BiDi algorithm, or go to an online demo at > https://unicode.org/cldr/utility/bidic.jsp, or look at the FriBidi API > etc., their very basic functionality is that I pass the logical order > (as the string is expected to be stored in text files etc.), and the > result of the algorithm is the visual order. That first reference doesn't even use the word 'visual'. When I look in Standard Annex 9, 'Unicode Bidirectional Algorithm', I find, 'In combination with the following rule, this means that trailing whitespace will appear at the visual end of the line (in the paragraph direction)'. Paragraph direction, of course, can be left-to-right or right-to-left. Your best hope there is, 'No bidirectional formatting. This implies that the system does not visually interpret characters from right-to-left scripts.' 
It's a shame that that statement is not true; one could build a system using N'ko decimal digits that only visually interpreted characters from right-to-left scripts. > What else do I need to further specify in the concept of "visual > order"? All I am saying is that your proposal should define what it means by visual order. > This is the low level issue I'm trying to address, to make sure that > letters of words are always shown in the correct order. There's no way > you could do shaping underneath this level, it makes no sense to talk > about shaping, zero-width (non)joining, special Khmer symbols and > whatnot on reversed words, right? > The order of the letters need to be > fixed first, which is what I'm doing, and then all the bells and > whistles needed for shaping might come on top of this. Shaping for RTL scripts happens on strings stored in logical order. These are then laid out right to left, though the dominant usage of the term 'advance width' for right-to-left glyph sequences feels perversely different from the use for left to right glyph sequences. Passing text in the form of characters in left-to-right order is an annoying distraction, presumably forced on you by the attempt to maximise compatibility with existing systems. Casting text into grids of 'characters' requires consideration of all types of writing elements. The division into panes is an awkward complication; panes in the application not shared with the terminal is even worse for shaping. Richard. 
From unicode at unicode.org Sat Feb 2 20:02:13 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sun, 3 Feb 2019 03:02:13 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190203013026.5f12605e@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <20190203013026.5f12605e@JRWUBU2> Message-ID: Hi Richard, On Sun, Feb 3, 2019 at 2:32 AM Richard Wordingham via Unicode wrote: > That first reference doesn't even use the word 'visual'. The Unicode BiDi algorithm does speak about "visual positions for display", "reordering for display" etc. > All I am saying is that your proposal should define what it means by > visual order. Are you nitpicking on me not giving a precise definition on the otherwise IMO freaking obvious "visual order", or am I missing something fundamental? > Shaping for RTL scripts happens on strings stored in logical order. That's what I recommend in my current v0.1, which was vetoed by you guys, claiming that the terminal emulator should do it even in cases when it's only aware of the visual order. > Passing text in the form of characters in left-to-right order is an > annoying distraction, presumably forced on you by the attempt to > maximise compatibility with existing systems. Nope; passing text in visual order(*) is a technical necessity for Emacs (as Eli confirmed it) and all other fullscreen apps (text editors and such), and I provide a detailed proof of that in my proposal. It's literally impossible to perform visual cropping on a string (required by practically all fullscreen text editors), visual concatenation of strings (e.g. a line of tmux which has two panes next to each other), and at the same time preserve the logical order that's passed on. You just can't define a logical order after visual operations.
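[Editorial illustration, not part of the original message.] The clash between logical truncation and visual cropping can be seen in a deliberately simplified model: a single right-to-left run with no combining marks, where the visual order is just the logical string reversed. This is an assumption made for illustration only, not the full Unicode BiDi algorithm, and the helper name to_visual is hypothetical.

```python
def to_visual(logical: str) -> str:
    """Toy model: visual order of a pure RTL run is the reversed logical string."""
    return logical[::-1]

logical = "\u05D0\u05D1\u05D2\u05D3\u05D4"  # alef bet gimel dalet he, in logical order
width = 3                                    # columns available on screen

# What a fullscreen app needs: keep the leftmost `width` cells of the displayed line.
cropped_visual = to_visual(logical)[:width]

# What a logical-order prefix gives once the terminal reorders it for display.
truncated_logical = to_visual(logical[:width])

# The crop shows the END of the logical string, the prefix shows its beginning.
print(cropped_visual == to_visual(logical[-width:]))  # True
print(cropped_visual == truncated_logical)            # False
```

In this model the visually cropped cells correspond to a logical suffix, so an application that crops visually cannot also hand the terminal an initial fragment of the logical string.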
(*) To be pedantic, they could pass the text in whatever order they want to, with random cursor movements in between. The point is that the terminal emulator won't reshuffle the cells, that is, they should write into column 1 whichever they want to appear at the leftmost position, into column 2 whichever they want to appear in column 2, and so on. And unless the cursor is moved explicitly, the cursor keeps moving forward to higher numbered columns, that is, the terminal expects to receive visual order. > Casting text into grids of 'characters' requires consideration of all > types of writing elements. The division into panes is an awkward > complication; panes in the application not shared with the terminal is > even worse for shaping. I'm really not sure what you're trying to say here. The feeling I get, and I'm happy if you can prove me wrong, is that while you're truly knowledgeable about shaping, you haven't yet understood the very fundamentals why terminals are vastly different from let's say web browsers, which results in the technical necessity of often relying on visual order. There's even a separate section dedicated to explaining this in my spec. If terminals weren't vastly different, BiDi there would've been solved along with the birth of the Unicode BiDi algorithm, I wouldn't have spent months working on this proposal, and we wouldn't be having this discussion right now :) Remember, this whole story is about finding a compromise between what a terminal emulator is, and what BiDi scripts require (incl. shaping). If you want to do BiDi and shaping without compromises, you should get away from terminal emulators (as Kent has also suggested). Having a strict grid of characters is such a compromise. The terminal emulator not being aware of the entire logical string, only the currently onscreen bits (that is, a cropped version of the string), which results in the need for the explicit mode (visual order) is another such compromise. 
cheers, egmont From unicode at unicode.org Sat Feb 2 20:43:06 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 3 Feb 2019 02:43:06 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: References: <20190202151247.0118dec4@JRWUBU2> Message-ID: <20190203024306.44800aba@JRWUBU2> On Sun, 03 Feb 2019 02:01:18 +0100 Kent Karlsson via Unicode wrote: > Den 2019-02-02 16:12, skrev "Richard Wordingham via Unicode" > : > > Doesn't Jerusalem in biblical Hebrew sometime have 3 marks below the > > lamedh? The depth then is the maximum depth, not the sum of the > > depths. > > Do you want to view/edit such texts on a terminal emulator? (Rather > than a GUI window.) > > > Tai Lue has 'mai sat 3 lem' - that's three marks above for a > > combination common enough to have a name. > I don't question that as such. But again, do you want to view/edit > such texts on a **terminal emulator**? Oddly, I feel happier running bash on Gnome-terminal than an emacs shell process. What GUI window Perhaps I'm spoilt by some of the features like colour. Maybe I'd be happier if I could work how to get bash's emacs mode to work when running under emacs. I'd be grepping such material rather than viewing it. Moreover, I may be looking through a lot of files rather than viewing a single one. > It is just that such things are likely to graphically overflow the > "cell" boundaries, unless the cells are disproportionately high (i.e. > double or so line spacing). Doesn't really sound like a terminal > emulator... I do not think terminal emulators should be used for > ALL kinds of text. I don't need fixed-width cells. But otherwise, there are uses for both terminal emulators and teletype emulators. Different scripts (and languages within a script for that matter) merit different cell aspect ratios. So, what do you recommend I run grep from for Hebrew or Tai Lue? Richard. 
From unicode at unicode.org Sun Feb 3 10:03:37 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 03 Feb 2019 18:03:37 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: (message from Egmont Koblinger via Unicode on Sun, 3 Feb 2019 03:02:13 +0100) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <20190203013026.5f12605e@JRWUBU2> Message-ID: <834l9kx3ja.fsf@gnu.org> > Date: Sun, 3 Feb 2019 03:02:13 +0100 > Cc: unicode at unicode.org > From: Egmont Koblinger via Unicode > > > All I am saying is that your proposal should define what it means by > > visual order. > > Are you nitpicking on me not giving a precise definition on the > otherwise IMO freaking obvious "visual order" Most probably. The definition is trivial: the order of characters on display, from left to right. The only possible reason to split hairs here could be when some characters don't appear on display, like control characters. Other than that, there should be no doubt what visual order means. From unicode at unicode.org Sun Feb 3 10:05:49 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 03 Feb 2019 18:05:49 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190202214940.17a05b8d@JRWUBU2> (message from Richard Wordingham via Unicode on Sat, 2 Feb 2019 21:49:40 +0000) References: <8736p69d4h.fsf@turtle-trading.net> <20190202214940.17a05b8d@JRWUBU2> Message-ID: <8336p4x3fm.fsf@gnu.org> > Date: Sat, 2 Feb 2019 21:49:40 +0000 > From: Richard Wordingham via Unicode > > Eli will probably tell me I'm behind the times, but there are a few > places where a Gnome-terminal is better than an Emacs GUI window. One > is colour highlighting of text found by grep. ??? The Emacs 'grep' command also highlights the matches, by interpreting the escape sequences emitted by Grep the program it invokes. 
> Another is that screen overwriting doesn't work in an Emacs window. What is "screen overwriting" in this context? From unicode at unicode.org Sun Feb 3 10:10:15 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 03 Feb 2019 18:10:15 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: (message from Egmont Koblinger via Unicode on Sat, 2 Feb 2019 23:02:10 +0100) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> Message-ID: <831s4ox388.fsf@gnu.org> > Date: Sat, 2 Feb 2019 23:02:10 +0100 > Cc: unicode at unicode.org > From: Egmont Koblinger via Unicode > > On top of this, I make the clarification that combining marks need to > be reordered to be sent out to the terminal emulator _after_ their > base letter That is true in general regarding any text shaping: the shaping engine needs the characters to be submitted in the logical order. When Emacs works on a text-mode terminal, it sends characters to be shaped together, such as base character and its combining marks, in logical order, even when the surrounding text is reordered into visual order. > What I add is another mode (the technically less problematic > "implicit" mode where the terminal displays the contents just as any > BiDi-aware graphical text editor, browser etc. would do) for the > sake of "cat"-like simple utilities I think there are hard problems even for such "simple" utilities, and I will start a separate thread about this. 
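[Editorial illustration, not part of the original message.] The point that a base character and its combining marks stay in logical order even when the surrounding text is reordered can be sketched as follows. The sketch again assumes a single right-to-left run and groups characters by the NSM bidi class only; real grapheme cluster segmentation and the full BiDi algorithm are more involved, and both function names are hypothetical.

```python
import unicodedata

def clusters(text: str) -> list[str]:
    """Group each base character with the non-spacing marks (bidi class NSM) after it."""
    out: list[str] = []
    for ch in text:
        if out and unicodedata.bidirectional(ch) == "NSM":
            out[-1] += ch      # the mark stays attached, after its base
        else:
            out.append(ch)     # a new base character starts a new cluster
    return out

def visual_rtl(logical: str) -> str:
    """Toy model: reverse the clusters of a pure RTL run, not the code points."""
    return "".join(reversed(clusters(logical)))

# Pointed Hebrew "shalom": shin + qamats + shin dot, lamed, vav + holam, final mem
logical = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
visual = visual_rtl(logical)
print(" ".join(f"{ord(c):04X}" for c in visual))
```

Reversing code point by code point instead would put each mark before its base letter, which is the situation both messages above want to avoid.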
From unicode at unicode.org Sun Feb 3 10:13:06 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 03 Feb 2019 18:13:06 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190203013026.5f12605e@JRWUBU2> (message from Richard Wordingham via Unicode on Sun, 3 Feb 2019 01:30:26 +0000) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <20190203013026.5f12605e@JRWUBU2> Message-ID: <83zhrcvoj1.fsf@gnu.org> > Date: Sun, 3 Feb 2019 01:30:26 +0000 > From: Richard Wordingham via Unicode > > Shaping for RTL scripts happens on strings stored in logical order. > These are then laid out right to left, though the dominant usage of > the term 'advance width' for right-to-left glyph sequences feels > perversely different from the use for left to right glyph sequences. > > Passing text in the form of characters in left-to-right order is an > annoying distraction, presumably forced on you by the attempt to > maximise compatibility with existing systems. Actually, you pass the characters to be shaped in logical order, and then display the produced grapheme clusters in visual order. From unicode at unicode.org Sun Feb 3 10:14:53 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 03 Feb 2019 18:14:53 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190203024306.44800aba@JRWUBU2> (message from Richard Wordingham via Unicode on Sun, 3 Feb 2019 02:43:06 +0000) References: <20190202151247.0118dec4@JRWUBU2> <20190203024306.44800aba@JRWUBU2> Message-ID: <83y36wvog2.fsf@gnu.org> > Date: Sun, 3 Feb 2019 02:43:06 +0000 > Cc: Kent Karlsson > From: Richard Wordingham via Unicode > > So, what do you recommend I run grep from for Hebrew or Tai Lue? Inside Emacs, of course: "M-x grep RET" etc. 
From unicode at unicode.org Sun Feb 3 10:35:40 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 03 Feb 2019 18:35:40 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <831s4ox388.fsf@gnu.org> (message from Eli Zaretskii via Unicode on Sun, 03 Feb 2019 18:10:15 +0200) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> Message-ID: <83womgvnhf.fsf@gnu.org> > Date: Sun, 03 Feb 2019 18:10:15 +0200 > Cc: richard.wordingham at ntlworld.com, unicode at unicode.org > From: Eli Zaretskii via Unicode > > I think there are hard problems even for such "simple" utilities, and > I will start a separate thread about this. I think we spent enough time discussing issues of complex script shaping in terminal emulators, something that IMO took us too far afield. The basic problems with bidi reordering of text-mode output start much sooner, and are much more fundamental. I think they should be considered first. The document cited at the beginning of the parent thread states that "simple" text-mode utilities, such as 'echo', 'cat', 'ls' etc. should use the "implicit" mode of bidi reordering, with automatic guessing of the base paragraph direction. I think this already presents non-trivial problems. The fundamental problem here is that most "simple" utilities use hard newlines to present text in some visually plausible format. Even when these utilities just emit text read from files (as opposed to generating the text from the program), you will normally see each line end with a hard newline, because the absolute majority of text files have a hard newline at the end of each line. When bidirectional text is reordered by the terminal emulator, these hard newlines will make each line be a separate paragraph.
And this is a problem, because the result will be completely random, depending on the first strong directional character in each line, and will be visually very unpleasant. Just take the output produced by any utility when invoked with, say, the --help option, and try imagining how this will look when translated into a language that uses RTL script. So I think determination of the paragraph direction even in this simplest case cannot be left to the UBA defaults, and there's a need to use "higher-level" protocols for paragraph direction. IOW, the implicit mode described in the above-mentioned document needs to be augmented by a smarter method of determining the base paragraph direction. (I might have a suggestion for that, if people agree with the above reasoning.) From unicode at unicode.org Sun Feb 3 10:54:25 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sun, 3 Feb 2019 17:54:25 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83womgvnhf.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> Message-ID: Hi Eli, > The document cited at the beginning of the parent thread states that > "simple" text-mode utilities, such as 'echo', 'cat', 'ls' etc. should > use the "implicit" mode of bidi reordering, with automatic guessing of > the base paragraph direction. Not exactly. I take the SCP escape sequence from ECMA TR/53 (and slightly reinterpret it) so that it specifies the paragraph direction, plus introduce a new one that specifies whether autodetection is enabled. I'm arguing, although my reasons are not rock solid, that IMHO the default should be the strict direction as set by SCP, without autodetection. 
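[Editorial illustration, not part of the original message.] The autodetection under debate, applied to each hard-newline-terminated line on its own, amounts roughly to the first-strong-character rule of the Unicode BiDi algorithm. The Python sketch below is a simplification (it ignores directional isolates and embeddings), the function name is hypothetical, and the sample lines are invented rather than taken from any real utility's translated output.

```python
import unicodedata

def first_strong_direction(line: str, default: str = "LTR") -> str:
    """Guess the paragraph direction from the first strong character of a line."""
    for ch in line:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "LTR"
        if bidi in ("R", "AL"):
            return "RTL"
    return default  # no strong character at all, e.g. digits or punctuation only

# Invented --help-style lines mixing option names with Hebrew descriptions
help_lines = [
    "\u05E9\u05D9\u05DE\u05D5\u05E9: zip [-options]",  # starts with a Hebrew word
    "  -h  \u05E2\u05D6\u05E8\u05D4",                  # starts with an ASCII option
    "  -q  \u05E9\u05E7\u05D8",
]
for line in help_lines:
    print(first_strong_direction(line))
```

Adjacent lines of the same message come out with different directions depending only on what each happens to start with, which is the visual randomness described in the preceding message.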
> The fundamental problem here is that most "simple" utilities use hard > newlines to present text in some visually plausible format. Could you please list examples? What I have in mind are "echo", "cat", "grep" and the like, they don't care about the terminal width. If an app cares about the terminal width, how does it care about it? What does it use this information for? To truncate overlong strings, for example? At this very moment I'd argue that such applications need to do BiDi on their own, and thus set the terminal to explicit mode. If an app does any kind of string truncation, it can no longer delegate the task of BiDi to the terminal emulator. I'm also mentioning that you cannot both logically and visually truncate a BiDi string at once. Either you truncate the logical string, which may result in a visual nonsense, or you truncate the visual string, risking that it's not an initial fragment of the data that ends up getting displayed. Along these lines I'm arguing that basic utilities like "cut" shouldn't care about BiDi, the logical behavior there is more important than the visual one. There could, of course, be sophisticated "bidi-cut" and similar utilities at one point which cut the visual string, but they should use the terminal's explicit mode. > Even when > these utilities just emit text read from files (as opposed to > generating the text from the program), you will normally see each line > end with a hard newline, because the absolute majority of text files > have a hard newline and the end of each line. What does a BiDi text file look like, to begin with? Can a heavily BiDi text file be formatted to 72 (or whatever) columns using explicit newlines, keeping BiDi both semantically and visually correct? I truly doubt that. Can you show me such files? > When bidirectional text is reordered by the terminal emulator, these > hard newlines will make each line be a separate paragraph.
And this > is a problem, because the result will be completely random, depending > on the first strong directional character in each line, and will be > visually very unpleasant. Just take the output produced by any > utility when invoked with, say, the --help option, and try imagining > how this will look when translated into a language that uses RTL > script. First, having no autodetection by default but rather an explicit control for the overall direction hopefully mitigates this problem. Second, I outline a possible future extension with a different definition of a "paragraph", maybe something between empty lines, or other kinds of explicit markers. > So I think determination of the paragraph direction even in this > simplest case cannot be left to the UBA defaults, and there's a need > to use "higher-level" protocols for paragraph direction. That higher level protocol is part of my recommendation, part of ECMA TR/53, as the SCP sequence. Does this make sense? cheers, egmont From unicode at unicode.org Sun Feb 3 11:45:06 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 3 Feb 2019 17:45:06 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <83y36wvog2.fsf@gnu.org> References: <20190202151247.0118dec4@JRWUBU2> <20190203024306.44800aba@JRWUBU2> <83y36wvog2.fsf@gnu.org> Message-ID: <20190203174506.608cdb41@JRWUBU2> On Sun, 03 Feb 2019 18:14:53 +0200 Eli Zaretskii via Unicode wrote: > > Date: Sun, 3 Feb 2019 02:43:06 +0000 > > Cc: Kent Karlsson > > From: Richard Wordingham via Unicode > > > > So, what do you recommend I run grep from for Hebrew or Tai Lue? > > Inside Emacs, of course: "M-x grep RET" etc. That assumes you like using bindings for all the commands; I don't. Command recall and having completion options serve me very well. Your suggestion comes unstuck when I attempt to switch between the window's keyboard and the MULE keyboard in the middle of the command. 'M-x' isn't recursive. 
Still, your suggestion should be useful for grepping for ASCII stuff. Richard. From unicode at unicode.org Sun Feb 3 11:50:50 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 03 Feb 2019 19:50:50 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: (message from Egmont Koblinger on Sun, 3 Feb 2019 17:54:25 +0100) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> Message-ID: <83pns8vk05.fsf@gnu.org> > From: Egmont Koblinger > Date: Sun, 3 Feb 2019 17:54:25 +0100 > Cc: unicode at unicode.org > > I'm arguing, although my reasons are not rock solid, that IMHO the > default should be the strict direction as set by SCP, without > autodetection. I think it's unreasonable and impractical to expect 'echo', 'cat', and its ilk to emit bidi controls (or any other controls) to force paragraph direction. For starters, they won't know what direction to force, because they don't understand the text they are processing. No, this simple case must work reasonably well with the application _completely_ oblivious to the bidi aspects. If this can't work reasonably well, I submit that the entire concept of having a bidi-aware terminal emulator doesn't "hold water". > > The fundamental problem here is that most "simple" utilities use hard > > newlines to present text in some visually plausible format. > > Could you please list examples? Just redirect any of them to a file, and look at the file with a hex editor. You will see a hard newline character, 0x0A, at the end of each line. > What I have in mind are "echo", "cat", "grep" and alike, they don't > care about the terminal width. Terminal width is not always relevant here, and I didn't mention it. 
However, as long as you allude to that, I think your garden-variety text utility does assume the width of a terminal window is 80 columns, and the messages displayed by these programs are formatted accordingly. > If an app cares about the terminal width, how does it care about it? > What does it use this information for? To truncate overlong strings, > for example? To break long lines at appropriate places, and to emit text that fits on a line in the first place. Just try invoking any such utility with the --help option, and you will see what I mean. I give one example below. > At this very moment I'd argue that such applications need > to do BiDi on their own, and thus set the terminal to explicit mode. > If an app does any kind of string truncation, it can no longer > delegate the task of BiDi to the terminal emulator. I'm afraid this won't fly, because most "simple" utilities do it that way. If you insist on them doing their own bidi, you've just lost your cause. No upstream developer will be interested in adapting their utilities to a terminal emulator that requires them to do their own bidi. > I'm also mentioning that you cannot both logically and visually > truncate a BiDi string at once. I don't understand why you talk about truncation; I didn't. Here, look at this random example:

Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
Zip 3.0 (July 5th 2008). Usage:
zip [-options] [-b path] [-t mmddyyyy] [-n suffixes] [zipfile list] [-xi list]
  The default action is to add or replace zipfile entries from list, which
  can include the special name - to compress standard input.
  If zipfile and list are omitted, zip compresses stdin to stdout.
  -f   freshen: only changed files  -u   update: only changed or new files
  -d   delete entries in zipfile    -m   move into zipfile (delete OS files)
  -r   recurse into directories     -j   junk (don't record) directory names
  -0   store only                   -l   convert LF to CR LF (-ll CR LF to LF)
  -1   compress faster              -9   compress better
  -q   quiet operation              -v   verbose operation/print version info
  -c   add one-line comments        -z   add zipfile comment
  -@   read names from stdin        -o   make zipfile as old as latest entry
  -x   exclude the following names  -i   include only the following names
  -F   fix zipfile (-FF try harder) -D   do not add directory entries
  -A   adjust self-extracting exe   -J   junk zipfile prefix (unzipsfx)
  -T   test zipfile integrity       -X   eXclude eXtra file attributes
  -!   use privileges (if granted) to obtain all aspects of WinNT security
  -$   include volume label         -S   include system and hidden files
  -e   encrypt                      -n   don't compress these suffixes
  -h2  show more help

Do you see how this is carefully formatted to avoid overflowing an 80-column line of a typical terminal? Now suppose this is translated into an RTL language, which causes the Copyright line to start with a strong R letter (because "Copyright" is translated). You will see the first line flushed to the right margin, then the next line flushed to the left margin (because it's a separate paragraph, and starts with a strong L letter). Then the line which says "The default action..." will again start at the right. And so on and so forth -- the result is extremely ugly. > > Even when > > these utilities just emit text read from files (as opposed to > > generating the text from the program), you will normally see each line > > end with a hard newline, because the absolute majority of text files > > have a hard newline at the end of each line. > > How does a BiDi text file look like, to begin with? Exactly like any other text file, just with some of the characters belonging to RTL scripts.
> Can a heavily BiDi text file be formatted to 72 (or whatever) > columns using explicit newlines, keeping BiDi both semantically and > visually correct? Of course. > I truly doubt that. Why is that? > Can you show me such files? See, for example, the Hebrew tutorial in Emacs, TUTORIAL.he. Please note that Emacs bumps into this problem all the time, because almost always text buffers in Emacs use hard newlines, whether their text came from files or was just typed by the user. E.g., most plain-text email messages use hard newlines, and the Emacs built-in MUAs produce such plain-text messages; using "flowed" text is much more rare. Emacs has an "Auto-Fill mode" which automatically inserts a hard newline and starts a new line when the current line exceeds a given column number, and Emacs users typing text usually enable this mode (as did I when typing this message). So how to determine base paragraph direction in a sane way was about the first problem I needed to solve when I made Emacs support bidi. > First, having no autodetection by default but rather an explicit > control for the overall direction hopefully mitigates this problem. It doesn't, IMO, because it requires the applications to understand enough to emit the correct control. Most simple text-processing utilities are not that smart. > Second, I outline a possible future extension with a different > definition of a "paragraph", maybe something between empty lines, or > other kinds of explicit markers. I think this kind of extension cannot be deferred to some "future", it must be there in the very first version you produce. Otherwise, the result will be so unpleasant that people will be averted. > > So I think determination of the paragraph direction even in this > > simplest case cannot be left to the UBA defaults, and there's a need > > to use "higher-level" protocols for paragraph direction. > > That higher level protocol is part of my recommendation, part of ECMA > TR/53, as the SCP sequence. 
It must be the default, a necessary part of any compliant emulator. That's my opinion based on my experience, anyway. From unicode at unicode.org Sun Feb 3 12:07:51 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 03 Feb 2019 20:07:51 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190203174506.608cdb41@JRWUBU2> (message from Richard Wordingham via Unicode on Sun, 3 Feb 2019 17:45:06 +0000) References: <20190202151247.0118dec4@JRWUBU2> <20190203024306.44800aba@JRWUBU2> <83y36wvog2.fsf@gnu.org> <20190203174506.608cdb41@JRWUBU2> Message-ID: <83o97svj7s.fsf@gnu.org> > Date: Sun, 3 Feb 2019 17:45:06 +0000 > From: Richard Wordingham via Unicode > > > > So, what do you recommend I run grep from for Hebrew or Tai Lue? > > > > Inside Emacs, of course: "M-x grep RET" etc. > > That assumes you like using bindings for all the commands; I don't. What bindings? "M-x grep" just shows the Grep hits in a separate window, you don't need to do anything except reading them. The advantage is that you get bidi reordering and text shaping for free, something you won't get from most terminals. > Command recall and having completion options serve me very well. Your > suggestion comes unstuck when I attempt to switch between the window's > keyboard and the MULE keyboard in the middle of the command. 'M-x' > isn't recursive. This isn't an Emacs forum, so I will leave it at that; but you are wrong on all counts. 
From unicode at unicode.org Sun Feb 3 14:35:18 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 3 Feb 2019 20:35:18 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <8336p4x3fm.fsf@gnu.org> References: <8736p69d4h.fsf@turtle-trading.net> <20190202214940.17a05b8d@JRWUBU2> <8336p4x3fm.fsf@gnu.org> Message-ID: <20190203203518.611fd3a8@JRWUBU2> On Sun, 03 Feb 2019 18:05:49 +0200 Eli Zaretskii via Unicode wrote: > > Date: Sat, 2 Feb 2019 21:49:40 +0000 > > From: Richard Wordingham via Unicode > > > > Eli will probably tell me I'm behind the times, but there are a few > > places where a Gnome-terminal is better than an Emacs GUI window. > > One is colour highlighting of text found by grep. > > ??? The Emacs 'grep' command also highlights the matches, by > interpreting the escape sequences emitted by Grep the program it > invokes. > > > Another is that screen overwriting doesn't work in an Emacs > > window. > > What is "screen overwriting" in this context? When instead of adding lines to the bottom, new lines are added on top of and replace existing lines. I prefer the scrollable terminal behaviour to the teletype behaviour of Emacs when running the Linux(?) monitor program 'top', but being a fuddy duddy I prefer the teletype behaviour of Emacs for 'man'. From an error message from 'info', it seems that the Emacs buffer is classified as a 'dumb' terminal. Richard. 
From unicode at unicode.org Sun Feb 3 14:50:03 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 3 Feb 2019 20:50:03 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <83o97svj7s.fsf@gnu.org> References: <20190202151247.0118dec4@JRWUBU2> <20190203024306.44800aba@JRWUBU2> <83y36wvog2.fsf@gnu.org> <20190203174506.608cdb41@JRWUBU2> <83o97svj7s.fsf@gnu.org> Message-ID: <20190203205003.7e34f0fd@JRWUBU2> On Sun, 03 Feb 2019 20:07:51 +0200 Eli Zaretskii via Unicode wrote: > > Date: Sun, 3 Feb 2019 17:45:06 +0000 > > From: Richard Wordingham via Unicode > > > > > > So, what do you recommend I run grep from for Hebrew or Tai > > > > Lue? > > > > > > Inside Emacs, of course: "M-x grep RET" etc. > > > > That assumes you like using bindings for all the commands; I > > don't. > > What bindings? "M-x grep" just shows the Grep hits in a separate > window, you don't need to do anything except reading them. > > The advantage is that you get bidi reordering and text shaping for > free, something you won't get from most terminals. Which is why I try to remember to issue the emacs command 'M-x shell' command and issue grep commands from the buffer created thereby. The point I'm making is that this emacs command hasn't made terminal emulators obsolete, even though it also does graphics. Richard. From unicode at unicode.org Sun Feb 3 17:36:23 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Mon, 4 Feb 2019 00:36:23 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83pns8vk05.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> Message-ID: Hi Eli, (I'm responding in multiple emails.) 
The Unicode BiDi algorithm states that it operates on paragraphs of text, and leaves it up to a higher protocol to define what a paragraph exactly is. What's the definition of "paragraph" in the context of plain text files? I don't think there's a single well-established practice. In some particular text files, every explicit newline character starts a new paragraph. In some (e.g. COPYING.GPL and friends), an empty line (that is: two consecutive newline characters) separates two paragraphs. In some, e.g. in Emacs's TUTORIAL.he, or markdown files, it's way more complicated, probably there isn't a well-defined grammar for how exactly bullet list entries and the like should become new paragraphs. In the output of "dpkg -s packagename" consecutive lines indented by 1 space (except for those where there's only a single dot after the space) form the human-perceived paragraphs. There are surely several other syntaxes out there. If the producer of a text file uses a different definition than the viewer software, bugs can arise. I think this should be intuitively obvious, but just in case, let me give a concrete example. In this example I'll assume LTR paragraph direction set up by some external means; with autodetected paragraph direction it's much easier to come up with such breakages. I wish to store and deliver the following text, as it's laid out here in logical order. That is, the order as the bytes appear in the text file, as I typed them from the keyboard, is laid out here strictly from left to right, with uppercase standing for RTL letters, and no mirroring:

lorem ipsum ABC <[ DEF foobar

The visual representation, what I expect to see in any decent viewer software, is, according to the BiDi algorithm, this:

lorem ipsum FED ]> CBA foobar

The visual representation, in a narrower viewport, might wrap for example like this:

lorem ipsum CBA
FED ]> foobar

which is still correct, given that logical "ABC <[ DEF" is a single RTL run.
(This assumes a viewer which, unlike Emacs, follows the Unicode BiDi algorithm for wrapping a paragraph into multiple lines.) Let's assume that I, as the producer of the text file, wish to create a typical README in the spirit of COPYING.GPL and similar text files, with the paragraph definition that two consecutive newline characters (that is: a single empty line) delimit paragraphs; and a single newline is equivalent to a space. Since I'd prefer to keep a margin of 16 characters in the source file (for demo purposes), I can take the liberty of replacing the space after "ABC" by a single newline. (Maybe my text editor does this automatically.) The file's contents, again the logical order laid out from left to right, top to bottom, becomes this:

lorem ipsum ABC
<[ DEF foobar

This file, according to the paragraph definition chosen earlier, is equivalent to the unwrapped version shown before, and thus should convey the same message. If I view this file in a piece of software which uses the same paragraph definition for BiDi purposes, the contents will appear as expected. An example for such a viewer is a markdown converter's (that leaves single newlines as-is, and adds a "<p>" at double newlines) output viewed as an html file in a browser. Here comes the twist. Let's view this latter file with a viewer that uses a _different_ definition for paragraph. Let's view it in Gedit, Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where every newline begins a new paragraph; that's how these viewers define the notion of "paragraph" for the sake of BiDi. The visual layout in these viewers becomes:

lorem ipsum CBA
<[ FED foobar

which is just not correct. Since here BiDi is run on the two lines separately, the initial "<[" is treated as LTR, placed at the wrong location in the wrong order, and the glyphs aren't mirrored. Now, Emacs ships a TUTORIAL.he which, for most of its contents (but not everywhere) seems to treat runs between empty lines as paragraphs, while Emacs itself is a viewer that treats runs between single newlines as paragraphs. That is, Emacs is inconsistent with itself. In case you think I got something wrong with Emacs: Could you please give exact definitions:
- What are the exact units (so-called "paragraphs" by UAX9) that it runs BiDi on when it loads and displays a file?
- What are the exact units (so-called "paragraphs" by UAX9) in TUTORIAL.he on which BiDi needs to be run in order to get the desired readable version?
What most likely happens is that in order to see a difference, you'd need to have more special symbols, or at least a more special constellation of them. Probably TUTORIAL.he is just luckily simple enough that such a difference isn't hit. Another possibility is (and I cannot check because I can't speak Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to get the desired visual one.

-----

Now, back to terminals. The smallest possible viable definition of a "paragraph" in terminal emulators is stuff between one newline and the next one.
It would require a hell of a lot of work, redesigning (overcomplicating) plenty of basics of terminal emulation to be able to come up with smaller units, e.g. cells of a table (a concept that doesn't currently exist in this world); I don't find any such approach feasible at all. This definition of paragraph (stuff between a newline and the next one) is the same as the one of Gedit, Emacs etc. when it comes to displaying BiDi text. Now, it's possible to ponder other, larger units as possible definitions. For certain files, surely the right approach would be to treat parts delimited by empty lines as paragraphs. But how far should we go? Should terminals understand markdown (one of the most terrible grammars I've ever seen) and all its popular flavors? Should they understand Emacs's TUTORIAL.he? Should they understand dpkg's format? What else? There's another conceptual problem here. Most terminal emulators don't understand a single bit of what happens inside them. They don't know where an application's output begins, where it ends. They don't know where the shell prompt is. In fact, they have no idea what a shell prompt is. They only see a single stream of incoming data to process (print printable characters, and obey control instructions). With the paragraph definition of "between a newline and the next one" this is not a problem, everything is doable based on what terminals already know. With any other definition, e.g. if you define paragraphs as "separated by empty lines", still I'm sure you'd need the shell prompt to terminate the previous paragraph, start a new one (the prompt's and command line's), and even below the command line where the next utility's output begins it would also need to start a new paragraph. But we just don't have this information now.
There are extensions used by some terminal emulators, and perhaps they'll get "standardized" and more widely adopted to at least let the terminal emulator know where the shell prompt and command line begin and end. But even if they're adopted by many emulators, there's still a problem: is it going to be the shells (binaries) that emit these themselves, or should the user configure the prompt to contain them? It's quite unlikely that we'll have buy-in from all the popular shells. The prompts are maintained by all the users themselves, with .bashrc or so defining them; this file is copied over from /etc/skel once and then cannot be updated by distributions. Even if it's going to happen, it'll take many, many years until we can safely rely on this information being generally available. For the problem set of having the same paragraph direction for multiple paragraphs (e.g. an entire file, as cat'ed), we're also hit by this limitation. Once the knowledge of where a command's output begins and ends becomes available, we'll be able to do this, for example say that the direction is autodetected on the command's output as one unit, but then BiDi is applied on each line or each emptyline-delimited fragment. We just don't have the necessary information now, and won't have for a looong time. This is why the only reasonable thing I can imagine is to define paragraph as newline-delimited segments, and leave it to future enhancements to introduce other "paragraph" definitions as further options.
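
[Editorial illustration, not part of the original message.] The two paragraph definitions weighed above (newline-delimited versus empty-line-delimited) can be compared with a minimal sketch. It assumes first-strong autodetection simplified from UAX#9 rules P2 and P3, ignoring isolates and explicit embeddings. The function names and the sample text are hypothetical; three Hebrew letters stand in for translated words in a help message.

```python
import re
import unicodedata

# Bidi classes that count as "strong" for paragraph direction (UAX#9 P2/P3).
STRONG = {"L": "LTR", "R": "RTL", "AL": "RTL"}


def first_strong_direction(text, default="LTR"):
    """Return the direction of the first strong character in text.

    Simplification: isolate-delimited content is not skipped, and the
    default is used when no strong character is present.
    """
    for ch in text:
        direction = STRONG.get(unicodedata.bidirectional(ch))
        if direction:
            return direction
    return default


def paragraphs_per_newline(text):
    """Every non-empty line is its own paragraph."""
    return [line for line in text.split("\n") if line]


def paragraphs_per_empty_line(text):
    """Runs of lines separated by empty (or blank) lines are paragraphs."""
    return [p for p in re.split(r"\n[ \t]*\n", text) if p.strip()]


# Hypothetical help output: lines 1 and 3 start with Hebrew letters
# (written as escapes), line 2 starts with a Latin letter.
HEBREW = "\u05d0\u05d1\u05d2"
SAMPLE = (
    HEBREW + " (c) 1990-2008 Info-ZIP\n"
    "Zip 3.0 (July 5th 2008). Usage:\n"
    + HEBREW + " add or replace zipfile entries\n"
)

per_line = [first_strong_direction(p) for p in paragraphs_per_newline(SAMPLE)]
per_block = [first_strong_direction(p) for p in paragraphs_per_empty_line(SAMPLE)]

print(per_line)   # ['RTL', 'LTR', 'RTL']
print(per_block)  # ['RTL']
```

With one paragraph per line the detected direction flips from line to line, which is the alternating alignment described for the translated help text; with empty-line-delimited paragraphs the whole block gets one direction. The sketch says nothing about where a command's output starts or ends, which is the information the message argues terminals currently lack.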
cheers, egmont From unicode at unicode.org Sun Feb 3 18:53:07 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Feb 2019 00:53:07 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <83zhrcvoj1.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <20190203013026.5f12605e@JRWUBU2> <83zhrcvoj1.fsf@gnu.org> Message-ID: <20190204005307.65bbc833@JRWUBU2> On Sun, 03 Feb 2019 18:13:06 +0200 Eli Zaretskii via Unicode wrote: > Actually, you pass the characters to be shaped in logical order, and > then display the produced grapheme clusters in visual order. Some early systems supporting computerised Hebrew script did pass characters in left-to-right order. This works fairly well when the contents of character cells do not interact. Richard. From unicode at unicode.org Sun Feb 3 19:19:21 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Feb 2019 01:19:21 +0000 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83pns8vk05.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> Message-ID: <20190204011921.3411cc77@JRWUBU2> On Sun, 03 Feb 2019 19:50:50 +0200 Eli Zaretskii via Unicode wrote: > Do you see how this is carefully formatted to avoid overflowing an > 80-column line of a typical terminal? Now suppose this is translated > into a RTL language, which causes the Copyright line to start with a > strong R letter (because "Copyright" is translated). 
You will see the > first line flushed to the right margin, then the next line flushed to > the left margin (because it's a separate paragraph, and starts with a > strong L letter). Then the line which says "The default action..." > will again start at the right. And so on and so forth -- the result > is extremely ugly. Depending on the environment. If you look at it in Notepad, all lines will be LTR or all lines will be RTL. Would not a careful translator either ensure that each non-blank line had a strong character and that all first strong characters were (a) L, (b) R or (c) AL? Text in LTR scripts tends not to be so careful. Richard. From unicode at unicode.org Sun Feb 3 19:41:07 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Feb 2019 01:41:07 +0000 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> Message-ID: <20190204014107.378a54b6@JRWUBU2> On Mon, 4 Feb 2019 00:36:23 +0100 Egmont Koblinger via Unicode wrote: > I wish to store and deliver the following text, as it's layed out here > in logical order. That is, the order as the bytes appear in the text > file, as I typed them from the keyboard, is laid out here strictly > from left to right, with uppercase standing for RTL letters, and no > mirroring: > > lorem ipsum ABC <[ DEF foobar > Let's assume that me, as the producer of the text file, wish to create > a typical README in the spirit of COPYING.GPL and similar text files, > with the paragraph definition that two consecutive newline characters > (that is: a single empty line) delimit paragraphs; and a single > newline is equivalent to a space. 
Since I'd prefer to keep a margin of > 16 characters in the source file (for demo purposes), I can take the > liberty of replacing the space after "ABC" by a single newline. (Maybe > my text editor does this automatically.) The file's contents, again > the logical order laid out from left to right, top to bottom, becomes > this: > > lorem ipsum ABC > <[ DEF foobar That split is wrong if you want the non-HTML text to lay out reasonably well in anything but a higher order protocol forcing RTL. You need to split it as: lorem ipsum ABC <[ DEF foobar or lorem ipsum ABC <[ DEF foobar Richard. From unicode at unicode.org Sun Feb 3 20:16:52 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Feb 2019 02:16:52 +0000 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> Message-ID: <20190204021652.6fbe1898@JRWUBU2> On Mon, 4 Feb 2019 00:36:23 +0100 Egmont Koblinger via Unicode wrote: > Now, back to terminals. > > The smallest possible viable definition of a "paragraph" in terminal > emulators is stuff between one newline and the next one. > > It would require a hell of a lot of work, redesigning (overcomplicating) > plenty of basics of terminal emulation to be able to come up with > smaller units, e.g. cells of a table (a concept that doesn't > currently exist in this world); I don't find any such approach > feasible at all. The concept appears to exist in the form of the fields of the fifth edition of ECMA-48. Have you digested this ambitious standard? ECMA-48 has the concept of hyphenation and wrapping! (Well, in Appendix C it does. I haven't fully tied it in with the receipt of characters.) Richard.
From unicode at unicode.org Sun Feb 3 21:25:43 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Mon, 04 Feb 2019 05:25:43 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190203203518.611fd3a8@JRWUBU2> (message from Richard Wordingham via Unicode on Sun, 3 Feb 2019 20:35:18 +0000) References: <8736p69d4h.fsf@turtle-trading.net> <20190202214940.17a05b8d@JRWUBU2> <8336p4x3fm.fsf@gnu.org> <20190203203518.611fd3a8@JRWUBU2> Message-ID: <83lg2wute0.fsf@gnu.org> > Date: Sun, 3 Feb 2019 20:35:18 +0000 > From: Richard Wordingham via Unicode > > > What is "screen overwriting" in this context? > > When instead of adding lines to the bottom, new lines are added on top > of and replace existing lines. I prefer the scrollable terminal > behaviour to the teletype behaviour of Emacs when running the > Linux(?) monitor program 'top', but being a fuddy duddy I prefer the > teletype behaviour of Emacs for 'man'. From an error message from > 'info', it seems that the Emacs buffer is classified as a 'dumb' > terminal. Try customizing scroll-conservatively, it sounds like you want that. From unicode at unicode.org Mon Feb 4 09:37:33 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Mon, 04 Feb 2019 17:37:33 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <83lg2wute0.fsf@gnu.org> (message from Eli Zaretskii via Unicode on Mon, 04 Feb 2019 05:25:43 +0200) References: <8736p69d4h.fsf@turtle-trading.net> <20190202214940.17a05b8d@JRWUBU2> <8336p4x3fm.fsf@gnu.org> <20190203203518.611fd3a8@JRWUBU2> <83lg2wute0.fsf@gnu.org> Message-ID: <83imxzva2q.fsf@gnu.org> > Date: Mon, 04 Feb 2019 05:25:43 +0200 > Cc: unicode at unicode.org > From: Eli Zaretskii via Unicode > > Try customizing scroll-conservatively, it sounds like you want that. Ignore me: I misunderstood what you were looking for. You are right: Emacs doesn't support such scrolling method. 
From unicode at unicode.org Mon Feb 4 10:51:12 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Mon, 04 Feb 2019 18:51:12 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: (message from Egmont Koblinger on Mon, 4 Feb 2019 00:36:23 +0100) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> Message-ID: <83d0o7v6nz.fsf@gnu.org> > From: Egmont Koblinger > Date: Mon, 4 Feb 2019 00:36:23 +0100 > Cc: unicode at unicode.org > > The Unicode BiDi algorithm states that it operates on paragraphs of > text, and leaves it up to a higher protocol to define what a paragraph > exactly is. > > What's the definition of "paragraph" in the context of plain text files? > > I don't think there's a single well-established practice. Actually, UAX#9 defines "paragraph" as the chunk of text delimited by paragraph separator characters. This means characters whose bidi category is B, which includes Newline, the CR-LF pair on Windows, U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR. > In some, e.g. in Emacs's TUTORIAL.he, or markdown files, it's way > more complicated, probably there isn't a well-defined grammar for > how exactly bullet list entries and the like should become new > paragraphs. Actually, Emacs implements the rule that paragraphs are separated by empty lines. This is documented in the Emacs manuals. (That's by default; users and Lisp programs can control that to some extent.) This rule is global, and applied to any file or buffer, including TUTORIAL.he. > lorem ipsum FED ]> CBA foobar > > The visual representation, in a narrower viewport, might wrap for > example like this: > > lorem ipsum CBA > FED ]> foobar I suggest leaving line wrapping alone for the moment: it is a further complication.
Let's first talk about text whose every line ends in a hard newline -- this is what you see in most "simple" text-mode utilities which we are talking about. If/when we solve the problems there, we can then look at the issues with wrapping. > Here comes the twist. Let's view this latter file with a viewer that > uses a _different_ definition for paragraph. Let's view it in Gedit, > Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where > every newline begins a new paragraph; that's how these viewers define > the notion of "paragraph" for the sake of BiDi. > > The visual layout in these viewers becomes: > > lorem ipsum CBA > <[ FED foobar > > which is just not correct. Since here BiDi is run on the two lines > separately, the initial "<[" is treated as LTR, placed at the wrong > location in the wrong order, and the glyphs aren't mirrored. This kind of problem happens all the time, and you cannot avoid it. Different programs display bidi text differently. I propose not to try to solve this problem, because IME it cannot be solved in general. Let's focus on the terminal emulators that should comply with your guidelines, and let's try to decide what they should do about base paragraph direction of text emitted by "simple" text utilities. If they all make decisions by the same rule, they all will show the same text identically. > Now, Emacs ships a TUTORIAL.he which, for most of its contents (but > not everywhere) seems to treat runs between empty lines as paragraphs, Correct. > while Emacs itself is a viewer that treats runs between single > newlines as paragraphs. That is, Emacs is inconsistent with itself. Incorrect. Emacs always treats a run of text between empty lines as a single paragraph, in TUTORIAL.he and everywhere else. There's nothing special about TUTORIAL.he, it is just a plain text file with a few dozen bidi formatting controls, needed to show the key sequences with weak and neutral characters in correct visual order.
(Some of those controls can probably be removed nowadays, since we now have the BPA of Unicode 6.3 -- the file was written before Unicode 6.3 was released.) In fact, I wrote that tutorial as an exercise, to prove to myself that Emacs can be useful for editing non-trivial bidi text. > In case you think I got something wrong with Emacs: Could you please > give exact definitions: > - What are the exact units (so-called "paragraphs" by UAX9) that it > runs BiDi on when it loads and displays a file? See above: for the purpose of the Emacs UBA implementation, paragraphs are separated by empty lines. That is the only rule in Emacs regarding paragraph determination. > - What are the exact units (so-called "paragraphs" by UAX9) in > TUTORIAL.he on which BiDi needs to be run in order to get the desired > readable version? The same. There's nothing special about that file. > What most likely happens is that in order to see a difference, you'd > need to have more special symbols, or at least a more special > constellation of them. Probably TUTORIAL.he is just luckily simple > enough that such a difference isn't hit. No, TUTORIAL.he is neither "lucky" nor "simple". I deliberately used there almost every bidi formatting control there is, where appropriate, to make sure this stuff works as intended in an otherwise plain text file. > Another possibility is (and I cannot check because I can't speak > Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to > get the desired visual one. There's no cheating there, I assure you. > This definition of paragraph (stuff between a newline and the next > one) is the same as the one of Gedit, Emacs etc. when it comes to > displaying BiDi text. At least with Emacs, it is not the same. I think considering each line as a separate paragraph makes writing bidi plain-text documents that look right almost impossible, if each line ends in a newline, as customary in Emacs (and with "simple" text utilities).
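To make the difference between the two paragraph rules concrete, here is a minimal Python sketch. It is not Emacs's or any terminal's actual code: it only approximates UAX#9 rules P2/P3 by taking the first strong character (bidi class L, R or AL) and ignores isolates, and the function names and the sample text are invented for illustration.

```python
import re
import unicodedata

def base_direction(text, default="LTR"):
    """Approximate UAX#9 P2/P3: direction of the first strong character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "LTR"
        if cls in ("R", "AL"):
            return "RTL"
    return default  # no strong character found

def per_line(text):
    """Rule A: every hard newline starts a new paragraph."""
    return [(line, base_direction(line)) for line in text.split("\n") if line]

def per_blank_line(text):
    """Rule B: paragraphs are runs of lines separated by empty lines."""
    out = []
    for para in re.split(r"\n\s*\n", text):
        direction = base_direction(para)
        out.extend((line, direction) for line in para.split("\n") if line)
    return out

# A Hebrew sentence, hard-wrapped so that its second line starts with a Latin word.
sample = ("\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd\n"
          "GNU Emacs \u05d4\u05d5\u05d0 \u05e2\u05d5\u05e8\u05da")

print([d for _, d in per_line(sample)])        # ['RTL', 'LTR']
print([d for _, d in per_blank_line(sample)])  # ['RTL', 'RTL']
```

Under the per-line rule the second line flips to the left margin merely because it happens to begin with a Latin word; under the empty-line rule both lines share the direction of the paragraph's first strong character.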
> Now, it's possible to ponder about other, larger units as possible > definitions. For certain files, surely the right approach would be to > treat parts delimited by empty lines as paragraphs. But how far should > we go? Should terminals understand markdown (one of the most terrible > grammars I've ever seen) and all its popular flavors? Should it > understand Emacs's TUTORIAL.he? Should it understand dpkg's format? > What else? My personal recommendation is to adopt the empty line rule. It's simple enough and gives good results IME. > There's another conceptual problem here. Most terminal emulators don't > understand a single bit of what happens inside them. They don't know > where an application's output begins, where it ends. They don't know > where the shell prompt is. In fact, they have no idea what a shell > prompt is. They only see a single stream of incoming data to process > (print printable characters, and obey to control instructions). > > With the paragraph definition of "between a newline and the next one" > this is not a problem, everything is doable based on what terminals > already know. > > With any other definition, e.g. if you define paragraphs as "separated > by empty lines", still I'm sure you'd need the shell prompt to > terminate the previous paragraph, start a new one (the prompt's and > command line's), and even below the command line where the next > utility's output begins it would also need to start a new paragraph. > But we just don't have this information now. I'm surprised that you describe this as such a complex problem. I think you explained up-thread that terminal emulators should cope with lines of text arriving piecemeal, which I interpreted as meaning that text is stored in the emulator's memory. Modern emulators running on windowed desktops also provide scroll-back buffers, and react to expose events. 
So I think the text that is currently in the viewport, and also some text previously shown, are stored in memory, and can be consulted. However, I'm not an expert on this, so I will take your word that this is a significant complication. My point is that this is a complication that must be solved; it cannot be ignored. If you ignore it and go for the "each line is a paragraph" rule, you will lose many users; you will lose me for sure. > This is why the only reasonable thing I can imagine is to define > paragraph as newline-delimited segments, and leave it up for future > enhancements to introduce other "paragraph" definitions as further > options. IME, this is a grave mistake. I hope I explained why; it is now up to you to decide what to do about that. From unicode at unicode.org Mon Feb 4 10:53:22 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Mon, 04 Feb 2019 18:53:22 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <20190204011921.3411cc77@JRWUBU2> (message from Richard Wordingham via Unicode on Mon, 4 Feb 2019 01:19:21 +0000) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> Message-ID: <83bm3rv6kd.fsf@gnu.org> > Date: Mon, 4 Feb 2019 01:19:21 +0000 > From: Richard Wordingham via Unicode > > On Sun, 03 Feb 2019 19:50:50 +0200 > Eli Zaretskii via Unicode wrote: > > > Do you see how this is carefully formatted to avoid overflowing an > > 80-column line of a typical terminal? Now suppose this is translated > > into a RTL language, which causes the Copyright line to start with a > > strong R letter (because "Copyright" is translated). 
You will see the > > first line flushed to the right margin, then the next line flushed to > > the left margin (because it's a separate paragraph, and starts with a > > strong L letter). Then the line which says "The default action..." > > will again start at the right. And so on and so forth -- the result > > is extremely ugly. > > Depending on the environment. If you look at it in Notepad, all lines > will be LTR or all lines will be RTL. That's because Notepad implements _only_ the higher-level protocol for base paragraph direction: there's no way to make Notepad determine the direction by looking at the text. > Would not a careful translator either ensure that each non-blank > line had a strong character and that all first strong characters > were (a) L, (b) R or (c) AL? This is very hard in practice, and is a tremendous annoyance when translating message catalogs to RTL languages. Translation is a hard enough job even without this complication. From unicode at unicode.org Mon Feb 4 13:21:01 2019 From: unicode at unicode.org (Costello, Roger L. via Unicode) Date: Mon, 4 Feb 2019 19:21:01 +0000 Subject: Does "endian-ness" apply to UTF-8 characters that use multiple bytes? Message-ID: Hello Unicode Experts! As I understand it, endian-ness applies to multi-byte words. Endian-ness does not apply to ASCII characters because each character is a single byte. Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), UTF-32BE and UTF-32LE because each character uses multiple bytes. Clearly endian-ness does not apply to single-byte UTF-8 characters. But what about UTF-8 characters that use multiple bytes, such as the character é, which uses two bytes C3 and A9; does endian-ness apply? For example, if a file is in Little Endian would the character é appear in a hex editor as A9 C3 whereas if the file is in Big Endian the character é would appear in a hex editor as C3 A9?
/Roger From unicode at unicode.org Mon Feb 4 13:29:43 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 4 Feb 2019 11:29:43 -0800 Subject: Does "endian-ness" apply to UTF-8 characters that use multiple bytes? In-Reply-To: References: Message-ID: <9f0af9a7-e1ec-cc4a-37a7-e304771e0ceb@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Feb 4 13:45:13 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Feb 2019 19:45:13 +0000 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83bm3rv6kd.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> Message-ID: <20190204194513.7377857e@JRWUBU2> On Mon, 04 Feb 2019 18:53:22 +0200 Eli Zaretskii via Unicode wrote: > Date: Mon, 4 Feb 2019 01:19:21 +0000 > From: Richard Wordingham via Unicode >> If you look at it in Notepad, all >> lines will be LTR or all lines will be RTL. > That's because Notepad implements _only_ the higher-level protocol for > base paragraph direction: there's no way to make Notepad determine the > direction by looking at the text. Yes. If one has a text composed of LTR and RTL paragraphs, one has to choose how far apart their starting margins are. I think that could get complicated for plain text if the terminal has unbounded width. Richard. From unicode at unicode.org Mon Feb 4 13:47:49 2019 From: unicode at unicode.org (Clive Hohberger via Unicode) Date: Mon, 4 Feb 2019 13:47:49 -0600 Subject: Does "endian-ness" apply to UTF-8 characters that use multiple bytes? 
In-Reply-To: <9f0af9a7-e1ec-cc4a-37a7-e304771e0ceb@ix.netcom.com> References: <9f0af9a7-e1ec-cc4a-37a7-e304771e0ceb@ix.netcom.com> Message-ID: Asmus, I believe it also applies to the bit order in the bytes. I believe UTF-16 and UTF-32 are transmitted as single 16 or 32-bit numbers. UTF-8 is a stream of 8-bit numbers. Clive *Clive P. Hohberger, PhD MBA* Managing Director Clive Hohberger, LLC +1 847 910 8794 cph13 at case.edu *Inventor of the Ultracode Bar Code Symbology* *2017 Label Industry Global Award for Innovation* On Mon, Feb 4, 2019 at 1:29 PM Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 2/4/2019 11:21 AM, Costello, Roger L. via Unicode wrote: > > Hello Unicode Experts! > > As I understand it, endian-ness applies to multi-byte words. > > Endian-ness does not apply to ASCII characters because each character is a single byte. > > Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), UTF-32BE and UTF-32LE because each character uses multiple bytes. > > Clearly endian-ness does not apply to single-byte UTF-8 characters. But what about UTF-8 characters that use multiple bytes, such as the character é, which uses two bytes C3 and A9; does endian-ness apply? For example, if a file is in Little Endian would the character é appear in a hex editor as A9 C3 whereas if the file is in Big Endian the character é would appear in a hex editor as C3 A9? > > /Roger > > > > UTF-8 is a byte stream. Therefore, the order of bytes in a multiple byte > integer does not come into it. > > A./ From unicode at unicode.org Mon Feb 4 14:27:00 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Mon, 4 Feb 2019 15:27:00 -0500 Subject: Does "endian-ness" apply to UTF-8 characters that use multiple bytes? In-Reply-To: References: Message-ID: Endian-ness only affects ordering of bytes within a code unit.
Because UTF-8 has single-byte code units, the order is not affected by endian-ness, only the UTF-8 bit mapping itself. Note also that endian-ness only affects individual 16-bit code units in UTF-16. If you have a surrogate pair, endian-ness doesn't affect the ordering of each 16-bit unit that makes up the pair, only the two bytes within each of the units. James On Mon, Feb 4, 2019 at 2:25 PM Costello, Roger L. via Unicode < unicode at unicode.org> wrote: > Hello Unicode Experts! > > As I understand it, endian-ness applies to multi-byte words. > > Endian-ness does not apply to ASCII characters because each character is a > single byte. > > Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), > UTF-32BE and UTF-32LE because each character uses multiple bytes. > > Clearly endian-ness does not apply to single-byte UTF-8 characters. But > what about UTF-8 characters that use multiple bytes, such as the character > é, which uses two bytes C3 and A9; does endian-ness apply? For example, if > a file is in Little Endian would the character é appear in a hex editor as > A9 C3 whereas if the file is in Big Endian the character é would appear in > a hex editor as C3 A9? > > /Roger > > -- *James Tauber* Eldarion | jktauber.com (Greek Linguistics) | Modelling Music | Digital Tolkien
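The point about code units can be checked directly. The following short Python sketch uses only the standard codecs; the choice of é (U+00E9) follows the question, while U+1F600 is merely an arbitrary supplementary character picked here to show a surrogate pair.

```python
ch = "\u00e9"  # é, U+00E9

# UTF-8: one fixed byte sequence, no byte-order variants.
print(ch.encode("utf-8").hex())      # c3a9

# UTF-16: the two bytes inside the single 16-bit code unit swap.
print(ch.encode("utf-16-be").hex())  # 00e9
print(ch.encode("utf-16-le").hex())  # e900

# A surrogate pair (D83D DE00): the units keep their order,
# only the bytes within each 16-bit unit are swapped.
emoji = "\U0001F600"
print(emoji.encode("utf-16-be").hex())  # d83dde00
print(emoji.encode("utf-16-le").hex())  # 3dd800de
print(emoji.encode("utf-8").hex())      # f09f9880
```

So a hex editor shows C3 A9 for é in a UTF-8 file regardless of the machine that wrote it.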
From unicode at unicode.org Mon Feb 4 14:39:07 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Mon, 04 Feb 2019 22:39:07 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <20190204194513.7377857e@JRWUBU2> (message from Richard Wordingham via Unicode on Mon, 4 Feb 2019 19:45:13 +0000) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> Message-ID: <834l9juw44.fsf@gnu.org> > Date: Mon, 4 Feb 2019 19:45:13 +0000 > From: Richard Wordingham via Unicode > > Yes. If one has a text composed of LTR and RTL paragraphs, one has to > choose how far apart their starting margins are. I think that could > get complicated for plain text if the terminal has unbounded width. But no real-life terminal does. The width is always bounded. From unicode at unicode.org Mon Feb 4 15:00:55 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Feb 2019 21:00:55 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <834l9kx3ja.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <20190203013026.5f12605e@JRWUBU2> <834l9kx3ja.fsf@gnu.org> Message-ID: <20190204210055.78cbcd83@JRWUBU2> On Sun, 03 Feb 2019 18:03:37 +0200 Eli Zaretskii via Unicode wrote: > > Date: Sun, 3 Feb 2019 03:02:13 +0100 > > Cc: unicode at unicode.org > > From: Egmont Koblinger via Unicode > > > > > All I am saying is that your proposal should define what it means > > > by visual order.
> > > > Are you nitpicking on me not giving a precise definition on the > > otherwise IMO freaking obvious "visual order" > > Most probably. The definition is trivial: the order of characters on > display, from left to right. The only possible reason to split hairs > here could be when some characters don't appear on display, like > control characters. Other than that, there should be no doubt what > visual order means. To me, 'visual order' means in the dominant order of the script. So, if one takes it as natural that a decimal number starts with the most significant digits, the decimal numbers used with Arabic are *not* stored in visual order if considered as part of that script. Furthermore, let me quote from the Bidi Algorithm: "In combination with the following rule, this means that trailing whitespace will appear at the visual end of the line (in the paragraph direction)." The 'visual end' is clearly not always the right-hand end! Richard. From unicode at unicode.org Mon Feb 4 15:02:11 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Feb 2019 21:02:11 +0000 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190203205003.7e34f0fd@JRWUBU2> References: <20190202151247.0118dec4@JRWUBU2> <20190203024306.44800aba@JRWUBU2> <83y36wvog2.fsf@gnu.org> <20190203174506.608cdb41@JRWUBU2> <83o97svj7s.fsf@gnu.org> <20190203205003.7e34f0fd@JRWUBU2> Message-ID: <20190204210211.4fbb5337@JRWUBU2> On Sun, 3 Feb 2019 20:50:03 +0000 Richard Wordingham via Unicode wrote: > On Sun, 03 Feb 2019 20:07:51 +0200 > Eli Zaretskii via Unicode wrote: > Which is why I try to remember to issue the emacs command 'M-x shell' > command and issue grep commands from the buffer created thereby. The > point I'm making is that this emacs command hasn't made terminal > emulators obsolete, even though it also does graphics. I now discover that 'M-x term' brings up an Emacs terminal emulator. That gives grep's output the colouring appropriate for a terminal. 
The cell widths vary from line to line. Richard. From unicode at unicode.org Mon Feb 4 15:02:22 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 04 Feb 2019 14:02:22 -0700 Subject: Does "endian-ness" apply to UTF-8 characters that use multiple bytes? Message-ID: <20190204140222.665a7a7059d7ee80bb4d670165c8327d.a154a7d81b.wbe@email03.godaddy.com> http://www.unicode.org/faq/utf_bom.html#utf8-2 -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Feb 4 15:27:39 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Mon, 4 Feb 2019 22:27:39 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <20190204021652.6fbe1898@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204021652.6fbe1898@JRWUBU2> Message-ID: Hi Richard, > The concept appears to exist in the form of the fields of the > fifth edition of ECMA-48. Have you digested this ambitious standard? To be honest: No, I haven't. And I have no idea what those "fields" are. I spent (read: wasted) way too much time studying ECMA TR/53 to get to understand what it's talking about, to realize that the good parts were already obvious to me, and to be able to argue why I firmly believe that the bad parts are bad. Remember: These documents were created in 1991, that is, 28 years ago. (I'm emphasizing it because I did the math wrong for a long time, I thought it was 18 years ago :-D.) Things have changed a lot since then. As for the BiDi docs, I found that the current state of the art, current best practices, and the existing BiDi algorithm differ so much from ECMA's approach (which no one I'm aware of cared to implement for 28 years) that the standard is of pretty little use.
Only a few good parts could be kept (but needed tiny corrections), and plenty of other things needed to be built up anew. This is the only reasonable way to move forward. If you designed a house 2 or 3 years ago, and finally have the money to get it built, you can reasonably start building it. If you designed a house 28 years ago and finally have the chance to build it (including the exact same heating technologies, electrical system etc.), you wouldn't, would you? I'm sure you would look at those plans and, at the very least, start heavily updating them, or start designing a brand new one, perhaps somewhat based on your old ideas. I don't expect it to be any different with "fields" of ECMA-48. I'm not aware of any terminal emulator implementing anything like them, whatever they are. Probably there's a good reason for that. Whatever purpose they aimed to serve apparently wasn't important enough for such a long time. By now, if they're found important, they should probably be solved by some new design (or at the very least, just like I did with TR/53, the work should begin by evaluating that standard to see if it's still feasible). Instead of spending a huge amount of work on my BiDi proposal, I could have just said: "guys, let's go with ECMA for BiDi handling". The thing is, I'm pretty sure it wouldn't have taken us anywhere. I don't expect it to be different with "fields" either. The starting point for my work was the current state of terminal emulators and the surrounding ecosystem, plus the current BiDi algorithm; not some ancient plan that was buried deep in some drawer for almost three decades. I hope this makes sense. That being said, I'd really, honestly love to see if someone evaluated ECMA's "fields" and created a feasibility study for current terminal emulators, similarly to how I did it with TR/53.
cheers, egmont From unicode at unicode.org Mon Feb 4 15:36:22 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Mon, 4 Feb 2019 22:36:22 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <20190204014107.378a54b6@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204014107.378a54b6@JRWUBU2> Message-ID: Hi Richard, > That split is wrong if you want the non-HTML text to lay out reasonably > well in anything but a higher order protocol forcing RTL. You need to > split it as: > > lorem ipsum ABC > <[ DEF foobar Okay, so you should use LRMs or other similar tricks when wrapping a human-perceived paragraph of text. I take it as: - The expected definition of "paragraph", for the technical sake of running the BiDi algorithm, is lines of the text file (that is, between a newline and the next one). - On top of this technical definition, the document is crafted so that lines are not longer than a certain threshold, and the human-perceived paragraphs are usually delimited by empty lines (sometimes by other means, like bullets of a list). Sounds like a reasonable approach to me, probably the best to have. And, by the way, it aligns with my BiDi proposal if the higher level protocol (escape sequences) sets the paragraph direction correctly and disables autodetection.
cheers, egmont From unicode at unicode.org Mon Feb 4 15:51:35 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Mon, 4 Feb 2019 22:51:35 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <834l9juw44.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: > > Yes. If one has a text composed of LTR and RTL paragraphs, one has to > > choose how far apart their starting margins are. I think that could > > get complicated for plain text if the terminal has unbounded width. > > But no real-life terminal does. The width is always bounded. Allegedly the no longer maintained FinalTerm, and maybe another one or two not so popular terminal emulators experimented with this. VTE and a few other emulators have also received such a feature request; VTE has rejected it. See https://bugzilla.gnome.org/show_bug.cgi?id=769440 if you're curious. Indeed BiDi becomes problematic in the sense that Richard pointed out: how far should the starting margins be from each other? Since terminal emulators reject the idea of unbounded width, this is not a problem for them. It might still be a problem for BiDi-aware text viewers/editors, though. I mean one possible, obvious approach could be to adjust them according to the terminal's width. Another is to take it from the file's contents (e.g. longest line). But maybe there's demand for other options, e.g. to have those margins 80 characters away from each other even when the file is viewed on a mobile phone where the viewport is narrower and the user wishes to scroll horizontally. This is up to text viewers/editors to decide.
cheers, egmont From unicode at unicode.org Mon Feb 4 15:58:57 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 4 Feb 2019 13:58:57 -0800 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190204210055.78cbcd83@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <20190203013026.5f12605e@JRWUBU2> <834l9kx3ja.fsf@gnu.org> <20190204210055.78cbcd83@JRWUBU2> Message-ID: <17144faa-29ce-5663-e37c-7acd9305ca2d@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Feb 4 16:15:50 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Feb 2019 22:15:50 +0000 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <834l9juw44.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: <20190204221550.1e52a0ad@JRWUBU2> On Mon, 04 Feb 2019 22:39:07 +0200 Eli Zaretskii via Unicode wrote: > > Date: Mon, 4 Feb 2019 19:45:13 +0000 > > From: Richard Wordingham via Unicode > > > > Yes. If one has a text composed of LTR and RTL paragraphs, one has > > to choose how far apart their starting margins are. I think that > > could get complicated for plain text if the terminal has unbounded > > width. > > But no real-life terminal does. The width is always bounded. The Emacs terminal (M-x term) seems to be a reasonable approximation, with the scroll-left and scroll-right commands changing the margins' separations. 
This is an example of a terminal that has lines with left-to-right character paths and lines with right-to-left character paths. (Such lines are necessarily separated by blank lines.) Geometrically, column positions on left-to-right and right-to-left character paths are incomparable - resizing the window and scrolling move them differently. Richard. From unicode at unicode.org Mon Feb 4 16:24:06 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Mon, 4 Feb 2019 23:24:06 +0100 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190204210055.78cbcd83@JRWUBU2> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <20190203013026.5f12605e@JRWUBU2> <834l9kx3ja.fsf@gnu.org> <20190204210055.78cbcd83@JRWUBU2> Message-ID: Hi, > To me, 'visual order' means in the dominant order of the script. This is not a definition I've come across anywhere else, nor matches my intuition of "visual order" : the exact visual order (recursive definition, yay!) of how you see the glyphs being displayed in the row. > So, > if one takes it as natural that a decimal number starts with the most > significant digits, the decimal numbers used with Arabic are *not* > stored in visual order if considered as part of that script. The visual order is: You get the string rendered properly. You scan with your eyes in one strict direction, and take note of what you see in that order. For example, let's say: "Hello Shalom" (the latter word in Hebrew): HELLO ??????? The logical order: H E L L O space ??? ? ?? ? The visual order, from left to right is: H E L L O space ? ?? ? ??? Similarly, the visual order from right to left (a much more rarely seen concept, the exact reverse of the visual LTR order) is: ??? ? ?? ? 
space O L L E H "Visual order" most of the time means "visual left to right order", although strictly speaking, "visual right to left order" is just as much a visual order. This is all independent from the script's dominant order. > "In combination with the following rule, this means that trailing > whitespace will appear at the visual end of the line (in the paragraph > direction)." > > The 'visual end' is clearly not always the right-hand end! Yes, that's right. (And it doesn't contradict the definition of "visual order". For RTL paragraphs, those trailing whitespaces appear at the beginning of the "visual LTR order"). e. From unicode at unicode.org Mon Feb 4 17:08:10 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Tue, 5 Feb 2019 00:08:10 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83d0o7v6nz.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> Message-ID: Hi Eli, > Actually, UAX#9 defines "paragraph" as the chunk of text delimited by > paragraph separator characters. This means characters whose bidi > category is B, which includes Newline, the CR-LF pair on Windows, > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR. Indeed, this was an oversight on my side. So, with this definition, every single newline character starts a new paragraph. The result of printf "Hello\nWorld\n" > world.txt is a text file consisting of two paragraphs, with 5 characters in each. Correct? > Actually, Emacs implements the rule that paragraphs are separated by > empty lines. This is documented in the Emacs manuals. That is, Emacs overrides UAX#9 and comes up with a different definition?
Furthermore, you argue that in terminals I should follow Emacs's definition rather than Unicode's? Or please clarify if I misunderstood you here. > > while Emacs itself is a viewer that treats runs between single > > newlines as paragraphs. That is, Emacs is inconsistent with itself. > > Incorrect. Emacs always treats a run of text between empty lines as a > single paragraph, in TUTORIAL.he and everywhere else. There's nothing > special about TUTORIAL.he, it is just a plain text file with a few > dozen of bidi formatting controls, needed to show the key sequences > with weak and neutral characters in correct visual order. [...] Thanks for the clarification, I believe it's clear to me now. > At least with Emacs, it is not the same. I think considering each > line as a separate paragraph makes writing bidi plain-text documents > that look right almost impossible, if each line ends in a newline [...] > My personal recommendation is to adopt the empty line rule. It's > simple enough and gives good results IME. [...] > I'm surprised that you describe this as such a complex problem. I > think you explained up-thread that terminal emulators should cope with > lines of text arriving piecemeal, which I interpreted as meaning that > text is stored in the emulator's memory. Modern emulators running on > windowed desktops also provide scroll-back buffers, and react to > expose events. So I think the text that is currently in the viewport, > and also some text previously shown, are stored in memory, and can be > consulted. The problem is not the memory management. Let's look at the following session: ---snip--- prompt$ cat file1.txt This is the first human-perceived paragraph. And this is the second. prompt$ cat file2.txt Here this is the third paragraph. And this one is the fourth. prompt$ ---snip--- If you load the files to Emacs, it is perfectly aware of the contents of the two files. It can define paragraphs however it wants to, and BiDi the files accordingly.
The terminal emulator doesn't know what's a shell prompt, what's a command that the user types, what's the output of that command. (You don't know either from this snippet. Maybe I only cat'ed file1.txt, and "prompt$ cat file2.txt" is just the sixth line of this eleven-line file.) In the terminal emulator's eyes, with Emacs's definition (empty line delimited), this is one paragraph: prompt$ cat file1.txt This is the first human-perceived paragraph. and this is another paragraph: And this is the second prompt$ cat file2.txt Here this is the third paragraph. and similarly for the third one. I believe I understand your concerns with the per-line paragraph definition, but this interpretation that I've just shown most likely leads to even more broken behavior. It's a really nontrivial technical problem to let the terminal emulator know where each prompt, and/or each command's output begins and ends. There's work going on for letting the terminal emulator recognize the prompts, but even if it's successful, it'll probably take 5-10 years to reach the majority of the users. And it probably still wouldn't solve the case of knowing the boundary between the two outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if they're concatenated with "cat file1.txt file2.txt". So, what you're arguing for, is that the default behavior should be something that's: - currently not implementable in a semantically correct way (to stop around shell prompts) due to technical limitations, and - isn't what Unicode says. You have not convinced me that the pros outweigh the cons. That being said, I'm more than open to see such a behavior as a future extension, subject of course to the semantic prompt stuff being available. 
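The session above can be replayed in a few lines of Python to show what each rule yields when all the terminal sees is one undifferentiated stream. This is only a sketch: the exact line layout of the session (where the empty lines fall) is an assumption reconstructed from the description, and both function names are made up here.

```python
import re

# The byte stream as the terminal sees it: prompts, commands and
# file contents are indistinguishable. Layout assumed for illustration.
session = (
    "prompt$ cat file1.txt\n"
    "This is the first human-perceived paragraph.\n"
    "\n"
    "And this is the second.\n"
    "prompt$ cat file2.txt\n"
    "Here this is the third paragraph.\n"
    "\n"
    "And this one is the fourth.\n"
    "prompt$ "
)

def newline_paragraphs(stream):
    """Each non-empty line is its own paragraph."""
    return [line for line in stream.split("\n") if line]

def blank_line_paragraphs(stream):
    """Paragraphs are runs of lines delimited by empty lines."""
    return [p.split("\n") for p in re.split(r"\n\n+", stream) if p]

for para in blank_line_paragraphs(session):
    print(para)
```

With the empty-line rule, the second paragraph comes out as the end of file1.txt, the next prompt and command line, and the start of file2.txt glued together, which is the mis-grouping described above; the newline rule never merges lines from different sources.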
cheers,
egmont

From unicode at unicode.org Mon Feb 4 18:05:47 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 5 Feb 2019 00:05:47 +0000
Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
In-Reply-To:
References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org>
Message-ID: <20190205000547.0a38260b@JRWUBU2>

On Tue, 5 Feb 2019 00:08:10 +0100
Egmont Koblinger via Unicode wrote:

> Hi Eli,
>
> > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > by paragraph separator characters. This means characters whose bidi
> > category is B, which includes Newline, the CR-LF pair on Windows,
> > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.

It actually gives two different definitions. UAX#9 Table 4 restricts the type B to *appropriate* newline functions; not all newlines are paragraph separators.

> Indeed, this was an oversight on my side. So, with this definition,
> every single newline character starts a new paragraph. The result of
>   printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in
> each. Correct?

No, it depends on when a newline function is 'appropriate'. TUS 5.8 Rule R2b applies - 'In simple text editors, interpret any NLF the same as LS'.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines. This is documented in the Emacs manuals.
>
> That is, Emacs overrides UAX#9 and comes up with a different
> definition? Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's? Or please clarify if I
> misunderstood you here.

He's deriving 'B' from a protocol.

Richard.
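For readers who want to check the character-level side of this exchange, the following Python sketch prints the Bidi_Class that the Unicode Character Database assigns to the newline-like characters mentioned above. It uses only the standard `unicodedata` module; the dictionary name and the selection of characters are chosen for illustration. It says nothing about when a newline function is "appropriate", which is the higher-level question being debated.

```python
import unicodedata

# Characters named in the discussion, plus LINE SEPARATOR for contrast.
candidates = {
    "LF": "\n",
    "CR": "\r",
    "NEL": "\u0085",
    "PARAGRAPH SEPARATOR": "\u2029",
    "LINE SEPARATOR": "\u2028",
}

# Map each name to its Bidi_Class as recorded in the character database.
classes = {name: unicodedata.bidirectional(ch) for name, ch in candidates.items()}

if __name__ == "__main__":
    for name, bidi_class in classes.items():
        print(f"{name:20} {bidi_class}")
```

LF, CR, NEL and PARAGRAPH SEPARATOR are all listed with class B, while LINE SEPARATOR is WS. That is why treating a newline as LS, as the simple-text-editor rule quoted above suggests, stops it from ending a bidi paragraph.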
From unicode at unicode.org Mon Feb 4 18:32:34 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Tue, 5 Feb 2019 01:32:34 +0100
Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
In-Reply-To: <83pns8vk05.fsf@gnu.org>
References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org>
Message-ID:

Hi Eli,

> I think it's unreasonable and impractical to expect 'echo', 'cat', and
> its ilk to emit bidi controls (or any other controls) to force
> paragraph direction. For starters, they won't know what direction to
> force, because they don't understand the text they are processing.

I agree, it is unreasonable for 'echo', 'cat' etc. to emit BiDi controls. There could be some higher level helper utilities though, let's say a "bidi-cat" that examines the file, makes a guess, emits the corresponding escape sequences and cats the file. It's not necessarily a good approach, but a possible one (at least temporarily until terminals implement a better one).

On the other hand, it's not unreasonable for higher level stuff (e.g. shell scripts, or tools like "zip") to use such control characters.

> No, this simple case must work reasonably well with the application
> _completely_ oblivious to the bidi aspects. If this can't work
> reasonably well, I submit that the entire concept of having a
> bidi-aware terminal emulator doesn't "hold water".

There isn't a magic wand. I can't magically fix every BiDi stuff by changing the terminal emulator's source code. Not because I'm clumsy, but because it just can't be done. If it was possible, I wouldn't have written a long specification, I would have just done it. (Actually, if it was possible, others would have sure done it long before I joined terminal emulator development.)
There need to be multiple modes, some of them due to the technical particularities of terminal emulation that aren't seen elsewhere (e.g. explicit vs. implicit), and some of them because they are present everywhere where it comes to BiDi (e.g. paragraph direction). And if the mode is not set correctly, things might break, there's nothing new in it.

What my specification essentially modifies is that with this specification, you at least will have a chance to get the mode right.

Currently there are perhaps like 4 different behaviors implemented across terminal emulators when it comes to BiDi. An application cannot control and cannot query the behavior. In order to get Emacs to behave properly, you have to ask your users to adjust a setting (and I cannot repeat enough times that I find this an unacceptable user experience). If the settings of the terminal aren't what Emacs expects, the result could be broken (RTL words might even show up in reverse, LTR order).

The same goes for the random example of "zip -h", assuming that they add Hebrew translation. Given the current set of popular terminal emulators, there's no way zip could emit some Hebrew text in a reliably readable way. Whatever it does, there will be terminal emulators (and settings thereof) where the result is totally broken (reversed), or at least unpleasant (wrong paragraph direction used). Moreover, if "zip" emits the Hebrew text in the semantically correct logical order (e.g. they use whatever existing framework, like gettext and a popular .po editor), as opposed to the visual LTR order seen in some legacy systems, it will need different terminal emulator settings than Emacs, so if someone uses both zip and Emacs regularly, they'll have to continuously toggle their terminal's settings back and forth - have I mentioned how unacceptable I find this as a user? :)

One of the key points of my specification is that applications will be able to automatically set the mode.
Emacs will be able to switch to the mode it requires, and so will zip. They will have the opportunity. If they don't take this opportunity, it's not my problem, and there's nothing I could do about it.

Let's say hypothetically that zip adds Hebrew translations, but refuses to emit the escape sequence that switches to RTL paragraph direction, and thus its result doesn't look perfect. Can terminal emulators, can my specification, can I be blamed in this case? I don't think so. If zip knows exactly what it wants to print (as with the help page it knows for sure), and is given all the technical infrastructure to reliably achieve that, it'd be solely them to blame if they refused to properly use it. It's absolutely out of the scope of my work to try to fix this case.

"cat" is substantially different. In case of "zip", the creators of that software know exactly what the output should look like, and according to my specification (assuming a conforming terminal emulator, of course) nothing stops them from achieving it. "cat" doesn't know, cannot know the desired look, since the file itself lacks this information.

Paragraph direction is a concept that sucks big time. (I have no idea how Unicode could have got it better, though.) It's a piece of information that needs to be carried externally along with the text, in order to make sure it'll be displayed correctly. It's a pain in the butt, just as much as carrying the encoding in the pre-Unicode days was, which hardly anyone cared about, resulting in incorrect accented letters way too often. Practically everyone's lazy and doesn't carry the paragraph direction, or there isn't even a place for carrying it. Should there be a meta bit on the filesystem for plain text files, or what? In practice, often you just guess.
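As an illustration of what "just guess" usually amounts to, here is a minimal Python sketch of a first-strong-character heuristic, loosely modelled on rules P2 and P3 of UAX#9. It is a simplification under stated assumptions: it ignores isolates and embeddings, the function name `guess_direction` is hypothetical, and it is not what any particular terminal emulator or the proposed specification does.

```python
import unicodedata


def guess_direction(text, default="ltr"):
    """Guess a paragraph direction from the first strong character.

    Simplified: characters inside isolates are not skipped, unlike the
    full P2 rule in UAX#9.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found: fall back to the caller's default.
    return default


if __name__ == "__main__":
    for sample in ["Hello", "\u05e9\u05dc\u05d5\u05dd", "123 \u05d0", ""]:
        print(repr(sample), guess_direction(sample))
```

The unit of text handed to such a function is exactly what the thread is arguing about: a single line, an empty-line-delimited block, or a whole file can each yield a different guess for the same content.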
I understand your worries that for the "cat file" use case, it would be great to have a mode of the terminal emulator where the entire file's direction is guessed at once, and then applied to each of its paragraphs (whereas "paragraph" can still be reasonably defined in at least two ways). I second that it would be great to have such a mode, but as I've detailed in a previous mail, we don't have the necessary technical information (boundaries of a command's output) for this. That is why I put this on hold for now. "zip -h" is in a much better situation, it knows what it wants to print, knows what mode (e.g. what paragraph direction) is required for that, and as of my proposal, will be able to switch the terminal to that mode. > -A adjust self-extracting exe -J junk zipfile prefix (unzipsfx) > -T test zipfile integrity -X eXclude eXtra file attributes > -! use privileges (if granted) to obtain all aspects of WinNT security > > Do you see how this is carefully formatted to avoid overflowing an > 80-column line of a typical terminal On a totally side note: If you're about to internationalize your software, this layout is a pretty bad choice. It's handcrafted, and requires each translator to handcraft the translation, manually fiddle with the number of spaces. Which in turn requires that translators are familiar enough with the software to compile and test it with the work-in-progress translations, which often isn't the case. It's only translateable as one giant unit of text, rather than each message separately. (Or, a somewhat complex formatting engine needs to be implemented in the app, e.g. to decide which string spans across both columns.) With one giant unit, translators have a hard time to spot changes, and every change results in fuzziness, that is, when a new option is introduced, even the old ones will revert to English until the translator catches up. 
This kind of formatting also ignores that English is a pretty dense language, in other languages the strings tend to become longer. Anyway, with my BiDi proposal, zip will have the chance to produce whatever beautiful handcrafted two-column layout that it wants to have for its Hebrew help page. (How they handcraft and later maintain it is not my concern.) In addition to printing the translated text, they'll also have to switch the terminal into whichever BiDi mode of their choice which corresponds to the text. I just cannot reliably guess it in the terminal for them. cheers, egmont From unicode at unicode.org Mon Feb 4 19:28:50 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Tue, 5 Feb 2019 02:28:50 +0100 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators) In-Reply-To: <83d0o7v6nz.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> Message-ID: Hi Eli, > IME, this is a grave mistake. I hope I explained why; it is now up to > you to decide what to do about that. Let me share one more thought. I have to admit, I'm not an Emacs user, I only have some vague ideas how powerful a tool it is. But in its very core I still believe it's a text editor ? is it fair to say this? It could be used for example to conveniently create TUTORIAL.he. I'm not aware of all the kinds of works you can do in Emacs, but I have a feeling that the kind of work you do in a terminal emulator is potentially more diverse. (Let's not nitpick that a terminal can run emacs and emacs has a terminal inside so mathematically speaking it's all the same...) 
"cat TUTORIAL.he" is indeed one of the commands you can execute in a terminal, and unfortunately, given what terminals currently understand from their contents, I just cannot make it display as you would prefer (and I agree would make a lot of sense). But it's just one use case. There are plenty of line-oriented tools. Think of "head" and "tail". They operate on lines of files, which end up being paragraphs in the terminal according to my definition. According to your definition, they could cut a paragraph in half, they could render differently than as if the entire file was printed. According to my definition, you'll always get the same visual repsesentation, just on the given fragment of the file. Think of "grep", possibly combined with "-r" to process files recursively, and "-C" to print context lines. Not only it can cut paragraphs (of your definition) in half when it displays the matching line (plus context), but also how would you locate in its output when it switches from one match's context to the next match's context within the same file, or to a match in another file? How would you define a paragraph, and how would you define the bigger unit on which the paragraph direction is guessed? I think it's again a use case where my definition of paragraph is less problematic than yours. Think of ad-hoc shell scripts that use "echo"/"printf" to inform the user, "read" to read data etc. Or utilities written in C or whatever that don't care about terminals at all, just print output. In these cases there's no one formatting / wrapping at 80 columns performed by the app. A logical segment is typically printed as a single line, which will be wrapped by the terminal if doesn't fit in the current width (and in some terminals rewrapped when the terminal is resized), this matches my definition of paragraph. There's rarely an empty line injected in these cases; if there is, it is most likely to separate some even bigger semantical units. 
There are just sooooooo many use cases, it's impossible to perfectly address all of them at once. "cat TUTORIAL.he" is just one of them, not necessarily the most typical, not necessarily the one that should drive the BiDi design. Let's note that the four "BiDi-aware" terminals that I could test all define paragraphs as lines ? I mean visual lines on their own canvas. If the terminal is 80 characters wide, and a utility prints a line of 100 characters, it'll obviously wrap into 80+20 characters. And then these terminals treat them as two separate paragraphs, one with 80 characters and one with 20, and run BiDi separately on them. I'm confident that my specification which says that it should be preserved as a 100 character long paragraph and passed to BiDi accordingly is already a significant step forward. cheers, egmont From unicode at unicode.org Mon Feb 4 19:44:23 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 5 Feb 2019 01:44:23 +0000 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204021652.6fbe1898@JRWUBU2> Message-ID: <20190205014423.6dedc82f@JRWUBU2> On Mon, 4 Feb 2019 22:27:39 +0100 Egmont Koblinger via Unicode wrote: > Hi Richard, > > > The concept appears to exist in the form of the fields of the > > fifth edition of ECMA-48. Have you digested this ambitious > > standard? > > To be honest: No, I haven't. And I have no idea what those "fields" > are. (Taken out of order) > That being said, I'd really, honestly love to see if someone evaluated > ECMA's "fields" and created a feasibility study for current terminal > emulators, similarly to how I did it with TR/53. 
They mostly seem to be security, protection and checking features. They seem to make sense for a captive system used as a till or for stock look-up by customers. For example, fields can be restricted as to how they are overwritten, e.g. not at all, or only with numbers, and some fields cannot be copied from the terminal. HTML forms seem to provide most of this functionality nowadays. Fields are persistent attributes.

On reading further, the pane boundary functionality seems to be provided by the 'line home position' and 'line limit position'. These would have to be re-established whenever a pane became the active pane, but they seem to support the notion of writing a paragraph into a pane, with the terminal sorting out the splitting into lines. I'm not sure that this would be portable between ECMA-48 terminals; I get the impression that there would be a reliance on unstandardised behaviour being appropriate. I could be wrong; the specification may be there.

> I spent (read: wasted) way too much time studying ECMA TR/53 to get to
> understand what it's talking about, to realize that the good parts
> were already obvious to me, and to be able to argue why I firmly
> believe that the bad parts are bad. Remember: These documents were
> created in 1991, that is, 28 years ago. (I'm emphasizing it because I
> did the math wrong for a long time, I thought it was 18 years ago :-D.)
> Things have changed a lot since then.

It took me a while to work out that the recommendations of ECMA TR/53 had been implemented in Issue 5 of ECMA-48.

> As for the BiDi docs, I found that the current state of the art,
> current best practices, existing BiDi algorithm differ so much from
> ECMA's approach (which no one I'm aware of cared to implement for 28
> years) that the standard is of pretty little use. Only a few good
> parts could be kept (but needed tiny corrections), and plenty of other
> things needed to be built up anew. This is the only reasonable way to
> move forward.
The relationship between the data store and the presentation store don't seem to be very well defined. There may be room for the BiDi algorithm there. > If you designed a house 2 or 3 years ago, and finally have the money > to get it built, you can reasonably start building it. If you designed > a house 28 years ago and finally have the chance to build it > (including the exact same heating technologies, electrical system > etc.), you wouldn't, would you? I'm sure you looked at those plans, > and started at the very least heavily updating them, or started to > design a brand new one, perhaps somewhat based on your old ideas. But a scheme may be more persuasive if it can be said to conform to ECMA-48. One thing that is very unclear in ECMA-48 is how characters are allocated to cells in 'implicit' mode. As the Arabic encoding considered contained harakat, it looks as though the allocation is defined by 'unspecified protocols'. I note that in the scheme apparently given most consideration, forced Arabic presentation forms are selected by a combination of escape sequences and Arabic letters. The 'unspecified protocols' could be interpreted as one grapheme cluster* per group of cells. The typical groups would be one cell and the two cells for a CJK character. *Grapheme cluster is a customisable concept. > I don't expect it to be any different with "fields" of ECMA-48. I'm > not aware of any terminal emulator implementing anything like them, > whatever they are. Probably there's a good reason for that. Whatever > purpose they aimed to serve apparently wasn't important enough for > such a long time. By now, if they're found important, they should > probably be solved by some new design (or at the very least, just like > I did with TR/53, the work should begin by evaluating that standard to > see if it's still feasible). > Instead of spending a huge amount of work on my BiDi proposal, I could > have just said: "guys, let's go with ECMA for BiDi handling". 
The > thing is, I'm pretty sure it wouldn't have taken us anywhere. I don't > expect it to be different with "fields" either. Your interpretation document would have explored the issues. > The starting point for my work was the current state of terminal > emulators and the surrounding ecosystem, plus the current BiDi > algorithm; not some ancient plan that was buried deep in some drawer > for almost three decades. I hope this makes sense. You're assuming that the committee process didn't add much value to the standard. Richard. From unicode at unicode.org Mon Feb 4 21:27:30 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 05 Feb 2019 05:27:30 +0200 Subject: Proposal for BiDi in terminal emulators In-Reply-To: <20190204210055.78cbcd83@JRWUBU2> (message from Richard Wordingham via Unicode on Mon, 4 Feb 2019 21:00:55 +0000) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <20190203013026.5f12605e@JRWUBU2> <834l9kx3ja.fsf@gnu.org> <20190204210055.78cbcd83@JRWUBU2> Message-ID: <831s4mvrrx.fsf@gnu.org> > Date: Mon, 4 Feb 2019 21:00:55 +0000 > From: Richard Wordingham via Unicode > > > The definition is trivial: the order of characters on > > display, from left to right. The only possible reason to split hairs > > here could be when some characters don't appear on display, like > > control characters. Other than that, there should be no doubt what > > visual order means. > > To me, 'visual order' means in the dominant order of the script. That is not the correct definition, IMO. > Furthermore, let me quote from the Bidi Algorithm: > > "In combination with the following rule, this means that trailing > whitespace will appear at the visual end of the line (in the paragraph > direction)." > > The 'visual end' is clearly not always the right-hand end! This talks about the "visual end", not about "visual order". 
From unicode at unicode.org Mon Feb 4 23:20:27 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 5 Feb 2019 05:20:27 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <20190128205839.7b06658c@JRWUBU2>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2> <20190127233840.72bd25cb@JRWUBU2> <20190128205839.7b06658c@JRWUBU2>
Message-ID:

On 2019-01-28 8:58 PM, Richard Wordingham wrote:

> On Mon, 28 Jan 2019 03:48:52 +0000
> James Kass via Unicode wrote:
>
>> It's been said that the text segmentation rules seem over-complicated
>> and are probably non-trivial to implement properly. I tried your
>> suggestion of WORD JOINER U+2060 after tau ( ???????? ?? ), but it
>> only added yet another word break in LibreOffice.
>
> I said we *don't* have a control that joins words. The text of TUS
> used to say we had one in U+2060, but that was removed in 2015. I
> pleaded for the retention of this functionality in document
> L2/2015/15-192, but my request was refused. I pointed out in ICU
> ticket #11766 that ICU's Thai word breaker retained this facility. ...

Sorry for sounding obtuse there. It was your *post* which suggested the use of WORD JOINER. You did clearly assert that it would not work. So, human nature, I /had/ to try it and see.

It. did. not. work. (No surprise.) But it /should/ have worked. It's a JOINER, for goodness sake!

When the author/editor puts any kind of JOINER into a text string, what's the intent? What's the po?nt of having a JOINER that doesn't?

Recently I put a ZWJ between the "c" and the "t" in the word "Respec?tfully" as an experiment. Spellchecker flagged both "respec" and "tfully" as being misspelt, which they probably are.
A ZWNJ would have been used if there had been any desire for the string to be *split* there, e.g., to forbid formation of a discretionary ligature. Instead the ZWJ was inserted, signalling authorial intent that a "more joined" form of the "c-t" substring was requested.

Text a man has JOINED together, let not algorithm put asunder.

From unicode at unicode.org Tue Feb 5 00:04:46 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 5 Feb 2019 06:04:46 +0000
Subject: Encoding italic
In-Reply-To:
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com>
Message-ID: <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com>

Philippe Verdy responded to William Overington,

> the proposal would contradict the goals of variation selectors and would
> pollute the variation sequences registry (possibly even creating conflicts).
> And if we admit it for italics, then another VSn will be dedicated to bold,
> and another for monospace, and finally many would follow for various
> style modifiers.
> Finally we would no longer have enough variation selectors for all requests).

There are 256 variation selector characters. Any use of variation sequences not registered by Unicode would be non-conformant.

William's suggestion of floating a proposal for handling italics with VS14 might be an example of the old saying about "putting the cart before the horse". Any preliminary proposal would first have to clear the hurdle of the propriety of handling italic information at the plain-text level.
Such a proposal might list various approaches for accomplishing that, if that hurdle can be surmounted. From unicode at unicode.org Tue Feb 5 03:06:52 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Tue, 5 Feb 2019 04:06:52 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2> <20190127233840.72bd25cb@JRWUBU2> <20190128205839.7b06658c@JRWUBU2> Message-ID: On Tue, Feb 5, 2019 at 12:23 AM James Kass via Unicode wrote: > Text a man has JOINED together, let not algorithm put asunder. > I was hoping so much that ? ??? ? ???? ?????????? ???????? ?? ???????? would have an apostrophe but alas no. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Feb 5 04:23:55 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 5 Feb 2019 10:23:55 +0000 Subject: Encoding italic In-Reply-To: <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> Message-ID: William Overington wrote, > Well, a proposal just about using VS14 to indicate a request for an > italic version of a glyph in plain text, including a suggestion of to > which characters it could apply, would test whether such a proposal > would be accepted to go into 
the Document Register for the Unicode > Technical Committee to consider or just be deemed out of scope and > rejected and not considered by the Unicode Technical Committee. As long as ?italics in plain-text? is considered out-of-scope by Unicode, any proposal for handling italics in plain-text would probably be considered out-of-scope, as well.? But I could be wrong and wouldn?t mind seeing a proposal. From unicode at unicode.org Tue Feb 5 04:03:49 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 5 Feb 2019 10:03:49 +0000 (GMT) Subject: Encoding italic In-Reply-To: <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> Message-ID: <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> James Kass wrote: > William?s suggestion of floating a proposal for handling italics with > VS14 might be an example of the old saying about ?putting the cart > before the horse?. Well, a proposal just about using VS14 to indicate a request for an italic version of a glyph in plain text, including a suggestion of to which characters it could apply, would test whether such a proposal would be accepted to go into the Document Register for the Unicode Technical Committee to consider or just be deemed out of scope and rejected and not considered by the Unicode Technical Committee. 
If the proposal were allowed to become included in the Document Register of the Unicode Technical Committee then if other people wish to submit comments and other proposals then that would be possible as it would have become established that such a topic is deemed acceptable for placing into the Document Register of the Unicode Technical Committee. William Overington Tuesday 5 February 2019 From unicode at unicode.org Tue Feb 5 10:01:41 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Tue, 5 Feb 2019 16:01:41 +0000 Subject: Encoding italic In-Reply-To: <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> Message-ID: On Tue, 5 Feb 2019 at 15:34, wjgo_10009 at btinternet.com via Unicode wrote: > > italic version of a glyph in plain text, including a suggestion of to > which characters it could apply, would test whether such a proposal > would be accepted to go into the Document Register for the Unicode > Technical Committee to consider or just be deemed out of scope and > rejected and not considered by the Unicode Technical Committee. Just reminding you that "The initial character in a variation sequence is never a nonspacing combining mark (gc=Mn) or a canonical decomposable character" (The Unicode Standard 11.0 ?23.4). 
This means that a variation sequence cannot be defined for any precomposed letters and diacritics, so for example you could not italicize the word "fête" by simply adding VS14 after each letter because "ê" (in NFC form) cannot act as the base for a variation sequence. You would have to first convert any text to be italicized to NFD, then apply VS14 to each non-combining character. This alone would make a VS solution unacceptable in my opinion.

Andrew

From unicode at unicode.org Tue Feb 5 10:05:07 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Tue, 05 Feb 2019 18:05:07 +0200
Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
In-Reply-To: (message from Egmont Koblinger on Tue, 5 Feb 2019 00:08:10 +0100)
References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org>
Message-ID: <83y36ute4s.fsf@gnu.org>

> From: Egmont Koblinger
> Date: Tue, 5 Feb 2019 00:08:10 +0100
> Cc: unicode at unicode.org
>
> every single newline character starts a new paragraph. The result of
>   printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in each. Correct?

Yes.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines. This is documented in the Emacs manuals.
>
> That is, Emacs overrides UAX#9 and comes up with a different
> definition?

Yes, Emacs uses the "higher-level protocols" clause in HL1, when the paragraph direction is to be determined from the text. (There's also a way for the user or a Lisp program to force a certain base paragraph direction on all paragraphs in a window that displays some text.)

> Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's?
IME, what Emacs uses gives much better results, yes. > I believe I understand your concerns with the per-line paragraph > definition, but this interpretation that I've just shown most likely > leads to even more broken behavior. I don't see how the result could be more broken, when the decisions about base paragraph direction are made much more rarely. The places in text where the paragraph direction will be determined under my proposal is a small subset of the places where it will be determined by the default UBA rules. So it will make the same mistakes as the each-line-is-a-new-paragraph method, but there will be much fewer of such mistakes. In addition to this theoretical argument, I have 10 years of using this in Emacs to back me up. The only difference between Emacs and your example is the very first paragraph. > It's a really nontrivial technical problem to let the terminal > emulator know where each prompt, and/or each command's output begins > and ends. There's work going on for letting the terminal emulator > recognize the prompts, but even if it's successful, it'll probably > take 5-10 years to reach the majority of the users. And it probably > still wouldn't solve the case of knowing the boundary between the two > outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if > they're concatenated with "cat file1.txt file2.txt". I think you are trying to find a perfect solution, and because it probably doesn't exist, or at least is hard to come by, you conclude that a solution that is imperfect should be rejected. But I'm not saying my proposal is the perfect solution, just that it is better (sometimes, way better) than the default of considering each line a paragraph. > So, what you're arguing for, is that the default behavior should be > something that's: > - currently not implementable in a semantically correct way (to stop > around shell prompts) due to technical limitations, and > - isn't what Unicode says. 
The first point has to do with the search for a perfect solution. My advice is to settle for something reasonable even if it is not perfect. The second point is incorrect: the UBA explicitly allows the implementation to apply higher-level protocols for paragraph direction, see HL1 in UAX#9. > You have not convinced me that the pros outweigh the cons. There are no cons in my proposal that aren't already present in the default each-line-is-a-new-paragraph rule. So even if the pros don't outweigh the cons, the balance should be better than under the default. > That being said, I'm more than open to see such a behavior as a > future extension, subject of course to the semantic prompt stuff > being available. I think the default should provide reasonably good display, and each-line-is-a-new-paragraph doesn't. From unicode at unicode.org Tue Feb 5 10:06:03 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 05 Feb 2019 18:06:03 +0200 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: <20190205000547.0a38260b@JRWUBU2> (message from Richard Wordingham via Unicode on Tue, 5 Feb 2019 00:05:47 +0000) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> Message-ID: <83womete38.fsf@gnu.org> > Date: Tue, 5 Feb 2019 00:05:47 +0000 > From: Richard Wordingham via Unicode > > > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited > > > by paragraph separator characters. This means characters whose bidi > > > category is B, which includes Newline, the CR-LF pair on Windows, > > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR. > > It actually gives two different definitions. 
Table UAX#9 4 restricts > the type B to *appropriate newline functions; not all newlines are > paragraph separators. For what exactly is "appropriate newline function" one should read the Unicode Standard, section 5.8. My conclusions from that are different from yours; see below. > > Indeed, this was an oversight on my side. So, with this definition, > > every single newline character starts a new paragraph. The result of > > printf "Hello\nWorld\n" > world.txt > > is a text file consisting of two paragraphs, with 5 characters in > > each. Correct? > > No, it depends on when a newline function is 'appropriate'. TUS 5.8 > Rule R2b applies - 'In simple text editors, interpret any NLF the same > as LS'. That's not all of what the Standard says. Just a couple of paragraphs above Rule R2b, there's this text: Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpretation. Only on output is it necessary to distinguish between them. So in practice, IMO the above example does constitute 2 paragraphs, regardless of the underlying platform's conventions. From unicode at unicode.org Tue Feb 5 10:07:04 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 05 Feb 2019 18:07:04 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Egmont Koblinger on Tue, 5 Feb 2019 01:32:34 +0100) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> Message-ID: <83va1yte1j.fsf@gnu.org> > From: Egmont Koblinger > Date: Tue, 5 Feb 2019 01:32:34 +0100 > Cc: unicode at unicode.org > > On the other hand, it's not unreasonable for higher level stuff (e.g. > shell scripts, or tools like "zip") to use such control characters. 
Yes, but most of them won't ever do that. > > No, this simple case must work reasonably well with the application > > _completely_ oblivious to the bidi aspects. If this can't work > > reasonably well, I submit that the entire concept of having a > > bidi-aware terminal emulator doesn't "hold water". > > There isn't a magic wand. I can't magically fix every BiDi stuff by > changing the terminal emulator's source code. I didn't say "magically fix", I said "work reasonably well". I think it would be a mistake to demand that any alternative to the default each-line-is-a-new-paragraph method must be perfect. It should be enough if an alternative is better. > What my specification essentially modifies is that with this > specification, you at least will have a chance to get the mode right. My experience is that this is an important feature to have, but it will (maybe even should) be used rather rarely. In most cases you will just have plain text. Moreover, emitting the control sequences that set the mode is in itself a complication, because if the terminal doesn't support them, the result could be corrupted display. You will need methods of detecting the support, and those detection methods usually involve sending another control sequence to the terminal and waiting for response, something that complicates applications and causes delays in displaying output. > In case of "zip", the creators of that software know exactly how the > output should look like Not necessarily true. The translations are normally prepared by people who are experts only in translating messages, they don't necessarily consider layout issues, because for that you'd need to look at the code or even run the program, something translators are unlikely to do. > If you're about to internationalize your software, this layout is a > pretty bad choice. Tell me about that! 
But the reality is that this is what you get, and IMO the solution for displaying this on a terminal should work reasonably well with that. > This kind of formatting also ignores that English is a pretty dense > language, in other languages the strings tend to become longer. Actually, some/many RTL scripts tend to produce shorter text, because vowels are not written, and because many words have very short roots. But this is a tangent. From unicode at unicode.org Tue Feb 5 10:07:59 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 05 Feb 2019 18:07:59 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Egmont Koblinger on Tue, 5 Feb 2019 02:28:50 +0100) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> Message-ID: <83tvhite00.fsf@gnu.org> > From: Egmont Koblinger > Date: Tue, 5 Feb 2019 02:28:50 +0100 > Cc: unicode at unicode.org > > I have to admit, I'm not an Emacs user, I only have some vague ideas > how powerful a tool it is. But in its very core I still believe it's a > text editor - is it fair to say this? It could be used for example to > conveniently create TUTORIAL.he. It is a text editing/processing environment which has a lot of text-based applications built on top of it. It could be (and was) used to create TUTORIAL.he, but it can be and is used for much more. > There are plenty of line-oriented tools. > [...] Actually, for every utility you mention, Emacs has a command that either invokes the utility and presents its output, or does the same job by using built-in features. So most/all of the jobs you mention are routinely done in Emacs.
After all, Emacs is a programmer's editor at its core, so every job programmers routinely do from the shell prompt has an equivalent feature in Emacs. You can even run shells inside Emacs, with Emacs serving as a terminal emulator (which then supports bidi ;-). > There are just sooooooo many use cases, it's impossible to perfectly > address all of them at once. I don't think you need to look for a perfect solution. You need to look for one that works reasonably well in practice. It is my experience in Emacs that the empty line as paragraph delimiter produces much better results than if you treat each line as a separate paragraph. We do have in Emacs features that allow to override the default paragraph direction, but experience shows that they are used relatively rarely. > I'm confident that my specification which says that it should be > preserved as a 100 character long paragraph and passed to BiDi > accordingly is already a significant step forward. I agree, but I urge you to make one more step, which IME is really necessary. From unicode at unicode.org Tue Feb 5 10:51:51 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Feb 2019 17:51:51 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <834l9juw44.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: I think that before making any decision we must make some decision about what we mean by "newlines". 
There are in fact 3 different functions:

- (1) soft line breaks (which are used to enforce a maximum display width between paragraph margins): these are equivalent to breakable and compressible whitespaces, and do not change the logical paragraph direction; they don't insert any additional vertical gap between lines, so the logical line-height is preserved and continues uninterrupted. If text justification applies, this whitespace will be entirely collapsed into the end margin, and any text before it will still be justified to match the end margin (until the maximum expansion of other whitespaces in the middle is reached, and the maximum intercharacter gap is also reached, in which case that line will no longer be expanded), but this does not apply to terminal emulators, which normally never use text justification, so the text will just be aligned to the start margin and whitespaces before it on the same line are preserved, and collapsed only at the end of the line (just before the soft line break itself).
- (2) hard line breaks: they break to a new line but continue the paragraph within its same logical direction, but they are not compressible whitespaces (and do not depend on the logical end margin of the paragraph).
- (3) paragraph breaks: generally they introduce an additional vertical gap with top and bottom margins.

The problem in terminals is that they usually cannot distinguish types (1) and (2): they are simply encoded by a single CR, or LF, or CR+LF, or NEL. Type (1) exists only within the framework of a higher level protocol which gives additional interpretation to these "newlines". The special control LS is almost never used but may be used for type (1), i.e. soft line-breaks, and will fall back to type (2), which is represented by the legacy "simple" newlines (single CR, or single LF, or single CR+LF, or single NEL). I have seen very little or no use of the LS (line separator) special control.
Type (3) may be encoded with PS (paragraph separator), but in terminals (and common protocols like MIME) it is usually encoded using a pair of newlines (CR+CR, or LF+LF, or CR+LF+CR+LF, or NL+NL) possibly with additional whitespaces (and additional presentation characters such as ">" in quotations inserted in mail responses) between them (needed for MIME and HTTP) which may be collapsed when rendering or interpreting them. Some terminal protocols can also use other legacy ASCII separators such as FS, GS, RS, US for grouping units containing multiple paragraphs, or STX/EOT pairs for encapsulating whole text documents in a protocol-specific envelope format (and will also use some escaping mechanism for special controls found in the middle, such as DLE+control to escape the control, or DLE+0 to escape a NUL, or DLE+# to escape a DEL, or DEL+x+NN where N are a fixed number of hexadecimal, decimal or octal digits). There's a wide variety of escaping mechanisms used by various higher-layer protocols (including transport protocols or encoding syntaxes used just below the plain-text layer, in a lower layer than the transport protocol layer). On Mon, 4 Feb 2019 at 21:46, Eli Zaretskii via Unicode wrote: > > Date: Mon, 4 Feb 2019 19:45:13 +0000 > > From: Richard Wordingham via Unicode > > > > Yes. If one has a text composed of LTR and RTL paragraphs, one has to > > choose how far apart their starting margins are. I think that could > > get complicated for plain text if the terminal has unbounded width. > > But no real-life terminal does. The width is always bounded. > From unicode at unicode.org Tue Feb 5 11:55:35 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 5 Feb 2019 17:55:35 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> Message-ID: <20190205175535.5d0d0a3a@JRWUBU2> On Tue, 5 Feb 2019 16:01:41 +0000 Andrew West via Unicode wrote: > You would > have to first convert any text to be italicized to NFD, then apply > VS14 to each non-combining character. This alone would make a VS > solution unacceptable in my opinion. What is so unacceptable about having to do this? Richard. From unicode at unicode.org Wed Feb 6 08:30:24 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Wed, 6 Feb 2019 14:30:24 +0000 (GMT) Subject: mildly OT from bidi - curious email Message-ID: The current bidi discussion prompts me to post a curiosity I received today. I ordered something from a (UK) company, and the payment receipt came via Stripe. So far, so common. The curious thing is that the (entirely ASCII) company name was enclosed in a left-to-right direction, thus: Subject: Your Aaaaaaa Ltd receipt [#nnnn-nnnn] where and are the bidi control characters. I don't think I've seen this before - I wonder why it happened? Also today I got an otherwise ASCII message where every paragraph started with BOM (or ZWNBSP as my font prefers to call it). I see from the web that people used to do this - anybody know what the most common software packages that do it are?
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Wed Feb 6 08:53:49 2019 From: unicode at unicode.org (Arthur Reutenauer via Unicode) Date: Wed, 6 Feb 2019 15:53:49 +0100 Subject: mildly OT from bidi - curious email In-Reply-To: References: Message-ID: <20190206145349.qm3fb4k6jzvruiy6@phare.normalesup.org> On Wed, Feb 06, 2019 at 02:30:24PM +0000, Julian Bradfield via Unicode wrote: > So far, so common. The curious thing is that the (entirely > ASCII) company name was enclosed in a left-to-right direction, thus: > > Subject: Your Aaaaaaa Ltd receipt [#nnnn-nnnn] > > where and are the bidi control characters. > > I don't think I've seen this before - I wonder why it happened? Maybe Stripe stores merchant names with surrounding bidi control characters, so that they're always rendered in the appropriate direction, even by systems that don't implement the bidi algorithm? Since the subject is clearly generated automatically from at least three different sources, I can imagine wanting this sort of weak guarantee that merchant names are always marked with the correct writing direction, even if they're embedded in a different-language string. The directional characters would only need to be added once.
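The idea of adding the controls once, when the name is stored, could look roughly like the Python sketch below. Which controls Stripe really uses is not known from this thread (they were lost from the quoted subject line), so the LRE/RLE/PDF embedding characters and the function names `wrap_with_direction` and `build_subject` are assumptions made purely for illustration.

```python
# Hypothetical sketch: the actual controls used by the sender are unknown;
# the explicit embedding characters below are only one possible choice.
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_with_direction(name, rtl=False):
    """Enclose a name in an explicit embedding so it carries its own direction."""
    return (RLE if rtl else LRE) + name + PDF


def build_subject(stored_name, receipt_id):
    """Assemble a subject line from separately stored pieces."""
    return f"Your {stored_name} receipt [#{receipt_id}]"


# The wrapping happens once, at storage time; every later use inherits it.
stored = wrap_with_direction("Aaaaaaa Ltd")
subject = build_subject(stored, "nnnn-nnnn")
print(ascii(subject))
```

Printing with `ascii()` makes the otherwise invisible controls show up as escapes, which is roughly what the original poster must have noticed in the raw message.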
Best, Arthur From unicode at unicode.org Wed Feb 6 15:01:59 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Wed, 6 Feb 2019 22:01:59 +0100 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: <83womete38.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> Message-ID: Hi Eli, (I'm getting lost where to reply, and how the subject gets mangled and the thread split into different ones.) I've thought about it a lot, experimented with Emacs's behavior, and I've arrived at the conclusion that we are actually much closer to each other than I had thought. Probably there's a lot of misunderstanding due to different terminology we used. I've set my terminal to RTL paragraph direction (via the relevant escape sequence), then did a "cat TUTORIAL.he" (the file taken from 26.1), and compared to what I see in Emacs 25.2.2 - both the graphical one, and the one running in a terminal of no BiDi. Apart from a few minor irrelevant differences, they look the same! Hooray!!! (The differences are: - I had to slightly modify TUTORIAL.he to make sure none of the lines start with a BiDi control (I added a preceding character) because currently VTE doesn't support them, there's no character cell to store this data. This definitely needs to be fixed in the second version of my proposal. - Emacs running in a terminal shows an underscore wherever there's a BiDi control in the source file - while the graphical one doesn't. This looks like a simple bug to me, right? - Line 1007, the copyright line of this file uses visual indentation, and Emacs detects LTR paragraph for that line.
I think it should rather use BiDi controls to have an overall RTL paragraph direction detected, and within that BiDi controls to force LTR for the text. The terminal shows it with RTL direction, as I manually set it. Again, all these three details are irrelevant to my point, namely that in WIP gnome-terminal it looks the same as in Emacs.) You define paragraphs as emptyline-separated blocks on which you perform autodetection of the paragraph direction. This is great! As I've mentioned, I'd love to have such a mode in terminals, but it's subject to underlying improvements, like knowing when a prompt starts and ends, because prompts also have to be paragraph delimiters. You convinced me that it's much more important than I thought, thanks a lot for that! I will try to see if I can push for addressing the prerequisite issues sooner. Indeed I had to manually set RTL paragraph direction; with manual LTR or with per-line autodetection (as VTE can do now) the result would be much worse. Here's how the story continues from here. Here is where we misunderstood each other (or at the very least I misunderstood you), although we are talking about the same, doing things the same way: The BiDi algorithm takes a paragraph of text at a time, and somehow reshuffles its letters. UAX#9 section 3 starts by saying that the first main phase is separation into "paragraphs". What are those "paragraphs" that we're talking about _now_? The thing is, both in Emacs as well as in my specification, it's a logical line of the text (that is: delimited by single newlines). No, in these steps, when UBA is run, the paragraph is no longer defined as emptyline-delimited segments, it's defined as lines of the text. To recap: The _paragraph direction_ is determined in Emacs for emptyline-delimited segments of data, which I honestly find a great thing, and would love to do in terminals too, alas at this point it's blocked by some really nontrivial technical issues.
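The difference between detecting the direction per emptyline-delimited block and per line can be sketched in a few lines of Python. This is only an illustration, not Emacs's or VTE's actual code: it uses a simplified first-strong-character rule in the spirit of UAX#9 P2/P3 (it does not skip isolate sequences), and the helper names `first_strong_direction`, `block_directions` and `line_directions` are made up for the example.

```python
import re
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Simplified UAX#9 P2/P3: the first strong character decides."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default  # no strong character found


def block_directions(text):
    """One decision per block, where blocks are separated by empty lines."""
    blocks = [b for b in re.split(r"\n[ \t]*\n", text) if b.strip()]
    return [first_strong_direction(b) for b in blocks]


def line_directions(text):
    """One decision per non-empty line (each line is its own paragraph)."""
    return [first_strong_direction(l) for l in text.split("\n") if l.strip()]


# Two blocks; the first starts with Latin text and wraps into a Hebrew line.
sample = "abc \u05d0\u05d1\u05d2\n\u05d3\u05d4 def\n\n\u05d0\u05d1 xyz\n"
print(block_directions(sample))  # ['LTR', 'RTL']
print(line_directions(sample))   # ['LTR', 'RTL', 'RTL']
```

With per-line detection the second line of the first block flips to RTL merely because the wrapped text happens to start with a Hebrew word there; with per-block detection both lines share one direction.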
But once you have decided on a direction, each _line_ within that data is passed separately to the BiDi algorithm to get reshuffled; this is what Emacs does, this is what my specification says, and this is the right thing. That is, for this step, the definition of "paragraph", as the BiDi algorithm uses this term, is a line of the text file. This is where I thought we had a disagreement, but we don't, we just misunderstood each other. ----- On a nitpicking side note: It's damn ugly not to terminate a text file with a newline. Newline is much better thought of a "terminator" than a "delimiter". For example, if you do a "cat file1 file2", you expect file2 to start on its own line. Shouldn't this apply to paragraphs, too, especially when BiDi is in the game? I'd argue that an empty line (double newline) shouldn't be a delimiter, it should be a terminator for a paragraph. I think "cat file1 file2" should make sure that the last paragraph of file1 and the first paragraph of file2 are printed as separate paragraphs (potentially with different paragraph direction), shouldn't it? I'd argue that if a text file is formatted like TUTORIAL.he, with empty lines denoting paragraph boundaries, then it should also end in an empty line (that is: two newline characters). ----- Feel free to skip the rest :) Let's make a thought experiment. Let's assume that for running the BiDi algorithm, we'd still stick to the emptyline-delimited paragraph definition. This is not what you do, this is not what I do, but I misunderstood that this is what you did, and I also thought this was a good idea as a potential extension for the BiDi specs - I no longer think so. This definition is truly problematic, as I'll show below. The BiDi algorithm takes paragraphs of text, shuffles them, and somewhere in the middle, with cooperation with the caller, cuts into lines.
It doesn't say a single word about the input potentially being cut into lines, how it would handle them, how they would interfere with the line breaks that the caller of the algorithm decides to add etc. It makes sense: the BiDi algorithm converts a logical text into a visual one, whereas single newlines within a paragraph would already be visual elements, so the input string would be a mixture of the two worlds (which probably doesn't make any sense per se). Let's assume that the message I want to deliver, written in its logical order (left to right), is:

    abc DEFGHIJKLM NOPQ rstuvwxyz

For whatever reason (e.g. I'd prefer to keep a 15 column margin in the source file) it's split into two lines, that is, in the middle that's a newline rather than a space:

    abc<space>DEFGHIJKLM<newline>NOPQ<space>rstuvwxyz

A completely non-BiDi application would show the contents as

    abc DEFGHIJKLM
    NOPQ rstuvwxyz

If you run the BiDi algorithm on this unit as a whole paragraph, it would not handle newline any differently from a space. It sees one continuous run of RTL text consisting of two words with a newline in between, and reverses their order:

    abc<space>QPON<newline>MLKJIHGFED<space>rstuvwxyz

Which would show up like this in a proper BiDi-aware viewer:

    abc QPON
    MLKJIHGFED rstuvwxyz

I can see two significant problems with this. One is that because it can shuffle characters around the newline, it breaks the principle that the eyes never have to move upwards. The second is that the margin of 15 characters is no longer preserved. The visual character (newline) no longer serves the visual purpose it served in the logical order. Especially in terminals this could cause a whole bunch of troubles. E.g. when an application believes that printing some stuff moved the cursor down by 2 lines, it might have actually moved it by 3 (if the terminal's overall width is also 15-ish, in this example). It's unclear how cursor positions, mouse click positions (including on the "unused" area after the end of each line) could be mapped, and so on.
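The thought experiment can be replayed with a toy reordering function. The Python sketch below is emphatically not the real UBA: it borrows the usual convention that uppercase letters stand for RTL characters, assumes an LTR base direction, treats whitespace (including the newline, as the thought experiment supposes) as a neutral that joins two RTL words, and simply reverses each such run. The function name `toy_reorder` is made up for the example.

```python
import re

# Toy model: an RTL run starts and ends with an uppercase letter and may
# contain whitespace in between (neutrals surrounded by RTL text).
RTL_RUN = re.compile(r"[A-Z](?:[A-Z\s]*[A-Z])?")


def toy_reorder(text):
    """Reverse every RTL run, assuming an LTR base direction."""
    return RTL_RUN.sub(lambda m: m.group(0)[::-1], text)


logical = "abc DEFGHIJKLM\nNOPQ rstuvwxyz"

# Whole block handed over as one paragraph: the run spans the newline,
# so the two RTL words swap lines.
as_one_paragraph = toy_reorder(logical)
print(as_one_paragraph)

# Each line handed over separately: nothing ever crosses the newline.
per_line = "\n".join(toy_reorder(line) for line in logical.split("\n"))
print(per_line)

# Visual widths of the resulting lines, for comparison with the 15 column margin.
print([len(line) for line in as_one_paragraph.split("\n")])
print([len(line) for line in per_line.split("\n")])
```

In the whole-paragraph variant the second visual line grows to 20 columns and the first shrinks to 8, while the per-line variant keeps the original 14 and 14, which is the margin problem described above.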
It's such a complex area that I really wouldn't like to continue in this direction even if it was a correct one, which luckily it isn't. (I vaguely recall, from about a decade ago, that - presumably for reasons along these lines - browsers have a huge problem with "<br>" inside a paragraph when it comes to BiDi. I don't know where they stand now, I'll investigate if it's important, but I don't think it is.) Luckily both Emacs and my specification shuffle the contents separately within both lines (using LTR paragraph for both lines, as it's guessed from the union of them), resulting in the desired:

    abc MLKJIHGFED
    QPON rstuvwxyz

Does this all make much more sense now? :) cheers, egmont On Tue, Feb 5, 2019 at 5:09 PM Eli Zaretskii via Unicode wrote: > > > Date: Tue, 5 Feb 2019 00:05:47 +0000 > > From: Richard Wordingham via Unicode > > > > > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited > > > > by paragraph separator characters. This means characters whose bidi > > > > category is B, which includes Newline, the CR-LF pair on Windows, > > > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR. > > > > It actually gives two different definitions. Table UAX#9 4 restricts > > the type B to *appropriate newline functions; not all newlines are > > paragraph separators. > > For what exactly is "appropriate newline function" one should read the > Unicode Standard, section 5.8. My conclusions from that are different > from yours; see below. > > > > Indeed, this was an oversight on my side. So, with this definition, > > > every single newline character starts a new paragraph. The result of > > > printf "Hello\nWorld\n" > world.txt > > > is a text file consisting of two paragraphs, with 5 characters in > > > each. Correct? > > > > No, it depends on when a newline function is 'appropriate'. TUS 5.8 > > Rule R2b applies - 'In simple text editors, interpret any NLF the same > > as LS'. > > That's not all of what the Standard says. Just a couple of paragraphs > above Rule R2b, there's this text: > > Note that even if an implementer knows which characters represent > NLF on a particular platform, CR, LF, CRLF, and NEL should be > treated the same on input and in interpretation.
Only on output is > it necessary to distinguish between them. > > So in practice, IMO the above example does constitute 2 paragraphs, > regardless of the underlying platform's conventions. From unicode at unicode.org Wed Feb 6 15:29:36 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Wed, 6 Feb 2019 22:29:36 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: Hi Philippe, Thanks a lot for your input! Another fundamental difficulty with terminal emulators is: These controls (CR, LF...) are control instructions that move the cursor in some ways, and then are forgotten. You cannot do BiDi on the instructions the terminal receives. You can only do BiDi on the result, the contents of the canvas after these instructions are executed. Here these controls are either lost, or you have to give a specification how exactly they need to be remembered, i.e. converted to being part of the canvas's data. Let's also mention that trying to get apps into using them is quite hopeless. The best you can do is design BiDi around what you already have, which pretty much means hard vs. soft line endings, and hopefully forthcoming semantical marks around shell prompts. (To overcomplicate the story, a received LF doesn't convert the line ending to hard wrapped in most terminal emulators. In some it does. I don't think there's an exact specification anywhere. Maybe the BiDi spec needs to create one. 
Lines are hard wrapped by default, turned to soft wrapped when the text gets wrapped at the end of the line, and a few random control functions turn them back to hard one, but in most terminals, a newline is not such a control function.) Anyway, please also see my previous email; I hope that clarifies a lot for you, too. cheers, egmont On Tue, Feb 5, 2019 at 5:53 PM Philippe Verdy via Unicode wrote: > > I think that before making any decision we must make some decision about what we mean by "newlines". There are in fact 3 different functions: > - (1) soft line breaks (which are used to enforce a maximum display width between paragraph margins): these are equivalent to breakable and compressible whitespaces, and do not change the logical paragraph direction; they don't insert any additional vertical gap between lines, so the logical line-height is preserved and continues uninterrupted. If text justification applies, this whitespace will be entirely collapsed into the end margin, and any text before it will still be justified to match the end margin (until the maximum expansion of other whitespaces in the middle is reached, and the maximum intercharacter gap is also reached, in which case that line will no longer be expanded), but this does not apply to terminal emulators, which normally never use text justification, so the text will just be aligned to the start margin and whitespaces before it on the same line are preserved, and collapsed only at the end of the line (just before the soft line break itself). > - (2) hard line breaks: they break to a new line but continue the paragraph within its same logical direction, but they are not compressible whitespaces (and do not depend on the logical end margin of the paragraph). > - (3) paragraph breaks: generally they introduce an additional vertical gap with top and bottom margins. > > The problem in terminals is that they usually cannot distinguish types (1) and (2): they are simply encoded by a single CR, or LF, or CR+LF, or NEL. Type (1) exists only within the framework of a higher level protocol which gives additional interpretation to these "newlines". The special control LS is almost never used but may be used for type (1), i.e. soft line-breaks, and will fall back to type (2), which is represented by the legacy "simple" newlines (single CR, or single LF, or single CR+LF, or single NEL). I have seen very little or no use of the LS (line separator) special control. > > Type (3) may be encoded with PS (paragraph separator), but in terminals (and common protocols like MIME) it is usually encoded using a pair of newlines (CR+CR, or LF+LF, or CR+LF+CR+LF, or NL+NL) possibly with additional whitespaces (and additional presentation characters such as ">" in quotations inserted in mail responses) between them (needed for MIME and HTTP) which may be collapsed when rendering or interpreting them. > > Some terminal protocols can also use other legacy ASCII separators such as FS, GS, RS, US for grouping units containing multiple paragraphs, or STX/EOT pairs for encapsulating whole text documents in a protocol-specific envelope format (and will also use some escaping mechanism for special controls found in the middle, such as DLE+control to escape the control, or DLE+0 to escape a NUL, or DLE+# to escape a DEL, or DEL+x+NN where N are a fixed number of hexadecimal, decimal or octal digits). There's a wide variety of escaping mechanisms used by various higher-layer protocols (including transport protocols or encoding syntaxes used just below the plain-text layer, in a lower layer than the transport protocol layer). > > On Mon, 4 Feb 2019 at 21:46, Eli Zaretskii via Unicode wrote: >> >> > Date: Mon, 4 Feb 2019 19:45:13 +0000 >> > From: Richard Wordingham via Unicode >> > >> > Yes. If one has a text composed of LTR and RTL paragraphs, one has to >> > choose how far apart their starting margins are. I think that could >> > get complicated for plain text if the terminal has unbounded width. >> >> But no real-life terminal does. The width is always bounded. From unicode at unicode.org Wed Feb 6 15:45:47 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Wed, 6 Feb 2019 22:45:47 +0100 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> Message-ID: Hi, I was loose with my terminology once again, which is not a wise thing when you're trying to clarify misunderstandings :) > But once you have > decided on a direction, each _line_ within that data is passed > separately to the BiDi algorithm to get reshuffled; this is what Emacs > does, this is what my specification says, and this is the right thing. > That is, for this step, the definition of "paragraph", as the BiDi > algorithm uses this term, is a line of the text file. I keep thinking of the BiDi algorithm as one that takes a single paragraph, because that's how I use it in VTE. But in fact, the BiDi algorithm starts by splitting into paragraphs. I keep forgetting about this outermost "for loop" of the BiDi algo. And with that proper definition, you can of course pass the entire emptyline-delimited segment into the BiDi algorithm in a single step.
In its first phase, the BiDi algorithm will split it at newlines, because for the BiDi algorithm (but not when detecting the paragraph direction in Emacs), newline is the paragraph delimiter. Then it will execute the rest of the algorithm for each paragraph (that is: line) separately. This is exactly the same as splitting manually, and then for each line invoking the BiDi algorithm. cheers, egmont From unicode at unicode.org Wed Feb 6 17:32:43 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 6 Feb 2019 23:32:43 +0000 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: References: <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> Message-ID: <20190206233243.7ebaafc1@JRWUBU2> On Wed, 6 Feb 2019 22:01:59 +0100 Egmont Koblinger via Unicode wrote: > Hi Eli, > > (I'm getting lost where to reply, and how the subject gets mangled and > the thread split into different ones.) > > > I've thought about it a lot, experimented with Emacs's behavior, and > I've arrived at the conclusion that we are actually much closer to > each other than I had thought. Probably there's a lot of > misunderstanding due to different terminology we used. > > I've set my terminal to RTL paragraph direction (via the relevant > escape sequence), then did a "cat TUTORIAL.he" (the file taken from > 26.1), and compared to what I see in Emacs 25.2.2 ? both the graphical > one, and the one running in a terminal of no BiDi. > > Apart from a few minor irrelevant differences, they look the same! > Hooray!!! 
> > (The differences are: > > - I had to slightly modify TUTORIAL.he to make sure none of the lines > start with a BiDi control (I added a preceding character) because > currently VTE doesn't support them, there's no character cell to store > this data. This definitely needs to be fixed in the second version of > my proposal. > > - Emacs running in a terminal shows an underscore wherever there's a > BiDi control in the source file ? while the graphical one doesn't. > This looks like a simple bug to me, right? > > - Line 1007, the copyright line of this file uses visual indentation, > and Emacs detects LTR paragraph for that line. I think it should > rather use BiDi controls to have an overall RTL paragraph direction > detected, and within that BiDi controls to force LTR for the text. The > terminal shows it with RTL direction, as I manually set it. > > Again, all these three details are irrelevant to my point, namely that > in WIP gnome-terminal it looks the same as in Emacs.) > > > You define paragraphs as emptyline-separated blocks on which you > perform autodetection of the paragraph direction. This is great! As > I've mentioned, I'd love to have such a mode in terminals, but it's > subject to underlying improvements, like knowing when a prompt starts > and ends, because prompts also have to be paragraph delimiters. Not necessarily. One could allow the first strong character in the prompt to determine the paragraph directions. That's what the Emacs terminal (invoked by M-x term; top level definition in term.el) does. > On a nitpicking side note: > > It's damn ugly not to terminate a text file with a newline. Newline is > much better thought of a "terminator" than a "delimiter". For example, > if you do a "cat file1 file2", you expect file2 to start on its own > line. Not necessarily. One might use cat to glue together files that had split into 1400k chunks, in which case it is not even reasonable to expect the end of file to be at a character boundary. 
(Yes, floppy disks still have their uses.) > Shouldn't this apply to paragraphs, too, especially when BiDi is in > the game? I'd argue that an empty line (double newline) shouldn't be a > delimiter, it should be a terminator for a paragraph. But the white space between paragraphs is a separator, not a terminator. One doesn't require it at the end when formatting paragraphs within the cell of a table. Richard. From unicode at unicode.org Wed Feb 6 17:45:55 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Thu, 7 Feb 2019 00:45:55 +0100 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: <20190206233243.7ebaafc1@JRWUBU2> References: <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> Message-ID: Hi Richard, > Not necessarily. One could allow the first strong character in the > prompt to determine the paragraph directions How does Emacs know what's a prompt? How can it tell it from the previous and next command's output? Whatever it does to know where the prompt is, can it be made into a standard, cross-terminal feature? > That's what the Emacs > terminal (invoked by M-x term; top level definition in term.el) does. I tried it. Executed my default shell, and inside that, a "cat TUTORIAL.he". All the paragraphs are rendered as LTR ones, left-aligned. Not the way the file is opened in Emacs. If you claim Emacs's built-in terminal emulator supports BiDi, I'm kindly asking you to present a documentation of its behavior, in similar spirit to my BiDi proposal. > Not necessarily. 
One might use cat to glue together files that had > split into 1400k chunks, in which case it is not even reasonable to > expect the end of file to be at a character boundary. (Yes, floppy > disks still have their uses.) I did not say anything about changing cat's behavior. I recommended to change the convention for such paragraph-oriented text files to end with two newlines. > But the white space between paragraphs is a separator, not a > terminator. One doesn't require it at the end when formatting > paragraphs within the cell of a table. Does this logic also apply to single newline characters? If not, why not, what's the conceptual difference? If it does, why do text files end in a newline? e. From unicode at unicode.org Wed Feb 6 20:50:27 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 7 Feb 2019 03:50:27 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: I read your email, you spoke for example about how a typical Unix/Linux tool shows its usage option (e.g. "anycommand --help") with a leading line then syntaxes and tabulated lists of options followed by translated help on the same line. There's some rules for correct display including with Bidi: - Separate paragraphs that need a different default Bidi by double newlines (to force a hard break) - use a single newline on continuation - if technical items are untranslatable, make sure they are at the begining of lines and indented by some leading spaces, before translated ones. 
- Avoid breaking lists.
- Try to separate text in natural languages from technical text as much as possible.
- Be careful about correct usage of leading punctuation (notably for list items).
- Be consistent about indentation.
- Normalize spaces.
- Don't assume that TAB controls have the same width (ban TABs except at the beginning of lines).
- In column output, always separate columns with at least two spaces; don't glue them as if they were sentences.
- Don't use "soft line breaks" in the middle of short lines (less than 72 base characters).
- Don't use any Bidi control!
With some care, you can perfectly translate Linux/Unix tools into languages needing Bidi and get consistent output, but be careful if your text contains placeholders or untranslated technical terms (make sure to surround them with paired punctuation, or don't translate them at all). And avoid paragraphs that would mix natural language and untranslatable technical terms (such as command names or command-line options). Make sure to test the output so that it will also work with variable-width fonts (don't assume monospaced fonts are used: they do not exist for various scripts and don't work reliably for Arabic and most Asian scripts, and not even for Chinese or Japanese even if these don't need Bidi support).
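As an illustration only, two of the checklist items above (no TABs except at the start of a line, no Bidi controls) could be checked mechanically. The following is a minimal sketch, not an existing tool: the function name `lint_help_text` and the message strings are made up for this example, and the set of characters treated as "Bidi controls" is an assumption taken from the directional formatting characters listed in UAX #9.

```python
# Directional formatting characters from UAX #9 (assumed to be what the
# checklist means by "Bidi controls"): ALM, LRM, RLM, LRE, RLE, PDF, LRO,
# RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = set(
    "\u061c\u200e\u200f"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)


def lint_help_text(text):
    """Return a list of (line_number, message) for two checklist rules.

    Hypothetical helper: it only flags TABs that appear after the leading
    indentation and any Bidi control character; the other checklist items
    are not checked.
    """
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        # Leading TABs/spaces are allowed, so strip them before looking.
        body = line.lstrip("\t ")
        if "\t" in body:
            problems.append((number, "TAB after the start of the line"))
        if any(ch in BIDI_CONTROLS for ch in line):
            problems.append((number, "Bidi control character"))
    return problems


sample = (
    "Usage: tool [options]\n"
    "  -h\tshow help\n"                      # interior TAB: flagged
    "\t--all  \u200f\u05d4\u05db\u05dc"      # leading TAB is fine, RLM is flagged
)
for number, message in lint_help_text(sample):
    print(number, message)
```

Whether banning Bidi controls outright is a good rule is disputed later in this thread; the sketch only shows that the rule, as stated, is easy to test for.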
But the difficulty is not really in the terminal emulators but in the source texts given to translators, when they don't know the context in which the text will be used and have no hint about which terms should not be translated, so the results can become inconsistent. There are many examples, even in Windows 10, where some of the command-line tools are completely unusable with the translated UI, with syntax examples that do not even work because some terms were randomly and inconsistently translated or confused, or because tools assumed an LTR-only layout of the output and monospaced fonts with one character per display cell, or required specific fonts whose monospaced variants do not contain the characters. This is notably challenging for Asian scripts needing complex clusters if you made these Latin-based assumptions. Le mer. 6 févr. 2019 à 22:30, Egmont Koblinger a écrit : > Hi Philippe, > > Thanks a lot for your input! > > Another fundamental difficulty with terminal emulators is: These > controls (CR, LF...) are control instructions that move the cursor in > some ways, and then are forgotten. You cannot do BiDi on the > instructions the terminal receives. You can only do BiDi on the > result, the contents of the canvas after these instructions are > executed. Here these controls are either lost, or you have to give a > specification how exactly they need to be remembered, i.e. converted > to being part of the canvas's data. > > Let's also mention that trying to get apps into using them is quite > hopeless. The best you can do is design BiDi around what you already > have, which pretty much means hard vs. soft line endings, and > hopefully forthcoming semantical marks around shell prompts. (To > overcomplicate the story, a received LF doesn't convert the line > ending to hard wrapped in most terminal emulators. In some it does. I > don't think there's an exact specification anywhere. Maybe the BiDi > spec needs to create one.
Lines are hard wrapped by default, turned to > soft wrapped when the text gets wrapped at the end of the line, and a > few random control functions turn them back to hard one, but in most > terminals, a newline is not such a control function.) > > Anyway, please also see my previous email; I hope that clarifies a lot > for you, too. > > > cheers, > egmont > > On Tue, Feb 5, 2019 at 5:53 PM Philippe Verdy via Unicode > wrote: > > > > I think that before making any decision we must make some decision about > what we mean by "newlines". There are in fact 3 different functions: > > - (1) soft line breaks (which are used to enforce a maximum display > width between paragraph margins): these are equivalent to breakable and > compressible whitespaces, and do not change the logical paragraph > direction, they don't insert any additionnal vertical gap between lines, so > the logicial line-height is preserved and continues uninterrupted. If text > justification applies, this whitespace will be entirely collapsed into the > end margin, and any text before it will stilol be justified to match the > end margin (until the maximum expansion of other whitespaces in the middle > is reached, and the maximum intercharacter gap is also reached (in which > case, that line will not longer be expanded more), but this does not apply > to terminal emulators that noramlly never use text justification, so the > text will just be aligned to the start margin and whitespaces before it on > the same line are preserved, and collapsed only at end of the line (just > before the soft line break itself) > > - (2) hard line breaks: they break to a new line but continue the > paragraph within its same logical direction, but they are not compressible > whitespaces (and do not depend on the logical end margin of the paragraph. 
> > - (3) paragraph breaks: generally they introduce an addition vertical > gap with top and bottom margins > > > > The problem in terminals is that they usually cannot distinguish types > (1) and (2), they are simply encoded by a single CR, or LF, or CR+LF, or > NEL. Type (1) is only existing within the framework of a higher level > protocol which gives additional interpretation to these "newlines". The > special control LS is almost never used but may be used for type (1) i.e. > soft line-breaks, and will fallback to type (2) which is represented by the > legacy "simple" newlines (single CR, or single LF, or single CR+LF, or > single NEL). I have seen very little or no use of the LS (line separator) > special control. > > > > Type (3) may be encoded with PS (paragraph separator), but in terminals > (and common protocols line MIME) it is usually encoded using a couple of > newline (CR+CR, or LF+LF, or CR+LF+CR+LF, or NL+NL) possibly with > additional whitespaces (and additional presentation characters such as ">" > in quotations inserted in mail responses) between them (needed for MIME and > HTTP) which may be collapsed when rendering or interpreting them. > > > > Some terminal protocols can also use other legacy ASCII separators such > as FS, GS, RS, US for grouping units containing multiple paragraphs, or > STX/EOT pairs for encapsulating whole text documents in an > protocol-specific enveloppe format (and will also use some escaping > mechanism for special controls found in the middle, such as DLE+control to > escape the control, or DLE+0 to escape a NUL, or DLE+# to escape a DEL, or > DEL+x+NN where N are a fixed number of hexadecimal, decimal or octal > digits. There's a wide variety of escaping mechanisms used by various > higher-layer protocols (including transport protocols or encoding syntaxes > used just below the plain-text layer, in a lower layer than the transport > protocol layer). > > > > Le lun. 4 f?vr. 2019 ? 
21:46, Eli Zaretskii via Unicode < > unicode at unicode.org> a écrit : > >> > >> > Date: Mon, 4 Feb 2019 19:45:13 +0000 > >> > From: Richard Wordingham via Unicode > >> > > >> > Yes. If one has a text composed of LTR and RTL paragraphs, one has to > >> > choose how far apart their starting margins are. I think that could > >> > get complicated for plain text if the terminal has unbounded width. > >> > >> But no real-life terminal does. The width is always bounded. > From unicode at unicode.org Thu Feb 7 02:17:23 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 7 Feb 2019 08:17:23 +0000 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: References: <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> Message-ID: <20190207081723.42b6aa60@JRWUBU2> On Thu, 7 Feb 2019 00:45:55 +0100 Egmont Koblinger via Unicode wrote: > Hi Richard, > > > Not necessarily. One could allow the first strong character in the > > prompt to determine the paragraph directions > > How does Emacs know what's a prompt? How can it tell it from the > previous and next command's output? I don't believe the Emacs terminal does either. What's special about the prompt is that it starts a line, so most paragraphs start with a prompt. Not all prompts contain a strong character. To let a file's contents control directionality, instead of issuing the command 'cat file1' one would have to issue a shell command '(echo; cat file1)' or similar to terminate the paragraph containing the prompt. The 'echo' inserts an empty line.
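The effect described above can be sketched in a few lines of Python. This is a simplified illustration of first-strong detection (rules P2 and P3 of UAX #9) applied to empty-line-delimited paragraphs; it is not a description of what Emacs or any terminal emulator actually does. The function names and the sample prompt `user@host:~$` are invented for the example, and the input is assumed to be the final text on screen rather than a raw stream of terminal control functions.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Return "ltr" or "rtl" from the first strong character (UAX #9 P2/P3).

    Characters between an isolate initiator (LRI, RLI, FSI) and its
    matching PDI are skipped, as rule P2 requires.
    """
    depth = 0  # nesting depth of isolates currently open
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("LRI", "RLI", "FSI"):
            depth += 1
        elif bidi_class == "PDI":
            if depth > 0:
                depth -= 1
        elif depth == 0:
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return default  # no strong character found


def paragraph_directions(screen_text):
    """Detect a direction for each empty-line-delimited paragraph."""
    return [first_strong_direction(p) for p in screen_text.split("\n\n")]


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word standing in for file1

# 'cat file1': the prompt's Latin letters decide for the whole paragraph.
print(paragraph_directions("user@host:~$ cat file1\n" + hebrew))
# → ['ltr']

# '(echo; cat file1)': the empty line starts a new paragraph for the file.
print(paragraph_directions("user@host:~$ (echo; cat file1)\n\n" + hebrew))
# → ['ltr', 'rtl']
```

A prompt such as a bare `$ ` contains no strong character, so in that case the first strong character of the command or of its output would decide instead.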
> > That's what the Emacs > > terminal (invoked by M-x term; top level definition in term.el) > > does. > > I tried it. Executed my default shell, and inside that, a "cat > TUTORIAL.he". All the paragraphs are rendered as LTR ones, > left-aligned. Not the way the file is opened in Emacs. See above. I don't know what your shell is. > If you claim Emacs's built-in terminal emulator supports BiDi, I'm > kindly asking you to present a documentation of its behavior, in > similar spirit to my BiDi proposal. I've a feeling it has emergent behaviour, and may require a lot of experimentation to elucidate. > Does this logic also apply to single newline characters? If not, why > not, what's the conceptual difference? If it does, why do text files > end in a newline? I don't like the convention that removing the newline from the end of a non-empty line changes it into a binary file. The short answer is that some editors allow a text file not to have a final newline; such files are not handled well in the Unix environment. Some things are just untidy messes. Compare C, where a semicolon *terminates* statements, but some are terminated by '}', and a semicolon *separates* the expression within the control part of a for statement, and a comma *separates* the constant definitions in an enum declaration - for a long time, a trailing comma inside the braces was illegal. Richard.
From unicode at unicode.org Thu Feb 7 06:29:09 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Thu, 7 Feb 2019 13:29:09 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: Hi Philippe, > There's some rules for correct display including with Bidi: In what sense are these "rules"? Where are these written, in what kind of specification or existing practice? > - Separate paragraphs that need a different default Bidi by double newlines (to force a hard break) There is currently no terminal emulator I'm aware of that uses empty lines as boundaries of BiDi treatment. While my recommendation uses a one smaller unit (logical lines), and I understand as per Eli's request that it would be desireable to go with emptyline-delimited boundaries, what in fact all the current self-proclaimed BiDi-aware terminal emulators that I came across do is use a unit two steps smaller than yours: they do BiDi on physical lines of the terminal, no matter how a logical line of the output had to wrap into physical ones because didn't fit in the width. (It's a terrible behavior.) The current behavior of terminal emulators is very far from what you describe. > - use a single newline on continuation Continuation of what exactly? But let's take a step back: Should the output be pre-formatted by some means, or do we rely on the terminal emulator wrapping overlong lines? (If pre-formatted then for what width? 80 columns, so that I waste precious real estate if my terminal is wider? 
Or is it a requirement for any app that produces output to implement a decent dynamic wrapping engine for nice formatting according to the actual width?) There's precedence for both of these different approaches. I don't think it's feasible to pick one, and claim that the other approach is discouraged/invalid/whatever. > - if technical items are untranslatable, make sure they are at the begining of lines and indented by some leading spaces, before translated ones. I firmly disagree. There shouldn't be any restriction on how a translator wishes to translate a sentence. The computer world has to adapt to the requirements of human languages, not the other way around! > - Don't use any Bidi control ! Why not? They do exist for a reason, for the very reason that any logical translation, which a translator might want to write (see my previous point) is presentable in a visually correct way. Use them for that, whenever needed. cheers, egmont From unicode at unicode.org Thu Feb 7 08:14:40 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 07 Feb 2019 16:14:40 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Egmont Koblinger on Wed, 6 Feb 2019 22:01:59 +0100) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> Message-ID: <83d0o3sn1r.fsf@gnu.org> > From: Egmont Koblinger > Date: Wed, 6 Feb 2019 22:01:59 +0100 > Cc: Richard Wordingham , unicode at unicode.org > > - Emacs running in a terminal shows an underscore wherever there's a > BiDi control in the source file ? while the graphical one doesn't. > This looks like a simple bug to me, right? Not a bug, a feature. 
Emacs doesn't remove the bidi controls from display (that's another deviation allowed by the UBA, see section 5.2). On GUI displays, these controls are displayed as thin 1-pixel spaces, but on text-mode terminals they are shown as space. The underscore you see is a special typeface used to indicate that this is not really a space. (This is the default; Emacs being Emacs, it allows to customize how these characters are displayed, and in particular not to display them at all.) > - Line 1007, the copyright line of this file uses visual indentation, > and Emacs detects LTR paragraph for that line. I think it should > rather use BiDi controls to have an overall RTL paragraph direction > detected, and within that BiDi controls to force LTR for the text. Why? As I said, the tutorial was written in part to demonstrate the UBA implementation, including the dynamic detection of base paragraph direction, and this is exactly one example of how it works in practice. > To recap: The _paragraph direction_ is determined in Emacs for > emptyline-delimited segments of data, which I honestly find a great > thing, and would love to do in terminals too, alas at this point it's > blocked by some really nontrivial technical issues. But once you have > decided on a direction, each _line_ within that data is passed > separately to the BiDi algorithm to get reshuffled Yes and no. You could keep your mental model if you like, but actually the UBA explicitly says that each line is to be reordered for display separately, see section 3.4 of UAX#9. > Let's make a thought experiment. Let's assume that for running the > BiDi algorithm, we'd still stick to the emptyline-delimited paragraph > definition. This is not what you do, this is not what I do, but I > misunderstood that this is what you did, and I also thought this was a > good idea as a potential extension for the BiDi specs ? I no longer > think so. This definition is truly problematic, as I'll show below. 
Which is why it is not what the UBA says one should do. From unicode at unicode.org Thu Feb 7 08:18:10 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 07 Feb 2019 16:18:10 +0200 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: <20190206233243.7ebaafc1@JRWUBU2> (message from Richard Wordingham via Unicode on Wed, 6 Feb 2019 23:32:43 +0000) References: <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> Message-ID: <83bm3nsmvx.fsf@gnu.org> > Date: Wed, 6 Feb 2019 23:32:43 +0000 > From: Richard Wordingham via Unicode > > > You define paragraphs as emptyline-separated blocks on which you > > perform autodetection of the paragraph direction. This is great! As > > I've mentioned, I'd love to have such a mode in terminals, but it's > > subject to underlying improvements, like knowing when a prompt starts > > and ends, because prompts also have to be paragraph delimiters. > > Not necessarily. One could allow the first strong character in the > prompt to determine the paragraph directions. That's what the Emacs > terminal (invoked by M-x term; top level definition in term.el) does. Emacs's built-in terminal emulator does that only because no one bothered to do something about this behavior. I personally don't consider this the correct behavior (but then I don't use M-x term in Emacs except for testing). Emacs does know where the prompt is, so it could implement the rule that whatever follows the prompt starts a new paragraph. 
From unicode at unicode.org Thu Feb 7 08:20:56 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 7 Feb 2019 15:20:56 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: Le jeu. 7 févr. 2019 à 13:29, Egmont Koblinger a écrit : > Hi Philippe, > > > There's some rules for correct display including with Bidi: > > In what sense are these "rules"? Where are these written, in what kind > of specification or existing practice? > "Rules" are not formally written; they are just a sense of best practices. Bidi plays very badly on terminals, even enhanced terminals like VT-* or ANSI that expose capabilities, because most of the time these capabilities are not even accessible: further modifications of the terminal properties (notably its display size) can never be taken into account, since by then it is too late: the output has already been generated, and all the terminal can do is play with what is in its history buffers. Even on dual-channel protocols (input and output), terminal protocols do not synchronize the input and the output, and these asynchronous channels ignore the transmission time between the terminal and the aware application, so the terminal protocol must include a function that allows flushing and redrawing the screen completely (but this requires long delays).
With a common 9.6kbps serial link, refreshing a typical 80x25 screen takes about two seconds (2,000 characters at roughly 960 characters per second), which is much longer than typical user input, so full screen refresh does not work for data input and editing, and terminals themselves implement the echo of user input, ignoring how and when the receiving application will handle the input, and also ignoring whether the application is already sending output to the terminal. It's hard or impossible to synchronize this, and local echo on the terminal causes havoc. I've not seen any way for a terminal to handle all these constraints. So the only way for them is to support only basic plain-text documents, formatted reasonably, with layout "hints" inserted in the format of their output so that the terminal can perform reasonable guesses and adapt. But the concept of "line" or "paragraph" in terminal protocols is extremely fuzzy. It's then very difficult to take into account the additional Bidi constraints, as it's impossible to reconcile BOTH the logical ordering (what is encoded in the transmitted data or kept in history buffers) and the visual ordering. That's why there are terminal protocols that absolutely don't want to play with the logical ordering and require all their data to be transmitted in visual order (in which case, there's no bidi handling at all). Then terminals will attempt to reconcile the visual line delimitations (in the transmitted data) with the local-only capabilities of the rendered frame. Many terminals will also not allow changing the display width, will not allow changing the display cell size, will force constraints on cell sizes and fonts, and then won't be able to correctly output many Asian scripts. In fact most terminal protocols are very defective and were never designed to handle Bidi input, or Asian scripts with complex clusters and the variable-width fonts that are needed for them (even CJK scripts, which use a mix of "half-width" and "full-width" characters).
> - Separate paragraphs that need a different default Bidi by double > newlines (to force a hard break) > > There is currently no terminal emulator I'm aware of that uses empty > lines as boundaries of BiDi treatment. > These are hints in the absence of something else, and they play a role when the terminal display width is unpredictable for the application producing the output, which has no access to any return input channel. Take the example of terminal emulators in resizable windows: the display width is undefined, there is no document level and no buffering, scrolling text will partially flush the output, and history is limited. A terminal emulator then needs hints about where paragraphs are delimited, and most often has no other distinction available, even in its limited history, that would allow distinguishing the 3 main kinds of line breaks. > While my recommendation uses a one smaller unit (logical lines), and I > And here your unit (logical lines) is not even defined in the terminal protocol and not known to the emitting application, which has no information about the properties of the final output terminal. So the terminal must perform guesses. As it can insert additional line breaks itself, and scroll out some portion of the text, there's no way to delimit the effect of "bidi controls". The basic requirement for correctly handling bidi controls is to make sure that paragraph delimitations are known and stable. If additional breaks can occur anywhere on what you think is a "logical line" but which differs from that of the emitting application (or of a static text document which is output "as is" without any reformatting), these bidi controls just make things worse and it becomes impossible to make reasonable guesses about paragraph delimitations in the terminal.
The result becomes unpredictable and most often will not even make any sense, as the terminal always uses visual ordering but loses track of the logical ordering (and things get worse when there are complex clusters or characters that cannot even fit in a monospaced grid). > The current behavior of terminal emulators is very far from what you > describe. > Terminal emulators only perform guesses; most of these guesses are valid only with "simple" scripts with one character per cell, assuming a minimum resolution of each cell (the minimum is an 8x8 pixel square, too small for Asian scripts, but typical for rendering on old analog TVs; the typical one is a half-width rectangle, not really much larger, but about 50% taller, and many Asian scripts still do not fit well in it). These protocols were just made for Latin and similar simpler scripts (Cyrillic, Greek, and simple Japanese scripts, or Hangul jamos ignoring clusters and presented only with halfwidth characters, ignoring all complex clusters). For everything else, there's no defined behavior, no support, no reference documentation, everything is untested, you get extremely variable results, and the output could be completely garbled and unreadable. The situation is then worse for interactive applications (notably full screen text editors, including vi(m) and emacs) using these terminal protocols over slow unsynchronized dual links. If you want to play well with most terminals you have to limit a lot what you can do with "terminal protocols" and strictly limit your use of controls. In fact the only "stable" thing which works more or less is the basic MIME plain text profile, which just uses a single encoding for ALL kinds of newlines (and completely ignores the distinction between the 3 main kinds of line breaks).
That's where you need to insert hints: basically the encoded text has to assume a minimum display width, and any "line" longer than about 70 character cells is assumed to be continued on the next line, unless that next line is empty, and Bidi controls are not used at all; instead the direction is guessed from character properties at "reasonable" paragraph boundaries determined heuristically by the terminal emulator but not encodable in the data stream itself. > > - use a single newline on continuation > > Continuation of what exactly? > Continuation of paragraphs on the next visual line. I think this did not require any more precision; it was clear enough from the context from which you extracted this word, unless you did not read anything. From unicode at unicode.org Thu Feb 7 08:27:08 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 07 Feb 2019 16:27:08 +0200 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: (message from Egmont Koblinger via Unicode on Thu, 7 Feb 2019 00:45:55 +0100) References: <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> Message-ID: <83a7j7smgz.fsf@gnu.org> > Date: Thu, 7 Feb 2019 00:45:55 +0100 > Cc: unicode Unicode Discussion > From: Egmont Koblinger via Unicode > > > Not necessarily. One could allow the first strong character in the > > prompt to determine the paragraph directions > > How does Emacs know what's a prompt? How can it tell it from the > previous and next command's output? It uses a regular expression, see term-prompt-regexp. > Whatever it does to know where the prompt is, can it be made into a > standard, cross-terminal feature?
Not sure. It's a kind of heuristic, which is why the regexp is customizable on user level, so that users could adapt it to their needs, should that be necessary. > > That's what the Emacs > > terminal (invoked by M-x term; top level definition in term.el) does. > > I tried it. Executed my default shell, and inside that, a "cat > TUTORIAL.he". All the paragraphs are rendered as LTR ones, > left-aligned. Not the way the file is opened in Emacs. In what version of Emacs is that? In the latest version 26 I have here, the tutorial displays with most paragraphs in RTL direction. > If you claim Emacs's built-in terminal emulator supports BiDi, I'm > kindly asking you to present a documentation of its behavior, in > similar spirit to my BiDi proposal. The Emacs terminal emulator displays text as any other text in any other Emacs buffer, so it supports the same bidi reordering as elsewhere. You could make it emulate other terminals by setting the variable bidi-paragraph-direction to either left-to-right or right-to-left, then all the paragraphs will have the base direction you specify. But the default value of this variable in term buffers is nil, which invokes dynamic determination of base paragraph direction. From unicode at unicode.org Thu Feb 7 08:52:41 2019 From: unicode at unicode.org (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?= via Unicode) Date: Thu, 7 Feb 2019 15:52:41 +0100 Subject: Two more ellispis-type interpunctations: ?.. and !.. Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 7 09:33:57 2019 From: unicode at unicode.org (Serik Serikbay via Unicode) Date: Thu, 7 Feb 2019 21:33:57 +0600 Subject: Two more ellispis-type interpunctations: ?.. and !.. In-Reply-To: References: Message-ID: Khakass language is much close to Kyrgyz .. 
On Thu, Feb 7, 2019 at 8:54 PM "Jörg Knappen" via Unicode < unicode at unicode.org> wrote: > While working on a corpus of Kyrgyz language, a Turkic language written in > the Cyrillic script, > I encountered two ellipsis-type interpunctations, namely ?.. and !.. > > Note that this is not (yet) a proposal to encode them as single Unicode > characters although I would definitely > use such characters when available because they make the text processing > tool chain much simpler and more > robust. It is a survey question: > > Have you encountered ?.. or !.. in other languages than Kyrgyz? > > --Jörg Knappen From unicode at unicode.org Thu Feb 7 11:12:37 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Thu, 7 Feb 2019 18:12:37 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83d0o3sn1r.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <83d0o3sn1r.fsf@gnu.org> Message-ID: On Thu, Feb 7, 2019 at 3:14 PM Eli Zaretskii wrote: > Not a bug, a feature. Emacs doesn't remove the bidi controls from > display (that's another deviation allowed by the UBA, see section > 5.2). On GUI displays, these controls are displayed as thin 1-pixel > spaces, but on text-mode terminals they are shown as space. Thanks for the clarification! > Why? As I said, the tutorial was written in part to demonstrate the > UBA implementation, including the dynamic detection of base paragraph > direction, and this is exactly one example of how it works in > practice. Fair enough, then.
> > To recap: The _paragraph direction_ is determined in Emacs for > > emptyline-delimited segments of data, which I honestly find a great > > thing, and would love to do in terminals too, alas at this point it's > > blocked by some really nontrivial technical issues. But once you have > > decided on a direction, each _line_ within that data is passed > > separately to the BiDi algorithm to get reshuffled > > Yes and no. You could keep your mental model if you like, but > actually the UBA explicitly says that each line is to be reordered for > display separately, see section 3.4 of UAX#9. The very first step of the BiDi algorithm is to split at "paragraphs", however that's defined, and then do the rest for each paragraph. For one particular paragraph, there's a lot going on: determining embedded levels and such. At one point, at the very beginning of 3.4, a caller may split a paragraph into lines. Then the rest (actual reordering) happens on lines. This is _not_ the same as splitting into lines upfront (that is, define UBA's "paragraphs" as the text file's "lines"), and then determining embedded levels and reshuffling on these smaller units. Emacs does the latter, and so does my specification. I believe it's not my mental model that's weird, but your use of terminology that doesn't match UBA's that confused me. It's pretty confusing and obviously hard to use the proper terminology, since Emacs's definition and the user-perceived notion of a "paragraph" differs from what becomes a "paragraph" according to UBA's definition. Both in Emacs and in my spec, a "line" of the text file maps to a "paragraph" according to UBA's phrasing. (Except when determining the paragraph direction, where Emacs uses its own human-perceived emptyline-separated paragraph, rather than lines. Which is a nice thing to do.) Anyways, I'm glad it turned out we're on the same page, it's just the terminology that's truly confusing. 
cheers, egmont From unicode at unicode.org Thu Feb 7 11:20:02 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Thu, 7 Feb 2019 18:20:02 +0100 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: <83a7j7smgz.fsf@gnu.org> References: <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> <83a7j7smgz.fsf@gnu.org> Message-ID: Hi, On Thu, Feb 7, 2019 at 3:27 PM Eli Zaretskii wrote: > It uses a regular expression, see term-prompt-regexp. So, it's not automatic, needs user interaction, and for that reason, may not have worked for me. (I have other weird things in my prompt, like 256-color sequences that Emacs didn't recognize, perhaps this made the regexp matching fail. Nevermind.) > > Whatever it does to know where the prompt is, can it be made into a > > standard, cross-terminal feature? > > Not sure. It's a kind of heuristic, which is why the regexp is > customizable on user level, so that users could adapt it to their > needs, should that be necessary. iTerm2 has a "shell integration" where the prompt contains explicit markers so that no heuristics or user configuration is needed from the terminal. We're trying to somewhat standardize it at https://gitlab.freedesktop.org/terminal-wg/specifications/issues/4 and get more terminals support it. Not sure where this attempt will take us, we'll see. > In what version of Emacs is that? In the latest version 26 I have > here, the tutorial displays with most paragraphs in RTL direction. 25.2 here, it might have obviously changed for a newer version, glad to hear it. My distro will upgrade in about 2 months. 
Since I'm not an Emacs user myself, I hope you don't mind if I don't make extra rounds in upgrading now to verify this. cheers, egmont From unicode at unicode.org Thu Feb 7 11:33:31 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 07 Feb 2019 19:33:31 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Egmont Koblinger on Thu, 7 Feb 2019 18:12:37 +0100) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <83d0o3sn1r.fsf@gnu.org> Message-ID: <83r2cjqz9w.fsf@gnu.org> > From: Egmont Koblinger > Date: Thu, 7 Feb 2019 18:12:37 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > I believe it's not my mental model that's weird, but your use of > terminology that doesn't match UBA's that confused me. Well, let's just say that Emacs uses the HL1 rule, and determines the base direction for the entire chunk of text between empty lines. 
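To make the rule just described concrete, here is a minimal sketch of determining one base direction per chunk of text between empty lines, using the first strong character as in UBA rules P2/P3. This is an illustration only, not Emacs's actual code: the function names are hypothetical, and the handling of isolates and explicit directional controls that P2 requires is deliberately left out.

```python
import unicodedata


def first_strong_direction(text, default="L"):
    """Return 'L' or 'R' according to the first strong character in text.

    Simplification (assumption): characters inside isolates are not skipped,
    unlike what UBA rule P2 prescribes.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "L"
        if bidi_class in ("R", "AL"):
            return "R"
    return default  # no strong character found


def chunk_directions(text, default="L"):
    """Split text at empty lines; return (direction, lines) for each chunk."""
    chunks, current = [], []
    for line in text.split("\n"):
        if line.strip() == "":
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)
    return [(first_strong_direction("\n".join(c), default), c) for c in chunks]


if __name__ == "__main__":
    sample = "abc \u05d0\n\u05d1\n\n\u05d0 abc\nxyz\n\n123\n"
    for direction, lines in chunk_directions(sample):
        print(direction, len(lines), "line(s)")
```

Each line of a chunk would then be handed to the reordering step with the chunk's direction as its base direction; the reordering itself is not shown here.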
From unicode at unicode.org Thu Feb 7 11:39:48 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Thu, 7 Feb 2019 18:39:48 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83r2cjqz9w.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <83d0o3sn1r.fsf@gnu.org> <83r2cjqz9w.fsf@gnu.org> Message-ID: On Thu, Feb 7, 2019 at 6:33 PM Eli Zaretskii wrote: > Well, let's just say that Emacs uses the HL1 rule, and determines the > base direction for the entire chunk of text between empty lines. Exactly! Now it's my turn to figure out how to add this behavior to terminals, preferably stopping before/after prompts too. cheers, egmont From unicode at unicode.org Thu Feb 7 11:53:11 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 07 Feb 2019 19:53:11 +0200 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: (message from Egmont Koblinger on Thu, 7 Feb 2019 18:20:02 +0100) References: <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> <83a7j7smgz.fsf@gnu.org> Message-ID: <83o97nqyd4.fsf@gnu.org> > From: Egmont Koblinger > Date: Thu, 7 Feb 2019 18:20:02 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > > It uses a regular expression, see term-prompt-regexp. > > So, it's not automatic, needs user interaction No, it needs no interaction. 
Unless the regexp doesn't work for you, which you should then report as a bug in Emacs. From unicode at unicode.org Thu Feb 7 12:01:33 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Thu, 7 Feb 2019 19:01:33 +0100 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: <83o97nqyd4.fsf@gnu.org> References: <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> <83a7j7smgz.fsf@gnu.org> <83o97nqyd4.fsf@gnu.org> Message-ID: On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii wrote: > No, it needs no interaction. Unless the regexp doesn't work for you, > which you should then report as a bug in Emacs. Do you mean you aim to maintain a regex that matches everyone's prompt in the world, without a significant amount of false positive matches on non-prompt lines? (It's getting damn off-topic though.) e. From unicode at unicode.org Thu Feb 7 12:37:46 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Thu, 7 Feb 2019 19:37:46 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: Hi Philippe, On Thu, Feb 7, 2019 at 3:21 PM Philippe Verdy wrote: > "Rules" are not formally written, they are just a sense of best practices. 
When it comes to BiDi in terminals, I haven't seen anything that I consider reasonably okay, let alone "best practice". It's a mess. That's why I decided to come up with something. > Bidi plays very badly on terminals Agreed. There are essentially two ways from here: just leave it as bad as it is (or even see various terminal emulators coming up with not well-thought-out hacks that just make it even worse) or try to improve. I picked the latter. > [...] refreshing a typical 80x25 screen takes about one half second, which is much longer than typical user input, so full screen refresh does not work for data input and editing, and terminals implement the echo of user input themselves, ignoring how and when the receiving application will handle the input, and also ignoring whether the application is already sending output to the terminal. I'm really unsure where you're trying to get with it. For one, adding BiDi doesn't introduce the need for significantly larger updates. Whenever a partial repaint of the screen was sufficient, even with BiDi in the game it will remain sufficient. Another thing: I'm not sure that 9.6kbps is a bottleneck to worry about. It's present if you connect to a device via serial port, but will you really do this in combination with BiDi? The use case I have in mind much more is running a terminal emulator locally, or ssh'ing to a remote machine, for getting various kinds of productive work done (e.g. writing a text file in someone's native RTL script in a text editor). These are orders of magnitude faster. > It's hard or impossible to synchronize this, and local echo on the terminal causes havoc. If input mixes with output (e.g. you press some keys while you're waiting for make/gcc to compile your app, and these letters appear onscreen), the visual result is broken even without BiDi.
I cannot eliminate this kind of breakage by introducing BiDi, nor can I build up something from scratch that somewhat resembles the current terminal emulator world but fixes all of its oddities. > But the concept of "line" or "paragraph" in terminal protocols is extremely fuzzy. It's then very difficult to take into account the additional Bidi constraints, as it's impossible to reconcile BOTH the logical ordering (what is encoded in the transmitted data or kept in history buffers) and the visual ordering. I don't try to reconcile logical and visual ordering within the same paragraph, I agree it's impossible, it's semantic nonsense. But I try to reconcile them in the sense that sometimes the visual order is the desired one, sometimes the logical order, so let's make it possible to use one for one paragraph, and the other one for another paragraph. > That's why there are terminal protocols that absolutely don't want to play with the logical ordering and require all their data to be transmitted in visual order (in which case, there's no bidi handling at all). This is one of the modes in my recommendation. If your application requires this mode (as e.g. Emacs does), use this mode and you're good. > In fact most terminal protocols are very defective and were never designed to handle Bidi input Maybe it's high time someone fixed this defect, then? :) > And here your unit (logical lines) is not even defined in the terminal protocol and not known to the emitting application, which has no input about the properties of the final output terminal. So the terminal must perform guesses. As it can insert additional linebreaks itself, and scroll out some portion of the text, there's no way to delimit the effect of "bidi controls". The basic requirement for correctly handling bidi controls is to make sure that paragraph delimitations are known and stable.
If additional breaks can occur anywhere on what you think is a "logical line" but which differs from what the emitting application (or a static text document which is output "as is" without any change to reformat it) considers a line, these bidi controls just make things worse and it becomes impossible to make reasonable guesses about paragraph delimitations in the terminal. The result becomes unpredictable and most often will not even make any sense, as the terminal always uses visual ordering but loses track of the logical ordering (and things get worse when there are complex clusters or characters that cannot even fit in a monospaced grid). If an exact definition of hard vs. soft wrapped lines is what you miss from the specification, okay, I'll add it to a future version. I don't know how terminals performing guesses occurred to you, they sure don't (as for hard vs. soft newlines). > The basic requirement for correctly handling bidi controls is to make sure that paragraph delimitations are known and stable. Since we're talking about bidi controls being emitted, we must be talking about the implicit mode of the terminal (as per ECMA's and my specification). Even without BiDi, you can have something on the screen, move the prompt upwards, and then "cat" a file. The result will partially overwrite the existing contents, and partially leave them there. The result will be unreadable, broken. So will it be with BiDi. Now, with the regular use case of printing to an unused (empty) area, the handling of soft vs. hard newlines is consistent across all terminal emulators I could test. The terminals remember exactly when a newline was printed vs. where the contents wrapped to the next line, and nothing prevents them from doing BiDi accordingly, which my specification says they need to do. Surprisingly all of PuTTY's, Konsole's, Mlterm's and Terminal.app's developers got it wrong and they do BiDi on the physical lines.
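The soft vs. hard newline bookkeeping described above can be sketched as follows. This is a toy model under stated assumptions, not the code of any actual emulator: every character is taken to occupy exactly one cell, output only goes to empty area, and the names layout and logical_lines are hypothetical.

```python
def layout(text, width):
    """Lay text out on rows of `width` cells.

    Returns a list of (content, soft_wrapped) pairs: soft_wrapped is True
    when the row ended because the text ran past the right margin, and
    False when it ended at an explicit newline (or at the end of the text).
    """
    rows = []
    for logical in text.split("\n"):
        if logical == "":
            rows.append(("", False))
            continue
        pieces = [logical[i:i + width] for i in range(0, len(logical), width)]
        for piece in pieces[:-1]:
            rows.append((piece, True))    # continued on the next row
        rows.append((pieces[-1], False))  # ended by a hard newline
    return rows


def logical_lines(rows):
    """Rejoin soft-wrapped rows so BiDi can run on logical lines."""
    lines, buffer = [], ""
    for content, soft_wrapped in rows:
        buffer += content
        if not soft_wrapped:
            lines.append(buffer)
            buffer = ""
    if buffer:
        lines.append(buffer)
    return lines


if __name__ == "__main__":
    rows = layout("abcdefgh\nxy", 3)
    print(rows)
    print(logical_lines(rows))
```

Running BiDi on each element of logical_lines(rows), rather than on each physical row, is the behavior argued for here; doing it per physical row corresponds to the behavior criticized above.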
This is just one example of how broken the current state of BiDi is, and why it should be fixed. > Terminal emulators only perform guesses; most of these guesses are valid only with "simple" scripts with one character per cell, assuming a minimum resolution of each cell (the minimum is an 8x8 pixel square, too small for Asian scripts, but typical for rendering on old analog TVs; the typical one is a half-width rectangle, not really much larger, but about 50% taller, and many Asian scripts still do not fit well in it). These protocols were just made for Latin and similar simpler scripts (Cyrillic, Greek, and simple Japanese scripts, or Hangul jamos ignoring clusters and presented only with halfwidth characters, ignoring all complex clusters). For everything else, there's no defined behavior, no support, no reference documentation, everything is untested, you get extremely variable results, and the output could be completely garbled and unreadable. I'm really lost: what kind of guesses are you talking about, and how are font sizes or anything else you're talking about relevant? If there's one thing terminal emulators really don't do, then that's guessing. All the terminal emulators are pretty much a deterministic state machine. > If you want to play well with most terminals you have to limit a lot what you can do with "terminal protocols" and strictly limit your use of controls. In fact the only "stable" thing which works more or less is the basic MIME plain text profile, which just uses a single encoding for ALL kinds of newlines (and completely ignores the distinction between the 3 main kinds of line breaks).
That's where you need to insert hints: basically the encoded text has to assume a minimum display width, and any "line" longer than about 70 character cells is assumed to be continued on the next line, unless that next line is empty, and Bidi controls are not used at all; instead the direction is guessed from character properties at "reasonable" paragraph boundaries determined heuristically by the terminal emulator but not encodable in the data stream itself. If you wish to create a terminal-based application that can display Hebrew text, assuming nothing more from the underlying terminal emulator than that it can present these glyphs, you're already lost. Most of the terminal emulators can only lay out the glyphs from left to right. That is, you need to emit visual order, that is, the reverse of the logical order for Hebrew. Some terminal emulators, with their default settings, will again run the BiDi algorithm and reverse these back to the incorrect order. Bummer! VTE is about to join this latter team, but also introduces an escape sequence to turn this behavior off. You can start emitting this escape sequence from your application. VTE will understand it. Other emulators won't and will still display the word in reversed order. (A point in my specification is to get it standardized and get all these other BiDi-aware emulators to catch up and recognize this escape sequence.) Some other terminal emulators misinterpret this escape sequence (although it's part of ECMA) and do something different. They'll also have to be fixed. The current state of terminal emulators literally doesn't give you a common minimum on top of which you can do any Hebrew or Arabic by any means. Of course my specification cannot fully fix this: If you still pick a terminal emulator not conforming to this spec, you'll still be out of luck.
What it can do, among many things, is to bring all the BiDi-aware terminal emulators onto a common base, one that also aligns with the non-BiDi-aware ones (subject to them not misinterpreting the BiDi-related sequence). >> > - use a single newline on continuation >> >> Continuation of what exactly?> > > Continuation of paragraphs on the next visual line. I think this did not require any more precision; it was clear enough from the context from which you extracted this word, unless you did not read anything. When I ask for clarification, I ask for clarification because I didn't understand, and not for assumptions that I may not have read anything, or so. As you can see from previous discussions, there's a whole lot of confusion about the terminology. E.g. even "paragraph" has multiple incompatible definitions; this caused a lot of misunderstanding between Eli and me until we realized we were actually talking about the same thing. Thus, when you clarify as "continuation of paragraphs", I still cannot be fully sure that your message came across as you intended, because which "paragraph" among the multiple definitions? Plus there are a whole lot more things you can continue, e.g. the list of command line options with the next entry (to refer back to the previous example with "zip"). Nevermind, let's forget it. Philippe, with all due respect, I have the feeling that you have some fundamental problems with my work (and I'm tempted to ask back: have you read it at all?), but your message what your problem is just doesn't come across to me. Could you please avoid all those irrelevant stories with baud rate and font size and Asian scripts and whatnot, and clearly get to your point?
cheers, egmont From unicode at unicode.org Thu Feb 7 14:00:20 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 07 Feb 2019 22:00:20 +0200 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: (message from Egmont Koblinger on Thu, 7 Feb 2019 19:01:33 +0100) References: <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> <83a7j7smgz.fsf@gnu.org> <83o97nqyd4.fsf@gnu.org> Message-ID: <83mun7qsh7.fsf@gnu.org> > From: Egmont Koblinger > Date: Thu, 7 Feb 2019 19:01:33 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii wrote: > > > No, it needs no interaction. Unless the regexp doesn't work for you, > > which you should then report as a bug in Emacs. > > Do you mean you aim to maintain a regex that matches everyone's prompt > in the world, without a significant amount of false positive matches > on non-prompt lines? Yes. From unicode at unicode.org Thu Feb 7 15:38:22 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 7 Feb 2019 22:38:22 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: Le jeu. 7 févr. 2019 à
19:38, Egmont Koblinger a écrit : > As you can see from previous discussions, there's a whole lot of > confusion about the terminology. And it was exactly the subject of my first message sent to this thread! You probably missed it. > Philippe, with all due respect, I have the feeling that you have some > fundamental problems with my work (and I'm tempted to ask back: have > you read it at all?), but your message what your problem is just > doesn't come across to me. Could you please avoid all those irrelevant > stories with baud rate and font size and Asian scripts and whatnot, > and clearly get to your point? > I have never said anything about your work because I don't know where you spoke about it or where you made some proposals. I must have missed one of your messages (did it reach this list?). So don't take that as a personal attack, because this only started with a reply I made (the one specifically speaking about the various ambiguities of encoded newlines in terminal protocols). These do not match the basic plain text definition (similar to MIME), which was made only for static documents and never tuned for interactive bidirectional use. That use includes for example text editors, which also require a model of the 2D layout and also make assumptions about "characters" visible in a single cell of a regularly spaced grid with a known number of lines and columns, independent of the lines of the text rendered and read on it.
Terminals are not displaying plain text; they create their own upper layer protocol which requires and enforces the 2D layout (whereas Unicode is a purely linear protocol with only relations between one character and the next one in a 1D stream, and no assumption at all about their display width, which cannot be monospaced in all scripts), and their contents are definitely not encoded in logical order: try adding characters at the end of a logical line. With Bidi text you do not just replace the content of one cell, you have to scroll the content of surrounding cells, and your input caret position does not necessarily change, or you'll reach a point where a visual line will be split in two parts, but not at the rest position, and some parts moved up or down. Bidi does not specify the 2D layout completely; it is purely 1D and speaks about left and right directions. It does not specify what happens when contents do not fit on the visual line for the text which is already present there before inserting new text, or even what will be replaced if you are in replace mode and not in insert mode. The Bidi algorithm is not designed to handle overwrites, and neither is the whole Unicode standard itself, which is written as if all text were inserted only at the end of lines and never replaced anything. For now terminal protocols, and the emulators trying to implement them, which must mix the desynchronized input and output (especially when they have to do "local echo" of the input for performance reasons over slow serial links where there's no synchronization between the local buffer of the terminal and the remote virtual buffer of the terminal emulator in the emitting app, even those using the best "termcap" definitions), have no easy way to do that.
The logical encoding of Unicode does not play well here, and the time needed to resynchronize the local and remote buffers is a limiting factor. Over a 9.6 kbps link, refreshing the whole screen takes too long to be done on every keystroke: either user input would have to be dramatically slow if local echoing is also enabled, or most keystrokes typed too fast would have to be discarded, which makes user input very unreliable and requires constant correction. These protocols are definitely not human-friendly, as they depend on strict timing, which is not the way humans enter text. This timing is also unpredictable and highly variable over serial links, and the protocols do not specify any timing requirements. In fact time is constantly ignored, even though it plays an evident role.

If you look at historic "terminal" protocols, techniques were used to control time: notably the XON/XOFF protocol, or mechanical constraints, especially when the output was a printer (with a daisy wheel or a matrix head). But timing was only controlled between one machine and another; a human could not really interact asynchronously. And that was at a time when full-screen text editors did not even exist (at most users were typing "on the flow", and text layout was completely ignored).

This changed radically when the output became a screen: the mechanical restrictions were removed, and the output was assumed to be instantaneous.

Some older terminal protocols for mainframes notably were better than today's VT-like protocols: you did not transmit just what would be displayed, but you also described the screen area where user input is allowed and the position of fields and navigation between them: the terminal then had no difficulty avoiding breaking the output when entering text with local echo, with good performance. But it was still impossible to input bidirectional text, only small separate unidirectional fields.
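To put the 9.6 kbps figure above in perspective, here is a rough back-of-the-envelope calculation. The 80x24 screen, the 10 bits per character (8 data bits plus start and stop bits) and the one byte per cell are all assumptions of this sketch, not figures from the message; escape sequences and attributes are ignored.

```python
def full_refresh_seconds(columns, rows, bits_per_second, bits_per_char=10):
    """Time to resend every cell once, ignoring escape sequences and attributes."""
    chars_per_second = bits_per_second / bits_per_char
    return (columns * rows) / chars_per_second


# Assumed 80x24 screen over a 9600 bit/s serial line.
print(full_refresh_seconds(80, 24, 9600))  # 2.0
```

Under these assumptions a full repaint takes about two seconds, far longer than the gap between two keystrokes.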
We were not concerned by the possibility of multiple scripts and Bidi inside the input text, and Bidi in the output areas (where input was protected) was possible using visual ordering and the 2D layout of the terminal, which was also fixed. There were no resizable windows as there are today: the input grid size was static and determined at terminal initialization, or a menu allowed selecting a new resolution and informing the remote application, so that it could handle it in its own virtual terminal and compute a layout that would refresh the user's terminal correctly. But this action was exceptional and in fact slow over serial links; we were not working with today's tens of megabits/s or more over virtual networking links.

Today these links are better used with real protocols made for 2D and allowing a web application to manage the input with a presentation layer (HTML) and with JavaScript helpers (that avoid the roundtrip time). Even the roundtrip is now much faster (typically around 30 ms), and the transmission time for the whole page is much lower than that, except for complex contents having little or no user interaction (such as high-resolution images or videos, rendered in a well-delimited rectangular area separated from the user input areas, or just layered with transparency on top of the output area in a multilayered 3D layout).

But basic text terminals have never evolved and have lagged behind today's need. Most of them were never tested for internationalization needs: not just Bidi (Hebrew being simpler than Arabic...), but also sizing constraints, notably for Asian scripts (except CJK languages, which are modeled with characters occupying only 1 or 2 cells side by side, and where line breaks cannot occur in the middle of a pair but can occur anywhere else, which is even simpler than for Latin with its complex breaking rules!).
The extension to CJK allowed Latin-based terminals to evolve and support several discrete sizes, but still as integer multiples of fixed cells. This was possible only by adding a few attribute bits to the cells to indicate whether they display the whole character, its top or bottom half, or its left or right half. This used the same attribute technique that was used to delimit input fields in input forms for mainframes, something that was completely forgotten and remains forgotten today with today's VT-* protocols, to indicate which side of the communication link controls the content of specific areas.

One solution would be to restore the attributes missing in VT-* protocols and encourage terminal emulators to support them, and then to respecify precisely the timing of events and when/how/where user input and the output resulting from input can interact, and when local echoing of input is allowed for reactivity or disabled (meaning that the user will have to wait for the response from the remote side).

As well today's VT-* protocols have no possibility to be scriptable: implementing a way to transport fragments of JavaScript would be fine. But a modern terminal protocol should now be more or less based on HTML. The good experiments are those found in the few existing web browsers for text terminals: they are far superior to all existing legacy VT-* like protocols, including the Windows command line terminal, the X11 terminal, and all the many protocols found in the existing "termcap"/"libterm" libraries on Unix/Linux systems for legacy display terminals and printer terminals.

Text-only terminals are now aging but no longer needed for user-friendly interaction, they are used for technical needs where the only need is to be able to render static documents without interacting with them, except scrolling them down, and only if they provide help in the user's language.
Printers have also abandoned these protocols (the last remnant was HPGL, but it is no longer necessary): mechanical printers have almost died out (except for printing some payment tickets), replaced by inkjets and lasers (which are much faster, and less noisy too). Text terminals continue to fill a niche in an area where people use very little natural text, but have to be able to view documents that would be rendered much more easily and in a more user-friendly way with rich-text protocols like HTML and 2D/3D layout engines.

Why not imagine a text terminal with attributes delimiting a rectangular area containing an object in which rich text (HTML or other) will be rendered and controlled by the 2D/3D engine and left unmanaged internally by the terminal protocols? Why do these protocols not allow more independent "side" streams to control multiple objects? Such protocols exist, e.g. in X11, for example to transmit icons or fonts, or to control external areas such as the content of a notification bar, a menu, or a title bar.

Old terminal protocols based on a single regular grid of equal cells are definitely not user friendly and not suitable for all international text (only a few scripts allow readable monospaced fonts). Today we need more flexibility. But it will be difficult to readapt these old protocols to support the necessary extensions and remain compatible with applications still using them. We've made the transition for printers, but still not for displays, and still not for input devices other than classical mechanical keyboards, even though we all now know the GUI, resizable/movable/stackable windows, mouse input, touch input on screens or pads, and many other kinds of sensors. I see little way to adapt the old VT-like protocols (including the DOS/Windows ANSI protocol). In fact we still don't have any standard model for interactive applications with multiple parallel data streams and interactions between them.
Most of these efforts have converged on HTML (and related meta-protocols focusing on it for the final rendering, such as XUL; there was an effort on PostScript but it stalled long ago; GUI libraries have been proliferating after X11, Win16, Win32, Swing...). Maybe it would be useless to try filling the gaps in legacy VT-like protocols (not worth the effort when they continue to fill a niche whose usage is constantly decreasing). But maybe we could renew the efforts made on HTML over text terminals (although web browsers for them have lost most of their initial adopters: it's just easier to create web services and use them from PCs or mobile devices). Web APIs have largely taken over; the rest is filled by popular desktop or mobile OSes with their standard GUIs, and otherwise by multimedia audio/video codecs, OpenGL or similar, or the DirectX family on Windows and extensions for X11 on Unix/Linux, plus common internationalization frameworks for applications (and common databases like CLDR).

From unicode at unicode.org Thu Feb 7 16:35:23 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 7 Feb 2019 22:35:23 +0000
Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
In-Reply-To: <83mun7qsh7.fsf@gnu.org>
References: <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> <83a7j7smgz.fsf@gnu.org> <83o97nqyd4.fsf@gnu.org> <83mun7qsh7.fsf@gnu.org>
Message-ID: <20190207223523.0517b753@JRWUBU2>

On Thu, 07 Feb 2019 22:00:20 +0200
Eli Zaretskii via Unicode wrote:

> > From: Egmont Koblinger
> > Date: Thu, 7 Feb 2019 19:01:33 +0100
> > On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii wrote:
> > > No, it needs no interaction. Unless the regexp doesn't work for
> > > you, which you should then report as a bug in Emacs.
> > Do you mean you aim to maintain a regex that matches everyone's
> > prompt in the world, without a significant amount of false positive
> > matches on non-prompt lines?
> Yes.

Wow! You'll do well to match a prompt such as '2p ', which I used for a while.

Richard.

From unicode at unicode.org Thu Feb 7 17:38:24 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 8 Feb 2019 00:38:24 +0100
Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
In-Reply-To:
References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org>
Message-ID:

Hi Philippe,

> I have never said anything about your work because I don't know where you spoke about it or where you made some proposals. I must have missed one of your messages (did it reach this list?).

This entire conversation started by me announcing here my work, aiming to bring usable BiDi to terminal emulators.

> Terminals are not displaying plain text, they create their own upper layer protocol which requires and enforces the 2D layout [...] Bidi does not specify the 2D layout completely, it is purely 1D and speaks about left and right direction

That's one of the reasons why it's not as simple as "let's just run the UBA inside the terminal", one of the reasons why gluing the two worlds together requires a substantial amount of design work.

> For now terminal protocols, and emulators trying to implement them; that must mix the desynchronized input and output (especially when they have to do "local echo" of the input [...]

I assume by "local echo" you're talking about the Send/Receive Mode (SRM) of terminals, and not the "stty echo" line discipline setting of the kernel, because as far as the terminal emulator is concerned, the kernel is already remote, and it's utterly irrelevant for us whether it's the kernel or the application sending back the character.

SRM is only supported by a few terminal emulators, and we're about to drop it from VTE, too (https://gitlab.gnome.org/GNOME/vte/issues/69).

> If you look at historic "terminal" protocols,

I'm mostly interested in the present and future. In the past, only for curiosity, and to the extent necessary to understand the present and to plan for the future.

> Some older terminal protocols for mainframes notably were better than today's VT-like protocols: you did not transmit just what would be displayed, but you also described the screen area where user input is allowed and the position of fields and navigation between them:

This is not seen in today's graphical terminal emulators.

> Today these links are better used with real protocols made for 2D and allowing a web application to manage the input with presentation layer (HTML) and with javascript helpers (that avoid the roundtrip time).

Sure, if you need another tool, let's say a dynamic webpage in your browser, rather than a terminal emulator to perform your task effectively, so be it. I'm not claiming terminal emulators are great for everything, I'm not claiming terminal emulators should be used for everything.

> But basic text terminals have never evolved and have lagged behind today's need.

I disagree with the former part. There are quite a few terminal emulators out there, and many have added plenty of new great features recently.

Whether they're up to today's needs depends on what your needs are. If you need something utterly different, go ahead and use whatever that is, such as maybe a web browser. If you're good with terminals, that's fine too.
And there's a slim area where terminal emulators are mostly good for you, you'd just need a tiny little bit more from them. And maybe for some people this tiny little bit more happens to be BiDi.

> Most of them were never tested for internationalization needs:

Terminal emulators weren't created with internationalization in mind. I18n goals are added one by one. Nowadays combining accents and CJK are supported by most emulators. Time to stretch it further with BiDi, shaping, spacing combining marks for Devanagari, etc.

> [...] delimit input fields in input forms for mainframes, something that was completely forgotten and remains forgotten today with today's VT-* protocols, to indicate which side of the communication link controls the content of specific areas

Something that was completely forgotten, probably for good reasons, and I don't see why it should be brought back.

> As well today's VT-* protocols have no possibility to be scriptable: implementing a way to transport fragments of JavaScript would be fine.

I have absolutely no incentive to work in this direction.

> Text-only terminals are now aging but no longer needed for user-friendly interaction, they are used for technical needs where the only need is to be able to render static documents without interacting with them, except scrolling them down, and only if they provide help in the user's language.

Text-only terminals are no longer needed??? Well, strictly speaking, computers aren't needed either, people lived absolutely fine lives before they were invented :)

If you get to do some work, depending on the kind of work, terminal emulators may or may not be a necessary or a useful tool for you. For certain tasks you don't really have anything else, or at least terminals are way more effective than other approaches. For other tasks (e.g. text editing) it's mostly a matter of taste whether you use a terminal or a graphical app. For yet other tasks, terminal emulators take you nowhere.

My work aims to bring BiDi into terminal emulators in a reasonably well designed way, rather than the ad-hoc and pretty broken ways some emulators have already attempted this. If this is what you were looking for (as many people are), good for you. If you don't care about it, because let's say you'd rather use other tools to get your BiDi work done, so be it, that's also fine.

> Why not imagine a text terminal with attributes delimiting a rectangular area containing an object in which rich-text (HTML or other) will be rendered and controlled by the 2D/3D engine and left unmanaged internally by the terminal protocols?

Because I'm not redesigning the essentials of terminal emulators, just bringing BiDi into whatever these terminal emulators already are.

Because probably all the terminal emulators are developed by a few enthusiasts as a hobby in their pretty limited free time, so they go for what is reasonable to implement, is likely to be used by applications, and what they think makes sense.

> Old terminal protocols based on a single regular grid of equal cells are definitely not user friendly and not suitable for all international text (only a few scripts allow readable monospaced fonts).

Terminals are, in many aspects, not user friendly. The best you can say about them is that they're poweruser friendly and developer friendly.

How much they are suitable for international text depends on how much compromise someone is willing to make, e.g. whether they're ready to accept monospace fonts for their language's script. If nothing else, my proposal at least makes terminals usable for Hebrew. However, in plenty of terminal emulators' bug trackers there's a request for BiDi; these requests usually demonstrate it with Arabic, and show examples of other terminal emulators that do some BiDi as reference. This means that for most people requesting BiDi, having monospace fonts for Arabic (plus shaping I assume) is apparently a good enough compromise.

> Today we need more flexibility.

Sure. You get it outside of terminal emulators. Or you can start changing terminal emulators to accommodate the new needs, put all the work in that (or hire someone to do it), and see where it goes.

I, for one, am not in the slightest bit interested in abandoning the character grid and allowing for proportional fonts. This would just break a gazillion things. Nor am I interested in reviving local echo, introducing rectangle areas where data can be typed into; nor making emulators scriptable by an app sending e.g. JavaScript code to them.

What I am interested in, for whatever reason, is bringing BiDi into the existing world of terminal emulators.

> But it will be difficult to readapt these old protocols to support the necessary extensions and remain compatible with applications still using them.

You see it as difficult, I see it as a challenge that requires buy-in from so many parties and modifications to so much software that I don't see it as viable.

To summarize: Terminal emulators currently have a strict character grid model, and tons of other peculiarities, limitations and legacy to live with (and features that have practically died over the decades). What I do is bring BiDi into this world (in its current state) with as few modifications to the basics of terminal emulation as absolutely necessary. If you have ideas for other small, incremental changes, I'm curious to hear them! If you'd like to see way more substantial changes to the very core of terminal emulation, like proportional fonts, JS code downloaded to the terminal etc., I'm not the right guy to talk to; thanks for your understanding!

cheers,
egmont

From unicode at unicode.org Thu Feb 7 18:52:00 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 8 Feb 2019 01:52:00 +0100
Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
In-Reply-To:
References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org>
Message-ID:

Adding a single protection bit to cell attributes, to indicate that cells are either protected or become transparent (with the rest of the attributes/character field then indicating the id of another terminal grid or rendering plugin creating its own layer and having its own scrolling state and dimensions), can allow convenient things, including the possibility of managing a grid-based system of stackable windows.

You can design one of the layers to allow input (managed directly in the terminal, with local echo, without transmission delays and without risk of overwriting surrounding contents). Asynchronous behavior can be defined as well between the remote application/OS and the local processing in the terminal.

The protocol can also support an extension to provide alternate streams (take MIME multipart as an example). This can even be used to transport the inputs and outputs for each layer, and additional streams to support (Java)scripts, or the content of an image, or a link to a video stream.

And just like with classic graphics interfaces, you can have more than just solid RGB colors and add an alpha layer. The single-rectangular-flat grid design is not the only option.
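As a rough illustration of the cell attributes described above, here is a minimal sketch. Everything in it is hypothetical: the `Cell` fields, the `flatten` and `type_locally` helpers, and the simplification that every layer has the same dimensions as the base grid and does not scroll. No existing terminal protocol is being described.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cell:
    char: str = " "
    protected: bool = False         # True: local typing must not overwrite this cell
    layer_id: Optional[int] = None  # set: cell is transparent and shows that layer


Grid = List[List[Cell]]


def flatten(base: Grid, layers: Dict[int, Grid]) -> List[str]:
    """Resolve transparent cells and return the rows a flat display would show."""
    rows = []
    for y, row in enumerate(base):
        shown = []
        for x, cell in enumerate(row):
            if cell.layer_id is not None:
                cell = layers[cell.layer_id][y][x]  # look through to the other layer
            shown.append(cell.char)
        rows.append("".join(shown))
    return rows


def type_locally(grid: Grid, x: int, y: int, ch: str) -> bool:
    """Local echo: accept the keystroke only if the cell is not protected."""
    if grid[y][x].protected:
        return False
    grid[y][x].char = ch
    return True


# One row: a protected label "A:", then three cells delegated to input layer 1.
base = [[Cell("A", protected=True), Cell(":", protected=True),
         Cell(layer_id=1), Cell(layer_id=1), Cell(layer_id=1)]]
input_layer = [[Cell() for _ in range(5)]]

type_locally(input_layer, 2, 0, "h")
type_locally(input_layer, 3, 0, "i")
print(flatten(base, {1: input_layer}))  # ['A:hi ']
print(type_locally(base, 0, 0, "X"))    # False
```

The typed characters never touch the protected label, and no round trip to the remote side is needed to display them.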

Layered approaches can then even be rendered easily on hardware, by mapping these virtual layers and flattening them internally in the terminal emulator onto the single flat grid supported by the hardware. The result is more or less equivalent to graphic RGB frames, except that the unit is not a single pixel but a whole cell, with not just one color but a pair of colors, an encoded character and a font selected for that cell (or, if a single font is supported, a dynamic font with glyph ids stored for that font, prescaled for the cell size). The hardware then does the rest to build the pixels of the frame, and this can easily be accelerated.

The layered approach could also be used to link together the cells that use the same script and font settings, in order to use proportional fonts when monospaced fonts are not usable, and to justify their text in the field (which may itself become scrollable when needed for input).

Having multiple communication streams between the terminal emulator and the remote application allows the application to query the terminal's properties and behave in a smarter way than with just static "termcaps" that do not take into account the actual state of the remote terminal.

All this requires some extensions to VT-like protocols (using specific escape sequences, just like the Xterm extensions for X11).

You can also reconsider how "old" mainframe terminals worked: the user in fact never submitted characters one by one to the remote application. The application sent a full screen and an input form, and the user could fill in the form on the terminal and press a "submit/send" button once the data was entered. While the user was entering data, there was absolutely no need to communicate each keystroke to the application: everything was handled by the terminal itself, as instructed by the application (it could even perform form data validation with input formats and some conditions, possibly with a script as well).

In other words, they worked mostly like an HTML input form with a submit button. Such a mode is very useful for small devices because they don't have to react interactively to the user: the transmission delays (which may be long) are no longer a problem, the user can enter and correct data easily, and the editing facilities don't need to be handled by the remote application (which today could be a very tiny device with in fact much less processing power than the terminal emulator, and which would have no knowledge at all of the fonts needed).

A terminal emulator can do a lot of things itself, locally. This would also be useful for many modern application servers that need to serve lots of remote clients, possibly over very slow internet links with long roundtrip times.

The idea behind this is to allow distributing the workload and deciding which side will handle part or all of the I/O. Of course it will transport text (preferably in a Unicode UTF), but text is not the only content to transport. There are also audio/video/images, security items (certificates, signatures, personal data that should remain private and be encrypted, or only be sent to the application in a one-way-hashed form), plus some states/flags that could provide visual/audio hints to the user when working in the rendered input/output form with his local terminal emulator.

I spoke about HTML because terminal-based browsers have existed for a long time, and some of them are still maintained in 2019 (w3m, still used as a W3C-sponsored demo; Lynx, the best known on Linux; or elinks): https://www.slant.co/topics/4702/~web-browsers-that-run-in-a-terminal

This gives a good idea of what is needed, what a good terminal protocol can do, and what the many legacy VT-like protocol variants have never tried to unify. These browsers don't reinvent the wheel: HTML is already fine to support this.
And w3m is not restricted to show only text cells, it showcases the multilayered approach (even if these extra layers are not visible in Gnome terminal and similar, because their VT-like protocol does not have the necessary cababilities (this is where the VT-extensions are needed). And this is not just for "geeks" or technicians (or programmers that don't care at all about languages given they all speak a "tenchglish" jargon): such text-based web browser is also what is needed for accessibility (think about people with visual or aural deficiencies: HTML5 was too much focused on precise rendering on a graphic device for people with "normal" visual and aural conditions, and we've seen many websites abusing the offered facilities so much that their applications or graphic design became unusable by many, and not supprotable at all even with assistive technologies). Le ven. 8 f?vr. 2019 ? 00:39, Egmont Koblinger a ?crit : > Hi Philippe, > > > I have never said anything about your work because I don't know where > you spoke about it or where you made some proposals. I must have missed one > of your messages (did it reach this list?). > > This entire conversation started by me announcing here my work, aiming > to bring usable BiDi to terminal emulators. > > > Terminals are not displaying plain text, they create their own upper > layer protocol which requires and enforces the 2D layout [...] Bidi does > not specify the 2D layout completely, it is purely 1D and speaks about left > and right direction > > That's one of the reasons why it's not as simple as "let's just run > the UBA inside the terminal", one of the reasons why gluing the two > worlds together requires a substantial amount of design work. > > > For now terminal protocols, and emulators trying to implement them; that > must mix the desynchronized input and output (especially when they have to > do "local echo" of the input [...] 
> > I assume by "local echo" you're talking about the Send/Receive Mode > (SRM) of terminals, and not the "stty echo" line discipline setting of > the kernel, because as far as the terminal emulator is concerned, the > kernel is already remote, and it's utterly irrelevant for us whether > it's the kernel or the application sending back the character. > > SRM is only supported by a few terminal emulators, and we're about to > drop it from VTE, too (https://gitlab.gnome.org/GNOME/vte/issues/69). > > > If you look at historic "terminal" protocols, > > I'm mostly interested in the present and future. In the past, only for > curiosity, and to the extent necessary to understand the present and > to plan for the future. > > > Some older terminal protocols for mainframes notably were better than > today's VT-like protocols: you did not transmit just what would be > displayed, but you also described the screen area where user input is > allowed and the position of fields and navigation between them: > > This is not seen in today's graphical terminal emulators. > > > Today these links are better used with real protocols made for 2D and > allowing an web application to mange the input with presentation layer > (HTML) and with javascript helpers (that avoid the roundtrip time). > > Sure, if you need another tool, let's say a dynamic webpage in your > browser, rather than a terminal emulator to perform your taks > effectively, so be it. I'm not claiming terminal emulators are great > for everything, I'm not claiming terminal emulators should be used for > everything. > > > But basic text terminals have never evolved and have lagged behind > today's need. > > I disagree with the former part. There are quite a few terminal > emulators out there, and many have added plenty of new great features > recently. > > Whether they're up to today's needs, depends on what your needs are. 
> If you need something utterly different, go ahead and use whatever > that is, such as maybe a web browser. If you're good with terminals, > that's fine too. And there's a slim area where terminal emulators are > mostly good for you, you'd just need a tiny little bit more from them. > And maybe for some people this tiny little bit more happens to be > BiDi. > > > Most of them were never tested for internationalization needs: > > Terminal emulators weren't created with internationalization in mind. > I18n goals are added one by one. Nowadays combining accents and CJK > are supported by most emulators. Time to stretch it further with BiDi, > shaping, spacing combining marks for Devanagari, etc. > > > [...] delimit input fields in input forms for mainframes, something that > was completely forgotten and remains forgotten today with today's VT-* > protocols, to indicate which side of the communcation link controls the > content of specific areas > > Something that was completely forgotten, probably for good reasons, > and I don't see why it should be brought back. > > > As well today's VT-* protocols have no possibility to be scriptable: > implemeint a way to transport fragments of javascripts would be fine. > > I have absolutely no incentive to work in this direction. > > > Text-only terminals are now aging but no longer needed for user-friendly > interaction, they are used for technical needs where the only need is to be > able to render static documents without interactiving with it, except > scrolling it down, and only if they provide help in the user's language. > > Text-only terminals are no longer needed??? Well, strictly speaking, > computers aren't needed either, people lived absolutely fine lives > before they were invented :) > > If you get to do some work, depending on the kind of work, terminal > emulators may or may not be a necessary or a useful tool for you. 
For > certain tasks you don't really have anything else, or at least > terminals are way more effective than other approaches. For other > tasks (e.g. text editing) it's mostly a matter of taste whether you > use a terminal or a graphical app. For yet other tasks, terminal > emulators take you nowhere. > > My work aims to bring BiDi into terminal emulators in a reasonably > well designed way, rather than the ad-hoc and pretty broken ways some > emulators have already attempted this. If this is what you were > looking for (as many people are), good for you. If you don't care > about it, because let's say you'd rather use other tools to get your > BiDi work done, so be it, that's also fine. > > > Why not imagine a text -terminal with attributes deliminting a > rectangular area containing an object in which rich-text (HTML or other) > will be rendered and controled by the 2D/3D engine and left unmanaged > internally by the terminal protocols? > > Because I'm not redesigning the essentials of terminal emulators, just > bringing BiDi into whatever these terminal emulators already are. > > Because probably all the terminal emulators are developed by a few > enthusiasts as a hobby in their pretty limited free time, so they go > for what is reasonable to implement, is likely to be used by > applications, and what they think makes sense. > > > Old terminal protocols based on a single regular grid of equal cells are > definitely not user friendly and not suitable for all international text > (only a few scripts allow readable monospaced fonts). > > Terminals are, in many aspects, not user friendly. The best you can > tell about them is that they're poweruser friendly and developer > friendly. > > How much they are suitable for international text depends on how much > compromise someone is willing to make, e.g. whether they're ready to > accept monospace fonts for their language's script. If nothing else, > my proposal at least makes terminals usable for Hebrew. 
However, in > plenty of terminal emulators' bug trackers there's a request for BiDi, > they usually demonstrate it with Arabic, and show examples of other > terminal emulators that do some BiDi as reference. This means that for > most people requesting BiDi, having monospace fonts for Arabic (plus > shaping I assume) is apparently a good enough compromise. > > > Today we need more flexibility. > > Sure. You get it outside of terminal emulators. Or you can start > changing terminal emulators to accommodate the new needs, put all > the work in that (or hire someone to do it), and see where it goes. > > I, for one, am not to the slightest bit interested in abandoning the > character grid and allowing for proportional fonts. This would just > break a gazillion of things. Nor am I interested in reviving local > echo, introducing rectangle areas where data can be typed into; nor > making emulators scriptable by an app sending e.g. JavaScript code to > them. > > What I am interested in, for whatever reason, is bringing BiDi into > the existing world of terminal emulators. > > > But it will be difficult to readapt these old protocols to support the > necessary extensions and remain compatible with applications still using > them. > > You see it as difficult, I see it as a challenge that requires buy-in > from so many parties and modifications to so much software that I don't > see it as viable. > > To summarize: Terminal emulators currently have a strict character > grid model, and tons of other peculiarities, limitations and legacy to > live with (and features that have practically died over the decades). > What I do is bring BiDi into this world (in its current state) with as > few modifications to the basics of terminal emulation as absolutely > necessary. If you have ideas for other small, incremental changes, I'm > curious to hear them!
If you'd like to see way more substantial > changes to the very core of terminal emulation, like proportional > fonts, JS code downloaded to the terminal etc., I'm not the right guy > to talk to; thanks for your understanding! > > > cheers, > egmont > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 8 00:40:44 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 8 Feb 2019 06:40:44 +0000 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: <20190208064044.27f75709@JRWUBU2> On Fri, 8 Feb 2019 00:38:24 +0100 Egmont Koblinger via Unicode wrote: > I, for one, am not to the slightest bit interested in abandoning the > character grid and allowing for proportional fonts. This would just > break a gazillion of things. The message I take from that and this thread in general is that Emacs and 'M-x term' are the route to take if one only has proportional fonts. What's the sledgehammer for Windows? Where do I find the specification for fixed-width fonts (is wcswidth() the core?) and how do I select the set of fonts to use? Do I need to use fontconfig where available? Richard. 
From unicode at unicode.org Fri Feb 8 01:05:11 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 08 Feb 2019 09:05:11 +0200 Subject: Bidi paragraph direction in terminal emulators BiDi in terminal emulators In-Reply-To: <20190207223523.0517b753@JRWUBU2> (message from Richard Wordingham via Unicode on Thu, 7 Feb 2019 22:35:23 +0000) References: <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83d0o7v6nz.fsf@gnu.org> <20190205000547.0a38260b@JRWUBU2> <83womete38.fsf@gnu.org> <20190206233243.7ebaafc1@JRWUBU2> <83a7j7smgz.fsf@gnu.org> <83o97nqyd4.fsf@gnu.org> <83mun7qsh7.fsf@gnu.org> <20190207223523.0517b753@JRWUBU2> Message-ID: <83imxurc9k.fsf@gnu.org> > Date: Thu, 7 Feb 2019 22:35:23 +0000 > From: Richard Wordingham via Unicode > > > > Do you mean you aim to maintain a regex that matches everyone's > > > prompt in the world, without a significant amount of false positive > > > matches on non-prompt lines? > > > Yes. > > Wow! You'll do well to match a prompt such as '2p ', which I used for > a while. Like I said: for any reasonable prompt that doesn't match, you can report a bug, and have the Emacs maintainers deliberate whether your case is important enough to be supported by default. Failing that, you can set the regexp to a suitable value in a mode hook defined on your init file. 
From unicode at unicode.org Fri Feb 8 03:34:29 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 08 Feb 2019 11:34:29 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <20190208064044.27f75709@JRWUBU2> (message from Richard Wordingham via Unicode on Fri, 8 Feb 2019 06:40:44 +0000) References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> Message-ID: <831s4ir5cq.fsf@gnu.org> > Date: Fri, 8 Feb 2019 06:40:44 +0000 > From: Richard Wordingham via Unicode > > > I, for one, am not to the slightest bit interested in abandoning the > > character grid and allowing for proportional fonts. This would just > > break a gazillion of things. > > The message I take from that and this thread in general is that Emacs > and 'M-x term' are the route to take if one only has proportional fonts. Not sure why. There are terminal emulators out there which support proportional fonts. Emacs is perhaps the only one whose terminal emulator currently supports bidi more or less in full, but is that related to proportional fonts? > What's the sledgehammer for Windows? Not sure what you meant. "M-x term" doesn't work on Windows. > Where do I find the specification for fixed-width fonts (is > wcswidth() the core?) and how do I select the set of fonts to use? Do I > need to use fontconfig where available? That depends on the underlying C library and other facilities; basically on your OS. AFAIK wcwidth will give the results consistent with the UCD only if you use glibc. In Emacs, you have the functions char-width and string-width that take their data from EastAsianWidth.txt. Not sure about other facilities, and I don't really understand what environment are you asking about -- are you talking about C/C++ programs? 
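For readers who want to experiment with the EastAsianWidth.txt approach mentioned above, here is a minimal Python sketch that approximates the number of terminal cells a string occupies. It relies only on the standard `unicodedata` module; the helper names `cell_width` and `string_cells` are made up for this illustration. It is a simplification, not the algorithm used by glibc's wcwidth() or by Emacs's char-width: control characters, ambiguous-width characters and emoji sequences are not treated specially.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate number of terminal cells for one code point (hypothetical helper)."""
    # Combining marks and format characters do not occupy a cell of their own.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East_Asian_Width values W (Wide) and F (Fullwidth) occupy two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else is assumed to occupy one cell (a simplification).
    return 1


def string_cells(text: str) -> int:
    """Sum of the per-code-point widths, similar in spirit to wcswidth()."""
    return sum(cell_width(ch) for ch in text)
```

An application that wants its columns to line up would compare `string_cells(s)` with the field width it intends to fill; whether the terminal agrees depends on the terminal using the same width data.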
From unicode at unicode.org Fri Feb 8 03:40:11 2019 From: unicode at unicode.org (Denis Jacquerye via Unicode) Date: Fri, 8 Feb 2019 09:40:11 +0000 Subject: Two more ellispis-type interpunctations: ?.. and !.. In-Reply-To: References: Message-ID: These were proposed with others in 13-237 (http://unicode.org/L2/L2013/13237-punctuation.txt) and were declined (https://www.unicode.org/L2/L2014/14101-closed-ai.html). The proposal presented them as Russian punctuation marks. On Thu, 7 Feb 2019, 16:08 Serik Serikbay via Unicode, wrote: > The Khakass language is very close to Kyrgyz .. > > > > > On Thu, Feb 7, 2019 at 8:54 PM "Jörg Knappen" via Unicode < > unicode at unicode.org> wrote: > >> While working on a corpus of Kyrgyz language, a Turkic language written >> in the Cyrillic script, >> I encountered two ellipsis-type interpunctations, namely ?.. and !.. >> >> Note that this is not (yet) a proposal to encode them as single Unicode >> characters although I would definitely >> use such characters when available because they make the text processing >> tool chain much simpler and more >> robust. It is a survey question: >> >> Have you encountered ?.. or !.. in other languages than Kyrgyz? >> >> --Jörg Knappen >> > From unicode at unicode.org Fri Feb 8 06:30:42 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Fri, 8 Feb 2019 13:30:42 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <831s4ir5cq.fsf@gnu.org> References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> Message-ID: Hi Eli, > Not sure why. There are terminal emulators out there which support > proportional fonts.
Well, of course, a terminal emulator can load any font, even proportional, but as it places them in the grid, it will look ugly as hell (like this one: https://askubuntu.com/q/781327/398785 ). Sure you could apply some tricks to make it look a bit less terrible (e.g. by centering each glyph in its cell rather than aligning to the left), but it still won't look great. In the world of terminal emulation, many applications expect things to align properly according to the wcwidth() of the string they emit. You abandon this (start placing the glyphs one after the other in a row, no matter how wide they are), and plenty of applications suddenly fall apart big time (let alone questions like how you define the terminal's width in characters). > Emacs is perhaps the only one whose terminal > emulator currently supports bidi more or less in full Let's not get started from here, please. In Emacs-25.2's terminal emulator I executed "cat TUTORIAL.he". For the entire contents, LTR paragraph direction was used and was aligned to the left. Maybe something has changed for 26.x, I don't know. In my work I carefully evaluated 4 other "BiDi-aware" terminal emulators, as well an ancient specification for BiDi which I had to read about twenty times to get to pretty much understand what it's talking about. Identified substantial issues with both the standard as well as all the independent implementations (which didn't care about this standard at all). I show that existing terminal emulators are incompatible to the extent that an app cannot reliably print any RTL text by any means at all. At this point I firmly believe it should be clear that BiDi in terminals is not a topic where one can just go ahead and do something, without having a specification first. 
I lay down principles which a proper BiDi-supporting platform I believe needs to meet, argue why multiple modes (explicit and implicit) are inevitable, examine what to do with paragraph direction, cursor location and tons of other issues, and come up with concrete suggestions for how (partially based on that ancient specification) these all should be exactly addressed. Then, after putting literally months of work in it, I come here to announce my work and ask for feedback. So far, from a thread of 100+ mails, I take away two pieces of useful feedback: one is that shaping should be done differently, and the other one is that, for some use cases, a bigger scope of data should be used for autodetecting the "paragraph direction" (as per UBA's terminology). And now you suddenly tell that Emacs's terminal supports BiDi more or less in full??? Sorry, I just don't buy it. If you retain this claim, I'd pretty please like to see a specification of its behavior, one which addresses at least all the major the issues I address in my work, one which I could replace my work with, one which I'd be happy to implement in gnome-terminal in the solid belief that it's about as good as my proposal, and would wholeheartedly recommend for other terminal emulators to adopt. Or maybe, by any chance, when you said Emacs's terminal supported BiDi more or less in full, did you perhaps went with your own idea what a BiDi-aware terminal emulator needs to support; ignoring all those things I detail in my work, such as the inevitable need for explicit mode, the need for deciding the scope of implicit vs. explicit mode, and much more?
thanks a lot, egmont From unicode at unicode.org Fri Feb 8 06:56:03 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Fri, 8 Feb 2019 13:56:03 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: Hi Philippe, > Adding a single bit of protection in cell attributes to indicate they are either protected or become transparent (and the rest of the attributes/character field indicates the id of another terminal grid or rendering plugin crfeating its own layer and having its own scrolling state and dimensions) can allow convenient things, including the possibility of managing a grid-based system of stackable windows. > You can design one of the layer to allow input (managed directly in the terminal, with local echo without transmission delays and without risks of overwriting surrounding contents. At this point you're already touching much more the core of terminal emulator behavior than e.g. my BiDi work does, it's a way more essential, way more complex change ? with much less clear goal to me, like, why should emulators implement it, why would applications start using it etc. If you wish to go for this direction, good luck! (If anything, what I do see somewhat feasibile, is building up something from scratch that looks much more like a proportional-font text editing widget, or even a rich text editor, rather than terminal emulator, and figure out step by step how to get a shell and simple utilities and later more complex utilities run in that. 
This could be a new platform which, by putting decades of hard work in it, which I cannot do voluntarily, could eventually replace terminal emulators.) Philippe, I hate to say it, but at the risk of being impolite, I just have to. Your ideas would take terminal emulators extremely far from what they are now, with no clear goals and feasibility to me; and are no longer relevant to BiDi. All I see is we're wasting each other's time on utterly irrelevant topics, and since I see exactly zero chance of any worthwhile takeaway coming out of this, unfortunately I can no longer devote my limited free time to this; I just have to quit this conversation between the two of us. I'm really sorry. best regards, egmont From unicode at unicode.org Fri Feb 8 07:45:15 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 08 Feb 2019 15:45:15 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: (message from Egmont Koblinger on Fri, 8 Feb 2019 13:30:42 +0100) References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> Message-ID: <83tvhepf6c.fsf@gnu.org> > From: Egmont Koblinger > Date: Fri, 8 Feb 2019 13:30:42 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > Hi Eli, > > > Not sure why. There are terminal emulators out there which support > > proportional fonts. > > Well, of course, a terminal emulator can load any font, even > proportional, but as it places them in the grid, it will look ugly as > hell Maybe so, but the original text was this: Emacs and 'M-x term' are the route to take if one only has proportional fonts. Which I don't understand, since the terminal emulator in Emacs doesn't do anything special about proportional fonts, AFAIK.
> In Emacs-25.2's terminal emulator I executed "cat TUTORIAL.he". For > the entire contents, LTR paragraph direction was used and was aligned > to the left. Maybe something has changed for 26.x, I don't know. I told you what changed: Emacs 25 forces LTR paragraph direction, whereas Emacs 26 and later does not. You can get dynamic paragraph direction in your Emacs 25 as well if you set bidi-paragraph-direction to nil in the *term* buffer. > And now you suddenly tell that Emacs's terminal supports BiDi more or > less in full??? Emacs implements the latest UBA from Unicode 11; and the Emacs terminal emulator inserts all the text into a "normal" Emacs buffer, and displays that buffer as any other buffer. So yes, you have there full UBA support. I thought this was clear, sorry if it wasn't. One caveat with this is that the Emacs emulator works only on Posix platforms, it doesn't work on MS-Windows. > Sorry, I just don't buy it. If you retain this claim, I'd pretty > please like to see a specification of its behavior The specification is the latest version of the UBA, augmented with three deviations, two of them allowed by the UBA, the third isn't: . Emacs uses HLA1 for determining base paragraph direction: it decides on base direction only once for every chunk of text delimited by empty lines; . Emacs doesn't by default remove bidi formatting controls from display; . Emacs wraps long lines _after_ reordering, not before. I think that's it. If I forget something, please forgive me: I implemented this 10 years ago, so maybe something evades me at the moment. > one which addresses at least all the major the issues I address in > my work, one which I could replace my work with, one which I'd be > happy to implement in gnome-terminal in the solid belief that it's > about as good as my proposal, and would wholeheartedly recommend for > other terminal emulators to adopt. 
> > Or maybe, by any chance, when you said Emacs's terminal supported BiDi > more or less in full, did you perhaps went with your own idea what a > BiDi-aware terminal emulator needs to support; ignoring all those > things I detail in my work, such as the inevitable need for explicit > mode, the need for deciding the scope of implicit vs. explicit mode, > and much more? Sorry, I cannot afford testing everything you wrote in your specification. I think most, if not all, of that is covered, but I certainly didn't test that, so maybe I'm wrong. Please feel free to test the relevant aspects and ask questions if you need more "inside information". I do hope that my impression about "most everything being supported" is correct, because that would give you a working implementation/prototype of most of the features you want to see in terminal emulators, so you could actually try the behavior to see if it's convenient, causes problems, etc. One other feature you may find interesting (something that I don't think you covered in your document, at least not explicitly) is that Emacs supports visual-order cursor motion, in addition to the "usual" logical-order. The latter is, of course, the default, but you can switch to the former if you set the visual-order-cursor-movement option to a non-nil value. 
From unicode at unicode.org Fri Feb 8 07:57:56 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Fri, 8 Feb 2019 14:57:56 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83tvhepf6c.fsf@gnu.org> References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <83tvhepf6c.fsf@gnu.org> Message-ID: Hi Eli, > Emacs implements the latest UBA from Unicode 11; and the Emacs > terminal emulator inserts all the text into a "normal" Emacs buffer, > and displays that buffer as any other buffer. So yes, you have there > full UBA support. One of the essentials of my work is that there's much more to BiDi in terminal emulators than running the UBA. If one takes a step backwards to look at the big picture, it becomes clear that in some cases the UBA needs to be run, while in other cases it mustn't. And then of course there needs to be some means of switching, and so on... According to the description you give, Emacs's terminal always applies the BiDi algorithm, therefore by its design only implements what I call "implicit mode", and not the "explicit mode". On the other hand, in order to run Emacs inside a terminal emulator, you need to set that terminal emulator to explicit mode, so that it doesn't reshuffle the characters. The behavior it expects from the outer terminal doesn't match the behavior it provides in its inner one. As an interesting consequence, if you open Emacs, then inside it a terminal emulator, and then inside it an Emacs, it will display BiDi incorrectly, in reversed order. I'm making the strong claim that by running the UBA a terminal emulator doesn't become BiDi aware, there's much more it needs to do. 
cheers, egmont From unicode at unicode.org Fri Feb 8 08:27:35 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 08 Feb 2019 16:27:35 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: (message from Egmont Koblinger on Fri, 8 Feb 2019 14:57:56 +0100) References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <83tvhepf6c.fsf@gnu.org> Message-ID: <83pns2pd7s.fsf@gnu.org> > From: Egmont Koblinger > Date: Fri, 8 Feb 2019 14:57:56 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > According to the description you give, Emacs's terminal always applies > the BiDi algorithm, therefore by its design only implements what I > call "implicit mode", and not the "explicit mode". You can have what you call the "explicit mode" if you set the variable bidi-display-reordering to nil. This only supports the LTR explicit mode, though. Personally, I don't see when would the RTL explicit mode be useful: there's no RTL-only text in real life, so some reordering is always required. But maybe I'm missing something. > I'm making the strong claim that by running the UBA a terminal > emulator doesn't become BiDi aware, there's much more it needs to do. Like I said, you are welcome to test the rest of your requirements and ask questions if you think something is not supported or isn't working as expected. 
From unicode at unicode.org Fri Feb 8 08:42:51 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Fri, 8 Feb 2019 15:42:51 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83pns2pd7s.fsf@gnu.org> References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <83tvhepf6c.fsf@gnu.org> <83pns2pd7s.fsf@gnu.org> Message-ID: On Fri, Feb 8, 2019 at 3:28 PM Eli Zaretskii wrote: > You can have what you call the "explicit mode" if you set the variable > bidi-display-reordering to nil. So, if someone is running a mixture of applications requiring implicit vs. explicit modes, they'll have to continuously toggle the setting of their terminal back and forth. Just as for Konsole and friends there's a graphical setting, correspondingly for Emacs's terminal there's this bidi-display-reordering setting. Now, I, as a user, want BiDi to work as seamlessly as possible, definitely without me having to repeatedly switch a setting back and forth if the applications could just as well do it automatically. One of the basics of my spec. Whether Emacs will adopt this, or will keep requiring users to toggle this setting back and forth depending on the particular app they wish to run, is not my call. 
cheers, egmont From unicode at unicode.org Fri Feb 8 09:49:35 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 08 Feb 2019 17:49:35 +0200 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: (message from Egmont Koblinger on Fri, 8 Feb 2019 15:42:51 +0100) References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <83tvhepf6c.fsf@gnu.org> <83pns2pd7s.fsf@gnu.org> Message-ID: <83o97mp9f4.fsf@gnu.org> > From: Egmont Koblinger > Date: Fri, 8 Feb 2019 15:42:51 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > On Fri, Feb 8, 2019 at 3:28 PM Eli Zaretskii wrote: > > > You can have what you call the "explicit mode" if you set the variable > > bidi-display-reordering to nil. > > So, if someone is running a mixture of applications requiring implicit > vs. explicit modes, they'll have to continuously toggle the setting of > their terminal back and forth. Why would they want to toggle it back and forth? What are the use cases where it makes sense to mix both modes? IME, you either need one or the other, never both. In any case, I'm just trying to help you map your requirements into existing Emacs features. If this is not helpful, feel free to disregard. > Now, I, as a user, want BiDi to work as seamlessly as possible, > definitely without me having to repeatedly switch a setting back and > forth if the applications could just as well do it automatically. One > of the basics of my spec. > > Whether Emacs will adopt this, or will keep requiring users to toggle > this setting back and forth depending on the particular app they wish > to run, is not my call. 
You can hardly expect Emacs (or any other application) to support control sequences that are not yet defined, let alone standardized. When they become sufficiently widely available, I'm sure someone will add them to Emacs. From unicode at unicode.org Fri Feb 8 10:44:53 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Fri, 8 Feb 2019 17:44:53 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83o97mp9f4.fsf@gnu.org> References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <83tvhepf6c.fsf@gnu.org> <83pns2pd7s.fsf@gnu.org> <83o97mp9f4.fsf@gnu.org> Message-ID: Hi Eli, > Why would they want to toggle it back and forth? What are the use > cases where it makes sense to mix both modes? IME, you either need > one or the other, never both. (Back to the basics, which are mentioned pretty clearly in my specification, I believe, and I've also described here multiple times... sigh.) For certain apps, one of the modes is required (e.g. for cat it's the implicit mode). For other tasks it's the other mode (e.g. for emacs the explicit mode). In a typical terminal session, you don't just use one of these kinds of commands. You use various commands in a sequence, e.g. a cat followed by an emacs, then a zip, then whatnot, then emacs again, then a cat and a grep, etc... The very last thing I would want to do as a user is to toggle some setting back and forth, let alone remember which command needs which mode. > You can hardly expect Emacs (or any other application) to support > control sequences that are not yet defined, let alone standardized. The most essential sequence, BDSM to switch between implicit and explicit modes, has been defined for like 28 years now. 
Sure, I bring slight changes and clarifications to it, as well as introduce new ones. As of my recommendation which I've announced, these new ones are defined as well. It's probably never going to be a de jure standard, adopted by ECMA or whatever "authority", but that's not what happens anywhere else in terminal emulators nowadays. An "authority" which doesn't keep up to date with innovations, doesn't have a feedback forum, and hasn't released a new version for 28 years, is clearly not suitable for making progress. We have just announced a public forum called "Terminal WG" for terminal emulator developers to collaborate and join their efforts wrt. new extensions, rather than ad-hoc collaborations or each going their own separate ways. We'd like its work to be widely accepted as a basis for the desired behavior. My BiDi work is one of the works hosted there. It'll probably never be an "authority" like ECMA, but hopefully will be some kind of well-respected place of specs to adhere to. > When they become sufficiently widely available, I'm sure someone will > add them to Emacs. There's always a chicken-and-egg problem with this attitude. At the very least, I'm kindly asking Emacs to emit BDSM so that when it's fired up on a gnome-terminal, it'll have the terminal's BiDi automatically disabled. This has nothing to do yet with Emacs's built-in terminal emulator. Addressing that is surely a much bigger chunk of work; I hope it'll happen if my BiDi proposal indeed turns out to be successful.
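To make the BDSM request concrete, here is a small Python sketch of what "emitting BDSM" could look like for a full-screen application. It assumes the ECMA-48 encoding in which BDSM is mode 8, so that CSI 8 h (set) selects implicit mode and CSI 8 l (reset) selects explicit mode. The function names are hypothetical, and switching back to implicit mode on exit is an assumption of this sketch rather than something stated in the thread.

```python
import sys

CSI = "\x1b["                 # Control Sequence Introducer: ESC [
BDSM_IMPLICIT = CSI + "8h"    # SM, mode 8: the terminal applies the BiDi algorithm
BDSM_EXPLICIT = CSI + "8l"    # RM, mode 8: the terminal leaves the cell order alone


def bdsm(implicit: bool) -> str:
    """Return the control sequence selecting implicit or explicit mode."""
    return BDSM_IMPLICIT if implicit else BDSM_EXPLICIT


def run_fullscreen_app(draw, out=sys.stdout) -> None:
    """Hypothetical wrapper: run `draw` with the terminal in explicit mode."""
    out.write(bdsm(implicit=False))      # the app does its own reordering
    try:
        draw(out)
    finally:
        # Assumption of this sketch: hand the terminal back in implicit mode,
        # which is what line-oriented tools such as cat need.
        out.write(bdsm(implicit=True))
        out.flush()
```

A terminal that does not know mode 8 should simply ignore both sequences, so an application emitting them loses nothing on such terminals.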
cheers, egmont From unicode at unicode.org Fri Feb 8 11:16:09 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Fri, 8 Feb 2019 17:16:09 +0000 (GMT) Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> Message-ID: <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> Andrew West wrote: > Just reminding you that "The initial character in a variation sequence is never a nonspacing combining mark (gc=Mn) or a canonical decomposable character" (The Unicode Standard 11.0 §23.4). This means that a variation sequence cannot be defined for any precomposed letters and diacritics, so for example you could not italicize the word "fête" by simply adding VS14 after each letter because "ê" (in NFC form) cannot act as the base for a variation sequence. You would have to first convert any text to be italicized to NFD, then apply VS14 to each non-combining character. This alone would make a VS solution unacceptable in my opinion. As it happens I was not aware of that before, and in fact I had already produced a PDF document for submission to the Unicode Technical Committee when I read your post. https://www.unicode.org/L2/L2019/19063-italic-vs.pdf So, it is an issue that needs to be resolved.
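The two-step procedure Andrew describes (convert to NFD, then add VS14 after each non-combining character) can be sketched in a few lines of Python. This is only an illustration of the mechanics under discussion: no italic variation sequences are defined by Unicode, and the function name `add_vs14` is invented for the example.

```python
import unicodedata

VS14 = "\ufe0d"  # U+FE0D VARIATION SELECTOR-14


def add_vs14(text: str) -> str:
    """Decompose to NFD, then append VS14 after every character that is not a mark."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        out.append(ch)
        # General categories Mn, Mc and Me are combining marks; they cannot
        # start a variation sequence, so the selector goes after base characters only.
        if not unicodedata.category(ch).startswith("M"):
            out.append(VS14)
    return "".join(out)
```

For "fête" the result is f, VS14, e, VS14, U+0302, t, VS14, e, VS14: the selector has to sit between the base letter and its circumflex, and the text is no longer in NFC, which is the cost Andrew points out.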
I am a researcher and I am looking for the best way to do this so as to get a good result that people can use, I am not trying to assert that my suggestion is necessarily the best way to do it. For example, I accepted the suggestion that James made. The meeting of the Unicode Technical Committee is not due until April and hopefully some other people will send in documents and comments on the topic. Hopefully the issue that Andrew mentions can be resolved in some way. William Overington Friday 8 February 2019 From unicode at unicode.org Fri Feb 8 14:39:22 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 8 Feb 2019 20:39:22 +0000 Subject: Columns in Terminal Emulators (was: Bidi paragraph direction in terminal emulators) In-Reply-To: <83tvhepf6c.fsf@gnu.org> References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <83tvhepf6c.fsf@gnu.org> Message-ID: <20190208203922.24cde2ca@JRWUBU2> On Fri, 08 Feb 2019 15:45:15 +0200 Eli Zaretskii via Unicode wrote: > > From: Egmont Koblinger > > Date: Fri, 8 Feb 2019 13:30:42 +0100 > > Cc: Richard Wordingham , > > unicode Unicode Discussion > > > > Hi Eli, > > > > > Not sure why. There are terminal emulators out there which > > > support proportional fonts. > > > > Well, of course, a terminal emulator can load any font, even > > proportional, but as it places them in the grid, it will look ugly > > as hell > > Maybe so, but the original text was this: > > Emacs and 'M-x term' are the route to take if one only has > proportional fonts. > > Which I don't understand, since the terminal emulator in Emacs doesn't > do anything special about proportional fonts, AFAIK. As a terminal emulator, it does. It abandons straight columns to honour the spacing glyphs' widths. It neither inappropriately truncates nor inappropriately overlaps glyphs. 
These avoided treatments don't just make text ugly; they can make it unreadable.

Richard.

From unicode at unicode.org Fri Feb 8 14:53:27 2019
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Fri, 08 Feb 2019 13:53:27 -0700
Subject: Encoding italic
Message-ID: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com>

I'd like to propose encoding italics and similar display attributes in plain text using the following stateful mechanism:

- Italics on: ESC [3m
- Italics off: ESC [23m
- Bold on: ESC [1m
- Bold off: ESC [22m
- Underline on: ESC [4m
- Underline off: ESC [24m
- Strikethrough on: ESC [9m
- Strikethrough off: ESC [29m
- Reverse on: ESC [7m
- Reverse off: ESC [27m
- Reset all attributes: ESC [m

where ESC is U+001B.

This mechanism has existed for around 40 years and is already supported as widely as any new Unicode-only convention will ever be.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Fri Feb 8 15:02:06 2019
From: unicode at unicode.org (Rebecca Bettencourt via Unicode)
Date: Fri, 8 Feb 2019 13:02:06 -0800
Subject: Encoding italic
In-Reply-To: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com>
References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com>
Message-ID:

+?

-- Rebecca Bettencourt

On Fri, Feb 8, 2019 at 12:55 PM Doug Ewell via Unicode wrote:

> I'd like to propose encoding italics and similar display attributes in
> plain text using the following stateful mechanism:
>
> - Italics on: ESC [3m
> - Italics off: ESC [23m
> - Bold on: ESC [1m
> - Bold off: ESC [22m
> - Underline on: ESC [4m
> - Underline off: ESC [24m
> - Strikethrough on: ESC [9m
> - Strikethrough off: ESC [29m
> - Reverse on: ESC [7m
> - Reverse off: ESC [27m
> - Reset all attributes: ESC [m
>
> where ESC is U+001B.
>
> This mechanism has existed for around 40 years and is already supported
> as widely as any new Unicode-only convention will ever be.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>

From unicode at unicode.org Fri Feb 8 15:29:57 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Fri, 8 Feb 2019 22:29:57 +0100
Subject: Encoding italic
In-Reply-To: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com>
References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com>
Message-ID:

Hi guys,

Having been a terminal emulator developer for some years now, I have to say, perhaps surprisingly, that I don't fancy the idea of reusing escape sequences of the terminal world.

(Mind you, I don't find it a good idea to add italic and whatnot formatting support to Unicode at all... but let's put that aside for now.)

There are a lot of problems with these escape sequences, and if you go for a potentially new standard, you might not want to carry these problems.

There is not a well-defined framework for escape sequences. In this particular case you might say it starts with ESC [ and ends with the letter 'm', but how do you know where to end the sequence if that letter 'm' just doesn't arrive? Terminal emulators have extremely complex tables for parsing (and still many of them get plenty of things wrong). It's unreasonable for any random small utility processing Unicode text to go into this business of recognizing all the well-known escape sequences, not even to the extent of knowing where they end. Whatever is designed should be much more easily parseable. Should you say "everything from ESC[ to m", you'll cause a whole bunch of problems when a different kind of escape sequence gets interpreted as Unicode.
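
To make the parsing concern above concrete, here is a minimal Python sketch, not taken from any terminal emulator. It assumes the general control-sequence shape that ECMA-48 describes (the introducer, then parameter bytes 0x30-0x3F, then intermediate bytes 0x20-0x2F, then one final byte 0x40-0x7E) and handles only the 7-bit ESC [ introducer. The name `split_sgr` and the item labels are hypothetical, made up for this illustration.

```python
import re

# Assumed ECMA-48 control-sequence shape, 7-bit introducer only:
# ESC [  parameter bytes 0x30-0x3F  intermediate bytes 0x20-0x2F  final byte 0x40-0x7E
CSI_RE = re.compile(r"\x1b\[([0-?]*)([ -/]*)([@-~])")


def split_sgr(text):
    """Split text into ('text', str), ('sgr', [ints]) and ('other', str) items."""
    items = []
    pos = 0
    for m in CSI_RE.finditer(text):
        if m.start() > pos:
            items.append(("text", text[pos:m.start()]))
        params, inter, final = m.groups()
        # Only plain "digits and semicolons ... m" is treated as SGR here;
        # colon-separated forms such as ESC[4:3m are reported as 'other'.
        if final == "m" and not inter and re.fullmatch(r"[0-9;]*", params):
            # An empty parameter is read as 0, so ESC[m becomes [0].
            items.append(("sgr", [int(p) if p else 0 for p in params.split(";")]))
        else:
            items.append(("other", m.group(0)))
        pos = m.end()
    if pos < len(text):
        items.append(("text", text[pos:]))
    return items


print(split_sgr("plain \x1b[3mitalic\x1b[23m plain"))
print(split_sgr("cut off \x1b[3"))  # the final byte never arrives
```

A sequence whose final byte never arrives does not match the pattern and is passed through as ordinary text, ESC included, which is the situation a small plain-text utility would have to decide how to handle.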
A parser, by the way, would also have to interpret combined sequences like ESC[3;0;1m or the like, for which I don't see a good reason as opposed to having separate sequences for each. Also, it should be carefully evaluated what to do with the C1 CSI (U+009B) used instead of the C0 ESC[ as the opening of an escape sequence; here terminal emulators vary. These just make everything even more cumbersome.

ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity". It's only nowadays, when most terminal emulators support 256 colors and some even support 16M true colors, that some emulators try to push for this bit unambiguously meaning "bold" only, whereas in most emulators it means "both bold and increased intensity". For compatibility reasons, it won't be a smooth switch. Note that "bold" and "increased intensity" only go in the same direction with a white-on-black color scheme; with black-on-white, bold stands out more while increased intensity (a lighter shade of gray instead of black) stands out less. (We could also start nitpicking that the spec doesn't even say that increased intensity is just for the foreground and not for the background too.)

Should this scheme be extended for colors, too? What to do with the legacy 8/16 as well as the 256-color extensions wrt. the color palette? Should Unicode go into the business of defining a fixed set of colors, or allow altering the palette colors using OSC 4 and friends, escape sequences which are supported by about half of the terminal emulators out there?

For 256 colors and true colors, there are two or three syntaxes out there, differing in whether the separator is a colon or a semicolon. ECMA-48 doesn't say anything about it; ITU-T T.416 does, although it's absolutely not clear. See e.g. the discussion in the comment section of https://gist.github.com/XVilka/8346728 : in Dec 2018, we just couldn't figure out which syntax exactly ITU-T T.416 means.
Moreover, due to a common misinterpretation of the spec, one of the positional parameters is often omitted.

Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m for curly underline. What to do with them? Where to draw the line between what to add to Unicode and what not to? Will Unicode possibly be a bottleneck for further improvements in terminal emulators, because from now on every new mode we figure out we'd like to have in terminals should go through some Unicode committee? And what if Unicode wants to have a mode that terminal emulators aren't interested in: who will assign numbers to them that don't clash with terminals? Who will somehow keep the two worlds in sync?

What to do with things that Unicode might also want to have, but that don't exist in terminal emulators due to their nature, such as switching to a different font size?

> This mechanism [...] is already supported
> as widely as any new Unicode-only convention will ever be.

I truly doubt this. These escape sequences are specific to terminal emulation, an extremely narrow subset of where Unicode is used and rich text might be desired.

I see it as a much more viable approach if Unicode goes for something brand new, something clean, easily parseable, and it remains the job of specific applications to serve as a bridge between the two worlds. Or, if it wants to adopt some already existing technology, I find HTML/CSS a much better starting point.

regards,
egmont

On Fri, Feb 8, 2019 at 9:55 PM Doug Ewell via Unicode wrote:
>
> I'd like to propose encoding italics and similar display attributes in
> plain text using the following stateful mechanism:
>
> - Italics on: ESC [3m
> - Italics off: ESC [23m
> - Bold on: ESC [1m
> - Bold off: ESC [22m
> - Underline on: ESC [4m
> - Underline off: ESC [24m
> - Strikethrough on: ESC [9m
> - Strikethrough off: ESC [29m
> - Reverse on: ESC [7m
> - Reverse off: ESC [27m
> - Reset all attributes: ESC [m
>
> where ESC is U+001B.
>
> This mechanism has existed for around 40 years and is already supported
> as widely as any new Unicode-only convention will ever be.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>

From unicode at unicode.org Fri Feb 8 15:36:17 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 08 Feb 2019 23:36:17 +0200
Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
In-Reply-To: (message from Egmont Koblinger on Fri, 8 Feb 2019 17:44:53 +0100)
References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <83tvhepf6c.fsf@gnu.org> <83pns2pd7s.fsf@gnu.org> <83o97mp9f4.fsf@gnu.org>
Message-ID: <83bm3motda.fsf@gnu.org>

> From: Egmont Koblinger
> Date: Fri, 8 Feb 2019 17:44:53 +0100
> Cc: Richard Wordingham ,
> unicode Unicode Discussion
>
> For certain apps, one of the modes is required (e.g. for cat it's the
> implicit mode). For other tasks it's the other mode (e.g. for emacs
> the explicit mode).

No one in their right minds will run Emacs inside the Emacs terminal emulator. And even for other applications, disabling bidi will almost always be needed only for full-screen programs, which use curses-like libraries to address the entire screen. So you'd switch off reordering for the entire time you are running such an app, then switch it back on after exiting. The other, simpler text applications will always need reordering to be active.

> > You can hardly expect Emacs (or any other application) to support
> > control sequences that are not yet defined, let alone standardized.
>
> The most essential sequence, BDSM to switch between implicit and
> explicit modes, has been defined for like 28 years now. Sure I bring
> slight changes and clarifications to it, as well as introduce new
> ones.
As of my recommendation which I've announced, these new ones are > defined as well. Are there any terminal emulators that support these sequences? > > When they become sufficiently widely available, I'm sure someone will > > add them to Emacs. > > There's always a chicken and egg problem with this attutide. At the > very least, I'm kindly asking Emacs to emit BDSM so that when it's > fired up on a gnome-terminal, it'll have the terminal's BiDi > automatically disabled. Feel free to file a feature request with the Emacs bug tracker about this. Somebody, maybe even myself, is likely to act on that at some point. From unicode at unicode.org Fri Feb 8 15:54:12 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Fri, 8 Feb 2019 22:54:12 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: <83bm3motda.fsf@gnu.org> References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <83tvhepf6c.fsf@gnu.org> <83pns2pd7s.fsf@gnu.org> <83o97mp9f4.fsf@gnu.org> <83bm3motda.fsf@gnu.org> Message-ID: On Fri, Feb 8, 2019 at 10:36 PM Eli Zaretskii wrote: > No one in their right minds will run Emacs inside the Emacs terminal > emulator. And even for other applications, disabling bidi will almost > always needed only for full-screen programs, which use curses-like > libraries to address the entire screen. So you'd switch off > reordering for the entire time you are running such an app, then > switch it back on after exiting. Exactly. But the question is: should it be the user to manually switch it on/off, or should it happen for them automatically under the hood? If the latter, how? My BiDi proposal answers this. Do you have another possible answer? 
> Are there any terminal emulators that support these sequences?

Prior to my specs: Not that I'm aware of. As of my work being available: at least VTE and Mintty are working on it, and I know that iTerm2 was also waiting for some specification. I'm sincerely hoping for even more to follow.

e.

From unicode at unicode.org Fri Feb 8 15:55:58 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 8 Feb 2019 21:55:58 +0000
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To: <831s4ir5cq.fsf@gnu.org>
References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
Message-ID: <20190208215558.59fc19f5@JRWUBU2>

On Fri, 08 Feb 2019 11:34:29 +0200 Eli Zaretskii via Unicode wrote:

> > Date: Fri, 8 Feb 2019 06:40:44 +0000
> > From: Richard Wordingham via Unicode
> >
> > > I, for one, am not to the slightest bit interested in abandoning
> > > the character grid and allowing for proportional fonts. This
> > > would just break a gazillion of things.
> >
> > The message I take from that and this thread in general is that
> > Emacs and 'M-x term' are the route to take if one only has
> > proportional fonts.
>
> Not sure why. There are terminal emulators out there which support
> proportional fonts. Emacs is perhaps the only one whose terminal
> emulator currently supports bidi more or less in full, but is that
> related to proportional fonts?

Emacs is the one I know that can be made to support Indic fonts. It's rather a big tool for such a relatively minor task, which is why I implicitly called it a sledgehammer.

> > What's the sledgehammer for Windows?

> Not sure what you meant. "M-x term" doesn't work on Windows.

So my question is, 'What do I use on Windows?' The application may be disproportionate to the function I use it for.
> > Where do I find the specification for fixed-width fonts (is > > wcswidth() the core?) and how do I select the set of fonts to use? > > Do I need to use fontconfig where available? > That depends on the underlying C library and other facilities; > basically on your OS. AFAIK wcwidth will give the results consistent > with the UCD only if you use glibc. In Emacs, you have the functions > char-width and string-width that take their data from > EastAsianWidth.txt. Not sure about other facilities, and I don't > really understand what environment are you asking about -- are you > talking about C/C++ programs? I will give a concrete application. If I want to make a font that is interpretable for Tai Tham and maximally usable with VTE, what are the VTE-specific constraints for me to be able to use it for Tai Tham when using basic text utilities? For example, if VTE decides that for as two clusters and , can I nevertheless position the above-matra above the ? Richard. From unicode at unicode.org Fri Feb 8 16:08:25 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 8 Feb 2019 22:08:25 +0000 Subject: Encoding italic In-Reply-To: <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> Message-ID: <20190208220825.40fd5f86@JRWUBU2> On Fri, 8 Feb 2019 17:16:09 +0000 (GMT) "wjgo_10009 at btinternet.com via Unicode" wrote: > Andrew West wrote: >> Just reminding you that "The initial character in a variation >> sequence >> is never a nonspacing 
combining mark (gc=Mn) or a canonical >> decomposable character" (The Unicode Standard 11.0 ?23.4). > Hopefully the issue that Andrew mentions can be resolved in some way. This is not a problem. Instead of writing , one just writes . Richard. From unicode at unicode.org Fri Feb 8 16:16:30 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sat, 09 Feb 2019 00:16:30 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <20190208215558.59fc19f5@JRWUBU2> (message from Richard Wordingham via Unicode on Fri, 8 Feb 2019 21:55:58 +0000) References: <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> Message-ID: <8336oyori9.fsf@gnu.org> > Date: Fri, 8 Feb 2019 21:55:58 +0000 > From: Richard Wordingham via Unicode > > > > What's the sledgehammer for Windows? > > > Not sure what you meant. "M-x term" doesn't work on Windows. > > So my question is, 'What do I use on Windows?' The application may be > disproportionate to the function I use it for. Try "M-x shell". Most of "M-x term" is not needed on Windows anyway, because the Windows console doesn't support SGR escapes and other curses-like functionalities, at least not yet. > I will give a concrete application. If I want to make a font that is > interpretable for Tai Tham and maximally usable with VTE, what are the > VTE-specific constraints for me to be able to use it for Tai Tham when > using basic text utilities? For example, if VTE decides that for > as two clusters and above-matra>, can I nevertheless position the above-matra above the > ? For character composition, you must have a shaping engine to talk to, and the shaper should tell you the width of each grapheme cluster it returns. 
I don't see how you can expect wcwidth, or any other interface that was designed to work with _characters_, to be useful when you need to display grapheme clusters. From unicode at unicode.org Fri Feb 8 16:22:40 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 8 Feb 2019 22:22:40 +0000 Subject: Encoding italic In-Reply-To: References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com> Message-ID: <20190208222240.7fbe009a@JRWUBU2> On Fri, 8 Feb 2019 22:29:57 +0100 Egmont Koblinger via Unicode wrote: > Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m > for curly underline. What to do with them? Where to draw the line what > to add to Unicode and what not to? Will Unicode possibly be a > bottleneck of further improvements in terminal emulators, because from > now on every new mode we figure out we'd like to have in terminals > should go through some Unicode committee? And what if Unicode wants to > have a mode that terminal emulators aren't interested in, who will > assign numbers to them that don't clash with terminals? Who will > somehow keep the two worlds in sync? Escape sequences are outside the scope of Unicode. They are part of a higher level protocol (TUS 23.1 'Control codes'). Richard. 
From unicode at unicode.org Fri Feb 8 16:26:28 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 8 Feb 2019 14:26:28 -0800 Subject: Encoding italic In-Reply-To: <20190208220825.40fd5f86@JRWUBU2> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> <20190208220825.40fd5f86@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 8 18:18:14 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Feb 2019 00:18:14 +0000 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <8336oyori9.fsf@gnu.org> References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> Message-ID: <20190209001814.4813f28f@JRWUBU2> On Sat, 09 Feb 2019 00:16:30 +0200 Eli Zaretskii via Unicode wrote: > > Date: Fri, 8 Feb 2019 21:55:58 +0000 > > From: Richard Wordingham via Unicode > > I will give a concrete application. If I want to make a font that > > is interpretable for Tai Tham and maximally usable with VTE, what > > are the VTE-specific constraints for me to be able to use it for > > Tai Tham when using basic text utilities? For example, if VTE > > decides that for as two clusters > > and , can I nevertheless position > > the above-matra above the ? 
> For character composition, you must have a shaping engine to talk to, > and the shaper should tell you the width of each grapheme cluster it > returns. (a) What defines the grapheme clusters? The definition might be terminal-specific. (b) With a terminal that expects a fixed width font, surely the terminal decides how many cells it allocates to a group of characters, and the font designer has to come up with a suitable value based on that. > I don't see how you can expect wcwidth, or any other > interface that was designed to work with _characters_, to be useful > when you need to display grapheme clusters. Well I can envisage a decision being made that a grapheme cluster str (as decreed by the terminal) shall occupy wcswidth(str) cells - "The wcswidth() function returns the number of column positions for the wide-character string s, truncated to at most length n". Richard. From unicode at unicode.org Fri Feb 8 18:23:32 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Feb 2019 00:23:32 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> <20190208220825.40fd5f86@JRWUBU2> Message-ID: <20190209002332.4da93c64@JRWUBU2> On Fri, 8 Feb 2019 14:26:28 -0800 Asmus Freytag via Unicode wrote: > On 2/8/2019 2:08 PM, Richard Wordingham via Unicode wrote: > On Fri, 8 Feb 2019 17:16:09 +0000 (GMT) > "wjgo_10009 at btinternet.com via Unicode" wrote: > > Andrew West wrote: > > Just reminding you that "The initial character in a variation > sequence > 
is never a nonspacing combining mark (gc=Mn) or a canonical
> decomposable character" (The Unicode Standard 11.0 §23.4).
>
> Hopefully the issue that Andrew mentions can be resolved in some way.
>
> This is not a problem. Instead of writing , one just writes
> .
>
> And .... introducing yet another convention, which is that combining
> marks inherit the font of the base character.
>
> Remember, italics, even though presented as a boolean attribute in
> most UIs is in fact typographically a font selection.

Wouldn't be the base character for the selection of the font?

Richard.

From unicode at unicode.org Fri Feb 8 19:42:44 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 9 Feb 2019 01:42:44 +0000
Subject: Encoding italic
In-Reply-To: <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com>
Message-ID: <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com>

William,

Rather than having the user insert the VS14 after every character, the editor might allow the user to select a span of text for italicization. Then it would be up to the editor/app to insert the VS14s where appropriate.

For Andrew's example of "fête", the user would either type the string:
"f" + "ê" + "t" + "e"
or the string:
"f" + "e" + <combining circumflex> + "t" + "e".

If the latter, the application would insert VS14 characters after the "f", "e", "t", and "e". The application would not insert a VS14 after the combining circumflex
because the specification does not allow VS characters after combining marks; they may only be used on base characters.

In the first "spelling", since the specifications forbid VS characters after any character which is not a base character (in other words, not after any character which has a decomposition, such as "ê"), the application would first need to convert the string to the second "spelling", and proceed as above. This is known as converting to NFD.

So in order for VS14 to be a viable approach, any application would
- need to convert any selected span to NFD, and
- only insert VS14 after each base character.
And those are two operations which are quite possible, although they do add slightly to the programmer's burden. I don't think it's a "deal-killer".

Of course, the user might insert VS14s without application assistance. In which case hopefully the user knows the rules. The worst case scenario is where the user might insert a VS14 after a non-base character, in which case it should simply be ignored by any application. It should never "break" the display or the processing; it simply makes the text for that document non-conformant. (Of course putting a VS14 after "ê" should not result in an italicized "ê".)
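
The two operations described above (convert to NFD, then add VS14 after each base character) can be sketched in a few lines of Python. This is only an illustration under the thread's assumption that VS14 (U+FE0D) would mean "italic"; Unicode defines no such variation sequences, and the function name `italicize_with_vs14` is hypothetical. Following Andrew West's wording, the sketch treats every character that is not a combining mark as a base, which also covers spaces and punctuation; a real implementation might restrict this to letters.

```python
import unicodedata

VS14 = "\uFE0D"  # VARIATION SELECTOR-14; its use for italics is only the proposal under discussion


def italicize_with_vs14(text):
    """Return text in NFD with VS14 appended after every non-mark character."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        out.append(ch)
        # Characters in the mark categories (Mn, Mc, Me) never get a selector.
        if not unicodedata.category(ch).startswith("M"):
            out.append(VS14)
    return "".join(out)


word = "f\u00EAte"  # "fête" typed with the precomposed U+00EA
marked = italicize_with_vs14(word)
print(" ".join(f"U+{ord(c):04X}" for c in marked))
```

Both "spellings" of the word give the same result, because the precomposed letter is decomposed before any selector is added, and the selector always lands between the base letter and its combining mark.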
Cheers, James From unicode at unicode.org Fri Feb 8 20:08:34 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 8 Feb 2019 18:08:34 -0800 Subject: Encoding italic In-Reply-To: <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com> Message-ID: <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 8 20:45:35 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 9 Feb 2019 02:45:35 +0000 Subject: Encoding italic In-Reply-To: <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com> <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com> Message-ID: Asmus Freytag wrote, > You are still making the assumption that selecting a different glyph for > the base character would automatically lead to the selection of a different > glyph for the combining mark that 
follows. That's an iffy assumption
> because "italics" can be realized by choosing a separate font (typographically,
> italics is realized as a separate typeface).
>
> There's no such assumption built into the definition of a VS. At best, inside
> the same font, there may be an implied ligature, but that does not work if
> there's an underlying font switch.

Midstream font switching isn't a user option in most plain-text applications, although there can be some font substitution happening at the OS level. Any combining mark must apply to its base letter glyph, even after a base letter glyph has been modified.

More sophisticated editors, like BabelPad, allow users to select different fonts for different ranges of Unicode. If a user selects font X for ASCII and font Y for combining marks, then mark positioning is already broken.

If the user selects Times New Roman for both ASCII and combining marks, then no font switching is involved. The Times New Roman typeface includes italic letter form variants. Any application sharp enough to know that the italic letter form variants are stored in a different computer *file* should be clever enough to apply mark positioning accordingly. And any single font file which includes italic letters and maps them with VS14 would avoid any such issues altogether.
From unicode at unicode.org Fri Feb 8 23:33:49 2019 From: unicode at unicode.org (=?UTF-8?Q?Elias_M=C3=A5rtenson?= via Unicode) Date: Sat, 9 Feb 2019 13:33:49 +0800 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83va1yte1j.fsf@gnu.org> References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83va1yte1j.fsf@gnu.org> Message-ID: On Wed, 6 Feb 2019, 00:09 Eli Zaretskii via Unicode > Moreover, emitting the control sequences that set the mode is in > itself a complication, because if the terminal doesn't support them, > the result could be corrupted display. You will need methods of > detecting the support, and those detection methods usually involve > sending another control sequence to the terminal and waiting for > response, something that complicates applications and causes delays in > displaying output. > That's what the TERM environment variable is for though. Regards, Elias -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Feb 9 01:42:09 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sat, 09 Feb 2019 09:42:09 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <20190209001814.4813f28f@JRWUBU2> (message from Richard Wordingham via Unicode on Sat, 9 Feb 2019 00:18:14 +0000) References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> Message-ID: <83y36po1bi.fsf@gnu.org> > Date: Sat, 9 Feb 2019 00:18:14 +0000 > From: Richard Wordingham via Unicode > > > For character composition, you must have a shaping engine to talk to, > > and the shaper should tell you the width of each grapheme cluster it > > returns. > > (a) What defines the grapheme clusters? The definition might be > terminal-specific. Well, the "you" above alluded to the terminal emulator, of course. The grapheme clusters are determined by the shaping engine that the emulator must call when appropriate (or always). > (b) With a terminal that expects a fixed width font, surely the > terminal decides how many cells it allocates to a group of characters, > and the font designer has to come up with a suitable value based on > that. Yes. A terminal emulator that works with a shaper should probably post-process the width information returned by the shaper for these purposes. > > I don't see how you can expect wcwidth, or any other > > interface that was designed to work with _characters_, to be useful > > when you need to display grapheme clusters. 
> > Well I can envisage a decision being made that a grapheme cluster str > (as decreed by the terminal) shall occupy wcswidth(str) cells - "The > wcswidth() function returns the number of column positions for the > wide-character string s, truncated to at most length n". AFAIU, the shaping engine returns its output in terms of font glyph numbers, not character codepoints, so you cannot in general call wcswidth on them. The shaper also returns the advance information, which serves instead of wcwidth and related APIs for determining the actual width on display. From unicode at unicode.org Sat Feb 9 02:10:56 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sat, 09 Feb 2019 10:10:56 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Elias Mårtenson on Sat, 9 Feb 2019 13:33:49 +0800) References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <83va1yte1j.fsf@gnu.org> Message-ID: <83sgwxnzzj.fsf@gnu.org> > From: Elias Mårtenson > Date: Sat, 9 Feb 2019 13:33:49 +0800 > Cc: Egmont Koblinger , unicode > > Moreover, emitting the control sequences that set the mode is in > itself a complication, because if the terminal doesn't support them, > the result could be corrupted display. You will need methods of > detecting the support, and those detection methods usually involve > sending another control sequence to the terminal and waiting for > response, something that complicates applications and causes delays in > displaying output. > > That's what the TERM environment variable is for though. That's not indicative enough when some version of a terminal starts to support a feature not supported by previous versions of the same terminal.
Happens a lot with terminal emulators such as xterm, which are under active development, and add features all the time. From unicode at unicode.org Sat Feb 9 04:58:05 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Feb 2019 10:58:05 +0000 Subject: Encoding italic In-Reply-To: <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com> <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com> <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com> Message-ID: <20190209105805.3884e35f@JRWUBU2> On Fri, 8 Feb 2019 18:08:34 -0800 Asmus Freytag via Unicode wrote: > On 2/8/2019 5:42 PM, James Kass via Unicode wrote: > You are still making the assumption that selecting a different glyph > for the base character would automatically lead to the selection of a > different glyph for the combining mark that follows. That's an iffy > assumption because "italics" can be realized by choosing a separate > font (typographically, italics is realized as a separate typeface). The usual practice is to look for a font that supports both base character and mark. > Under the implicit assumptions bandied about here, the VS approach > thus reveals itself as a true rich-text solution (font switching) > albeit realized with pseudo coding rather than markup, markdown or > escape sequences. Isn't that already the case if one uses variation sequences to choose between Chinese and Japanese glyphs? >> Of course, the user might insert VS14s without application >> assistance. 
In which case hopefully the user knows the rules. The >> worst case scenario is where the user might insert a VS14 after a >> non-base character, in which case it should simply be ignored by any >> application. It should never “break” the display or the processing; >> it simply makes the text for that document non-conformant. (Of >> course putting a VS14 after ??? should not result in an italicized >> ???.) Is there any obligation on applications to ignore it? In plain text, the Unicode rules allow the application to choose to render every third '?' as italic. Possibly it comes down to the mens rea of the application (or of its coder or specifier), but without mentalism an application could opt to treat as . A relevant concern would be 'voracious' with the first 'o' italicised by VS14. How would current typeface selection logic work? I can envisage only being in the cmap of an italic font. Richard. From unicode at unicode.org Sat Feb 9 05:50:59 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Sat, 09 Feb 2019 12:50:59 +0100 Subject: Encoding italic In-Reply-To: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com> Message-ID: Den 2019-02-08 21:53, skrev "Doug Ewell via Unicode" : > I'd like to propose encoding italics and similar display attributes in > plain text using the following stateful mechanism: Note that these do NOT nest (no stack...), just state changes for the relevant PART of the "graphic" (i.e. style) state. So the approach in that regard is quite different from the approach done in HTML/CSS. > • Italics on: ESC [3m > • Italics off: ESC [23m > • Bold on: ESC [1m > • Bold off: ESC [22m > • Underline on: ESC [4m (implies turning double underline off) Underline, double: ESC [21m (implies turning single underline off) > • Underline off: ESC [24m > • Strikethrough on: ESC [9m > • Strikethrough off: ESC [29m > • Reverse on: ESC [7m > • 
Reverse off: ESC [27m "Reverse" = "switch background and foreground colours". This is an (odd) colour thing. If you want to go with (full!) colour (foreground and background), fine, but the "reverse" is oddball (and based on what really old terminals were limited to when it comes to colour). I'd rather include 'ESC [50m' (not variable spacing, i.e. "monospace" font) and 'ESC [26m' (variable spacing, i.e. "proportional" font). Recall that this is NOT for terminal emulators but for styling applied to text outside of terminal emulators. (Terminal emulators already implement much of this and more; albeit sometimes wrongly). This would be handy for including (say) programming code or computer commands (or for that matter, "ASCII art", or more generally "Unicode art") in otherwise "ordinary" text... (The "ordinary" text preferably set in a proportional font.) > • Reset all attributes: ESC [m (Actually 'ESC [0m', with the 0 default-able.) Handy, agreed, but not 100% necessary. These ESC-sequences should not normally be inserted "manually" but by a text editor program, using the conventional means of "making bold" etc. (ctrl-b, cmd-b, "bold" in a menu); only "hackers" (in the positive sense) would actually bother about the command sequences as such. /Kent K > where ESC is U+001B. > > This mechanism has existed for around 40 years and is already supported > as widely as any new Unicode-only convention will ever be. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > From unicode at unicode.org Sat Feb 9 05:51:19 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Sat, 09 Feb 2019 12:51:19 +0100 Subject: Encoding italic In-Reply-To: Message-ID: Den 2019-02-08 22:29, skrev "Egmont Koblinger via Unicode" : > (Mind you, I don't find it a good idea to add italic and whatnot > formatting support to Unicode at all... but let's put aside that now.) 
I don't think Doug meant to "add it to the Unicode standard", just to have a summary of "handy esc-sequences (actually command-sequences) for simple styling of text" picked from long-standing (text level...) standards. > There are a lot of problems with these escape sequences, and if you go > for a potentially new standard, you might not want to carry these > problems. > > There is not a well-defined framework for escape sequences. In this > particular case you might say it starts with ESC [ and ends with the > letter 'm', but how do you know where to end the sequence if that > letter 'm' just doesn't arrive? Terminal emulators have extremely There is an overriding "basic (overall) syntax" for esc-seq/ command-sequences that do not include a string argument (like OSC, APC, ...). IIUC it is (originally as byte sequences, but here as character sequences): \u001B[\u0020-\u002F]*[\u0030-\u007E] | (\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E] (no newline or carriage return in there). True, that has no direct limit, but it would not be unreasonable to set a limit of (say) max 30 characters. Potential (i.e. starting with ESC) esc-"sequences" that do not match the overall syntax or are too long can simply be rendered as is (except for the ESC itself). The esc/command sequences (that match) but are not interpreted should be ignored in "normal" (not "show invisibles" mode) display. They are unlikely to be "default ignored" by such things as sorting (and should preferably be filtered out beforehand, if possible). But if we compare to other rich text editors, the command sequences should be ignored by (interactive) searching, just like HTML tags are ignored in interactive searching (the internal representation "skipping" the HTML tags in one way or another). HTML tags should also (when the text is known to be HTML) be filtered out before doing such things as sorting. > complex tables for parsing (and still many of them get plenty of > things wrong). 
It's unreasonable for any random small utility > processing Unicode text to go into this business of recognizing all > the well-known escape sequences, not even to the extent to know where > they end. Whatever is designed should be much more easily parseable. > Should you say "everything from ESC[ to m", you'll cause a whole bunch > of problems when a different kind of escape sequence gets interpreted > as Unicode. The escape/command sequences would not be part of Unicode (standard). > A parser, by the way, would also have to interpret combined sequences > like ESC[3;0;1m or alike, for which I don't see a good reason as > opposed to having separate sequences for each. Also, it should be Formally covered by the (non-Unicode) standards, but optional (IIUC). > carefully evaluated what to do with C1 (U+009B) instead of the C0 ESC[ > opening for an escape sequence - here terminal emulators vary. These > just make everything even more cumbersome. > > ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity". I think one should interpret these in a "modern" way, not looking too much at what old terminals were limited to. (Colour ("increased intensity") should be handled completely separately from bold.) > Should this scheme be extended for colors, too? What to do with the > legacy 8/16 as well as the 256-color extensions wrt. the color > palette? Should Unicode go into the business of defining a fixed set > of colors, or allow to alter the palette colors using the OSC 4 and > friends escape sequences which supported by about half of the terminal > emulators out there? IF extending to colour, only refer to "true colour" (RGB) command-sequence. The colour palette versions are for the limitations of (semi-)old terminals. > For 256-colors and truecolors, there are two or three syntaxes out > there regarding whether the separator is a colon or a semicolon. It can only be colon.
Using semicolon would interfere with the syntax for multiple style specifications in one command sequence. (I by mistake wrote a semicolon there in an earlier post; sorry.) > Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m > for curly underline. What to do with them? Where to draw the line what (Note colon, not semicolon, as separator.) Possible, partially matching the capabilities for underlining via CSS (solid, dotted, dashed, wavy, double). Depends on how much styling options one wants to pick up. > to add to Unicode and what not to? Will Unicode possibly be a I don't think anyone wants to make this part of the Unicode standard. (At the most a Unicode technical note..., from Unicode's point of view.) [...] > What to do with things that Unicode might also want to have, but > doesn't exist in terminal emulators due to their nature, such as > switching to a different font size? While ECMA-48 only has a palette (content defined by the implementation) of ten fonts, xterm (!), IIUC, has 'OSC 50; BEL' (it should be an ST not BEL, and it should be a DCS not an OSC...) for more general font switching. Not part of Doug's proposal summary of "good to implement command sequences". And it has a string parameter, so it cannot formally be a command-sequence (which can only have digits and some punctuation in them). But a much more limited 'ESC [50m' (not variable spacing, i.e. "monospace" font) and 'ESC [26m' (variable spacing, i.e. "proportional" font) (exactly which fonts are implementation defined), would be reasonable. Switch to monospace for code snippets, for quoting text from a terminal emulator, or for "ASCII/Unicode art" (which is still quite common). For font SIZE, ECMA-48 has: 'ESC [2 I' (select "computer decipoint", which seems to be the "point size" unit used on computers (slightly different from older point size units)), and then 'ESC [16 C' for 16 points. Not part of Doug's proposal summary of "good to implement command sequences".
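To make the parsing side concrete, here is a minimal Python sketch of splitting text into plain runs and control sequences along the lines of the overall syntax described in this message: CSI (ESC [ or U+009B), parameter characters 0x30-0x3F, intermediate characters 0x20-0x2F, one final character 0x40-0x7E. It is only an illustration under those assumptions, not code from ECMA-48 or from any terminal emulator; the names CSI_RE, MAX_LEN, split_text and sgr_params are made up for the example, and the 30-character cap is just the "say, max 30" figure suggested above.

```python
import re

# CSI introducer, then parameter, intermediate and final characters.
CSI_RE = re.compile(r"(?:\x1b\[|\x9b)([\x30-\x3f]*)([\x20-\x2f]*)([\x40-\x7e])")

MAX_LEN = 30  # hypothetical cap on the length of one control sequence


def split_text(text):
    """Yield ('text', run) and ('csi', params, intermediates, final) tuples."""
    pos = 0
    for m in CSI_RE.finditer(text):
        if len(m.group(0)) > MAX_LEN:
            continue  # over-long candidate: leave it in the plain-text run
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        yield ("csi", m.group(1), m.group(2), m.group(3))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])


def sgr_params(params):
    """Split 'p1;p2' into parameters; ':' separates sub-parameters.

    An empty parameter defaults to 0, so the params of 'ESC [m' read as [[0]].
    """
    return [[int(s) if s else 0 for s in p.split(":")] for p in params.split(";")]
```

With this sketch, 'ESC [2 I' and 'ESC [16 C' come out with the space as an intermediate character and 'I' or 'C' as the final character, while 'ESC [4:3m' yields a single parameter with two sub-parameters and 'ESC [3;0;1m' yields three separate parameters.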
(Note the space before the terminating letter of the sequences!) ECMA-48 even has a font stretch command: 'ESC [; B'. E.g. double height would be 'ESC [200;100 B' (I don't think these accumulate, so it's relative to the set font size). Condensed style (narrowing the characters but keeping the height) would, e.g., be 'ESC [100;75 B' (compare the 'wdth' design axis in OpenType). (So for the time ECMA-48 was made, it is quite advanced on these points.) As you can see, these things are aimed at typography/"print", not terminal emulators... And not part of Doug's proposal. >> This mechanism [...] is already supported >> as widely as any new Unicode-only convention will ever be. > > I truly doubt this, these escape sequences are specific to terminal > emulation, an extremely narrow subset of where Unicode is used and > rich text might be desired. This kind of command sequence is popular to implement in terminal (emulators), but the text styling command sequences are not at all (from a standards, and technical, point of view) limited to terminal (emulators). > I see it a much more viable approach if Unicode goes for something > brand new, something clean, easily parseable, and it remains the job If done right, I don't see that the command sequences are that hard to PARSE. (Doing it wrong will of course get you into all sorts of trouble.) Interpreting (a selection) of them is slightly harder, but it is stuff that is very commonly implemented (bold, underline, ...) as long as one does not get into the somewhat more advanced stuff like condensed/extended by percentage (which is not so commonly implemented, and not part of Doug's proposal). > of specific applications to serve as a bridge between the two worlds. > Or, if it wants to adopt some already existing technology, I find > HTML/CSS a much better starting point. (X)HTML/CSS is fine. But it requires 1) a "second" level of parsing (actually several different parsers), and 2) is a huge task to implement. Command sequences (à 
la ECMA-48) are 1) possible to parse out at the text level, and 2) interpretation can be limited to "simple" styling (like in Doug's proposal, perhaps extended some (or a lot, depending)...), and then is a much smaller implementation task than HTML/CSS. /Kent K From unicode at unicode.org Sat Feb 9 06:52:30 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Sat, 9 Feb 2019 04:52:30 -0800 Subject: Encoding italic In-Reply-To: References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com> Message-ID: On Sat, Feb 9, 2019 at 3:59 AM Kent Karlsson via Unicode < unicode at unicode.org> wrote: > > Den 2019-02-08 21:53, skrev "Doug Ewell via Unicode" >: > > • Reverse on: ESC [7m > > • Reverse off: ESC [27m > > "Reverse" = "switch background and foreground colours". > > This is an (odd) colour thing. If you want to go with (full!) colour > (foreground and background), fine, but the "reverse" is oddball (and > based on what really old terminals were limited to when it comes to > colour). > Note that this is actually the only thing that stands out to me in Unicode not supporting older character sets; in PETSCII (Commodore 64), the high-bit characters were the reverse (in this sense) of the low-bit characters. From unicode at unicode.org Sat Feb 9 07:35:18 2019 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Sat, 9 Feb 2019 05:35:18 -0800 Subject: Encoding italic In-Reply-To: References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com> Message-ID: On Sat, Feb 9, 2019 at 4:58 AM David Starner via Unicode < unicode at unicode.org> wrote: > > On Sat, Feb 9, 2019 at 3:59 AM Kent Karlsson via Unicode < > unicode at unicode.org> wrote: > >> >> Den 2019-02-08 21:53, skrev "Doug Ewell via Unicode" > >: >> > • Reverse on: ESC [7m >> > • 
Reverse off: ESC [27m >> >> "Reverse" = "switch background and foreground colours". >> >> This is an (odd) colour thing. If you want to go with (full!) colour >> (foreground and background), fine, but the "reverse" is oddball (and >> based on what really old terminals were limited to when it comes to >> colour). >> > > Note that this is actually the only thing that stands out to me in Unicode > not supporting older character sets; in PETSCII (Commodore 64), the > high-bit characters were the reverse (in this sense) of the > low-bit characters. > This is true; many legacy character sets encoded reverse-video characters as wholly-separate characters, and even allowed them in contexts widely considered plain-text such as file names. This makes reverse-video possibly the one text attribute best argued to be worthy of encoding in Unicode. But I can already tell you it won't work, because we made such an argument in an early version of L2/19-025, and even proposed using VS14, the very same VS William Overington has since swiped from us for italics. That proposal was shot down rather quickly. Bold, italics, etc. don't even stand a chance. 
From unicode at unicode.org Sat Feb 9 08:06:48 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Feb 2019 14:06:48 +0000 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83y36po1bi.fsf@gnu.org> References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> Message-ID: <20190209140648.12174bf4@JRWUBU2> On Sat, 09 Feb 2019 09:42:09 +0200 Eli Zaretskii via Unicode wrote: > > Date: Sat, 9 Feb 2019 00:18:14 +0000 > > From: Richard Wordingham via Unicode > > > > > For character composition, you must have a shaping engine to talk > > > to, and the shaper should tell you the width of each grapheme > > > cluster it returns. > > > > (a) What defines the grapheme clusters? The definition might be > > terminal-specific. > > Well, the "you" above alluded to the terminal emulator, of course. > The grapheme clusters are determined by the shaping engine that the > emulator must call when appropriate (or always). I find it very hard to believe that that is how it works with GNOME Terminal (Version 3.18.3, using VTE Version 0.42.5). At the command line I typed in the Khmer script string ក្កេក (KA, COENG, KA, SIGN E, KA), and saw the string split into four columns (KA, COENG), (KA), (SIGN E), (KA), with each column given the same width. When written correctly, SIGN E is first in visual order. The fourth column was displayed on top of the third column, which contained a dotted circle to show that SIGN E on its own was not grammatically correct. If I were writing a Khmer font for use with Gnome terminal, I would attempt to ensure that the display for SIGN E fitted in a single cell. Of course, the renderer's grapheme cluster boundaries don't always match appearances. 
To get the traditional placement of U+1A58 TAI THAM SIGN MAI KANG LAI, I end up with it being a mark glyph one cluster later than HarfBuzz indicates it to be. It would be good to be able to access a maintained statement of the VTE rules for allocating characters to a cell, or group of cells, as appropriate. > > (b) With a terminal that expects a fixed width font, surely the > > terminal decides how many cells it allocates to a group of > > characters, and the font designer has to come up with a suitable > > value based on that. > > Yes. A terminal emulator that works with a shaper should probably > post-process the width information returned by the shaper for these > purposes. Perhaps it should base the number of cells on the width of the clusters. However, continuing with my example, U+1789 KHMER LETTER NYO as a base character is too wide to fit in a cell, and the next character will overwrite its right-hand part. From this I deduce that it is allocated just one cell. Gnome terminal is not alone in doing this, but it does better than some, in my opinion, in that the overflow of the foreground of one cell is not obliterated by the background of the next cell. U+1789 has an East Asian width property of 'Neutral', which is distinctly unhelpful. What I would like is a specification of what a font must do to avoid such problems. > > > I don't see how you can expect wcwidth, or any other > > > interface that was designed to work with _characters_, to be > > > useful when you need to display grapheme clusters. It, or something similar but worse, gets used, especially when moving the cursor for editing. > > Well I can envisage a decision being made that a grapheme cluster > > str (as decreed by the terminal) shall occupy wcswidth(str) cells - > > "The wcswidth() function returns the number of column positions for > > the wide-character string s, truncated to at most length n". 
> > AFAIU, the shaping engine returns its output in terms of font glyph > numbers, not character codepoints, so you cannot in general call > wcswidth on them. The shaper also returns the advance information, > which serves instead of wcwidth and related APIs for determining the > actual width on display. Unfortunately, when the rectangular grid is being preserved, typographical advance width is generally ignored when determining the placement of characters. Now, this is not always true; one can have the situation where the positioning of characters respects the advance widths, but the positioning of the cursor assumes a fixed-width rectangular grid. I have found working with that to be extremely confusing. Richard. From unicode at unicode.org Sat Feb 9 08:21:24 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Feb 2019 14:21:24 +0000 Subject: Encoding italic In-Reply-To: References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com> Message-ID: <20190209142124.66d3edbb@JRWUBU2> On Sat, 9 Feb 2019 04:52:30 -0800 David Starner via Unicode wrote: > Note that this is actually the only thing that stands out to me in > Unicode not supporting older character sets; in PETSCII (Commodore > 64), the high-bit character characters were the reverse (in this > sense) of the low-bit characters. Later ISCII has some styling codes, bold and italic amongst them. Richard. From unicode at unicode.org Sat Feb 9 04:54:54 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Sat, 9 Feb 2019 10:54:54 +0000 (GMT) Subject: Encoding colour (from Re: Encoding italic) In-Reply-To: References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com> Message-ID: <642c3d7c.6bf3.168d1e557a8.Webtop.71@btinternet.com> Egmont Koblinger wrote: > Should this scheme be extended for colors, too? What to do with the legacy 8/16 as well as the 256-color extensions wrt. 
the color palette? Should Unicode go into the business of defining a fixed set of colors, or allow to alter the palette colors using the OSC 4 and friends escape sequences which supported by about half of the terminal emulators out there? Encoding colour is already a topic in relation to emoji and maybe could be extended to other characters. A stateful method, though which might be useful for plain text streams in some applications, would be to encode as characters some of the glyphs for indicating colours and the digit characters to go with them from page 5 and from page 3 of the following publication. http://www.users.globalnet.co.uk/~ngo/locse027.pdf > What to do with things that Unicode might also want to have, but > doesn't exist in terminal emulators due to their nature, such as switching to a different font size? Well, if people were to want to do it, there could be a character encoded in the Specials section and then use that character as a base character and follow it with a sequence of tag characters. William Overington Saturday 9 February 2019 From unicode at unicode.org Sat Feb 9 07:12:03 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Sat, 9 Feb 2019 13:12:03 +0000 (GMT) Subject: Encoding colour (from Re: Encoding italic) In-Reply-To: <642c3d7c.6bf3.168d1e557a8.Webtop.71@btinternet.com> References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com> <642c3d7c.6bf3.168d1e557a8.Webtop.71@btinternet.com> Message-ID: <79e6f552.6d26.168d262e56c.Webtop.71@btinternet.com> Previously I wrote: > A stateful method, though which might be useful for plain text streams > in some applications, would be to encode as characters some of the > glyphs for indicating colours and the digit characters to go with them > from page 5 and from page 3 of the following publication. 
> http://www.users.globalnet.co.uk/~ngo/locse027.pdf Thinking about this further, for this application copies of the glyphs could be redesigned so as to be square and could be emoji-style and the meanings of the characters specifying which colour component is to be set could be changed so that they refer to the number previously entered using one or more of the special digit characters. Thus the setting of colour components could be done in the same reverse notation way that the FORTH computer language works. Yet although the colour components thus set would be stateful until changed there would be no Escape sequence and if an application did not support interpretation of the characters as setting colours, they would just be displayed as glyphs, each either as a particular glyph or as a .notdef glyph. William Overington Saturday 9 February 2019 From unicode at unicode.org Sat Feb 9 11:42:52 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 18:42:52 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <20190209140648.12174bf4@JRWUBU2> References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> Message-ID: Hi Richard, On Sat, Feb 9, 2019 at 3:08 PM Richard Wordingham via Unicode wrote: > It would be good to be able to access a maintained statement of the > VTE rules for allocating characters to a cell, or group of cells, as > appropriate. What VTE did, up to a couple of days ago: It opens the font, and measures the ASCII 33-126 or so characters, takes their average size (well, in case of monospace font, they should all have the same size), this determines the cell size. 
Then every character cell is rendered individually, using Pango or Cairo or I'm not sure what exactly - there are like three paths in the source, the details are unclear to me. A cell might contain a base character + nonspacing combining accents, these are passed together to Pango and friends, so they render it as one unit. The glyph is aligned to the left of its designated cell area, overflowing on the right (and thus potentially overlapping with the next glyph) if it's wider than its designated area. As a special case, two adjacent cells might contain a double wide (typically CJK) character, but it's not that special after all: it's also displayed aligned to the left edge of its first cell. What I improved a couple of days ago (to be released in vte-0.56), for Devanagari and friends, although I know there's more than this to address these scripts properly: If a cell contains a regular letter, and the next cell contains a spacing combining mark, then these two are passed to Pango in a single step, that is, the spacing combining mark is applied around its base letter by Pango as expected. (Previously the spacing combining mark was rendered on its own, around a dotted circle, which was obviously pretty bad.) What I'm working on currently, as you all know by now, is BiDi-shuffling the cells before rendering them (hopefully for vte-0.58). This is how VTE works now, but it's by no means a specification, and tailoring a font to this behavior is probably not the right approach. Instead, VTE's behavior should be improved. We have a pending feature request (which I've already linked) to use HarfBuzz for rendering the glyphs, which would then render grapheme clusters beautifully. The problem that I don't know how to address is: What if harfbuzz tells us that the overall width for rendering a particular grapheme cluster is significantly different from its designated area (the number of character cells [wcswidth()] multiplied by the width of each)? 
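The cell allocation described above (a base character plus its nonspacing marks share a cell, a double-wide character takes two) can be sketched roughly as follows. This is a simplified illustration built on Python's unicodedata module, not VTE's actual code; cell_width and to_cells are hypothetical names, and real emulators use their own width tables.

```python
import unicodedata


def cell_width(ch):
    """Columns for one code point: 0 for nonspacing or enclosing marks,
    2 for East Asian wide or fullwidth characters, otherwise 1."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def to_cells(text):
    """Group text into (cluster, columns) pairs, one pair per base character."""
    cells = []
    for ch in text:
        width = cell_width(ch)
        if width == 0 and cells:
            cluster, cols = cells[-1]
            cells[-1] = (cluster + ch, cols)  # attach the mark to its base
        else:
            cells.append((ch, max(width, 1)))  # a stray mark still gets a cell
    return cells
```

Under these rules a spacing combining mark (category Mc) gets a cell of its own, so the Khmer sequence KA, COENG, KA, SIGN E, KA comes out as four one-column entries, with only COENG attached to the preceding KA.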
cheers, egmont
From unicode at unicode.org Sat Feb 9 12:07:06 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sat, 09 Feb 2019 20:07:06 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Egmont Koblinger via Unicode on Sat, 9 Feb 2019 18:42:52 +0100) References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> Message-ID: <83sgwwn8dx.fsf@gnu.org> > Date: Sat, 9 Feb 2019 18:42:52 +0100 > Cc: unicode Unicode Discussion > From: Egmont Koblinger via Unicode > > What if harfbuzz tells us that the overall width for rendering a > particular grapheme cluster is significantly different from its > designated area (the number of character cells [wcswidth()] > multiplied by the width of each)? You need to use what HarfBuzz tells you _instead_ of wcswidth. It is in general wrong to use wcswidth or anything similar when you use a shaping engine and support complex script shaping. 
From unicode at unicode.org Sat Feb 9 12:25:08 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 19:25:08 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83sgwwn8dx.fsf@gnu.org> References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: On Sat, Feb 9, 2019 at 7:07 PM Eli Zaretskii wrote: > You need to use what HarfBuzz tells you _instead_ of wcswidth. It is > in general wrong to use wcswidth or anything similar when you use a > shaping engine and support complex script shaping. This approach is not viable at all. Terminal emulators have an internal data structure that they maintain, a matrix of character cells. Every operation is performed here, every escape sequence is defined on this layer what it does, the cursor position is tracked on this layer, etc. You can move the cursor to integer coordinates, overwrite the letter in that cell, and do plenty of other operations (like push the rest to the right by one cell). If you change these fundamentals, most of the terminal-based applications will fall apart big time. This behavior has to be absolutely independent from the font. The application running inside the terminal doesn't and cannot know what font you use, let alone how harfbuzz is about to render it. (You can even have no font at all, such as with the libvterm headless emulator library, or a detached screen or tmux session; or have multiple fonts at the same time if a screen or tmux session is attached from multiple graphical emulators.) 
So one part of a terminal emulator's code is responsible for maintaining this matrix of characters according to the input it receives. Another part of their code is responsible for presenting this matrix of characters on the UI, doing the best it can. If you say that the font should determine the logical width, you need to start building up something brand new from scratch. You need to have something that doesn't have concepts like "width in characters". You need to redefine cursor movement and many other escape sequences. You need to heavily adjust the behavior of a gazillion of software, e.g. zip's two-column output, anything that aligns in columns (e.g. midnight commander, tmux's vertical split etc.), the shell's (or readline's) command editing and wrapping to multiple lines, ncurses, and so on, all the way to e.g. fullscreen text editors like Emacs. And then we're not talking about terminal emulators anymore, as we know them now, but something new, something pretty different. Terminal emulators do have strong limitations. Complex text rendering can only work to the extent we can squeeze it into these limitations. 
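The cell-matrix model described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name CellGrid and its methods are hypothetical, it is not VTE's actual data structure, and real emulators store far more per cell (attributes, combining characters, wide-character flags).

```python
class CellGrid:
    """Toy model of a terminal's character-cell matrix (hypothetical, not VTE's code)."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.cells = [[" "] * cols for _ in range(rows)]
        self.cursor = (0, 0)  # (row, col); always integer coordinates

    def move_cursor(self, row, col):
        # Cursor addressing is defined purely on the grid, independent of any font.
        row = max(0, min(row, self.rows - 1))
        col = max(0, min(col, self.cols - 1))
        self.cursor = (row, col)

    def overwrite(self, ch):
        # Replace the content of the cell under the cursor.
        row, col = self.cursor
        self.cells[row][col] = ch

    def insert_blank(self):
        # Push the rest of the row right by one cell; the last cell falls off.
        row, col = self.cursor
        line = self.cells[row]
        line.insert(col, " ")
        line.pop()

    def row_text(self, row):
        return "".join(self.cells[row])


grid = CellGrid(rows=2, cols=8)
for i, ch in enumerate("abcdefg|"):
    grid.move_cursor(0, i)
    grid.overwrite(ch)
grid.move_cursor(0, 3)
grid.insert_blank()
print(repr(grid.row_text(0)))
```

Nothing in this sketch knows about fonts or rendered widths, which is the point being made: the presentation layer is a separate step that draws whatever this matrix holds.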
cheers, egmont From unicode at unicode.org Sat Feb 9 12:55:58 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sat, 09 Feb 2019 20:55:58 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Egmont Koblinger on Sat, 9 Feb 2019 19:25:08 +0100) References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: <83r2cgn64h.fsf@gnu.org> > From: Egmont Koblinger > Date: Sat, 9 Feb 2019 19:25:08 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > > You need to use what HarfBuzz tells you _instead_ of wcswidth. It is > > in general wrong to use wcswidth or anything similar when you use a > > shaping engine and support complex script shaping. > > This approach is not viable at all. > [...] I'm probably missing something, because I don't see the grave problems you hint at. Any width provided back by a shaper can be rounded to the nearest integral character cell, so your canvas can still remain rectangular. And I see no reason why an application should be bothered by the actual number of character cells occupied by the text it wrote on display. So what exactly is not viable in using the width reported back by the shaper? > If you say that the font should determine the logical width, you need > to start building up something brand new from scratch. Are you saying that a terminal cannot work with variable-pitch fonts? > Terminal emulators do have strong limitations. Complex text rendering > can only work to the extent we can squeeze it into these limitations. No one said anything to the contrary, AFAICT. 
From unicode at unicode.org Sat Feb 9 13:03:21 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 20:03:21 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83r2cgn64h.fsf@gnu.org> References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> Message-ID: On Sat, Feb 9, 2019 at 7:56 PM Eli Zaretskii wrote: > I'm probably missing something, because I don't see the grave problems > you hint at. Any width provided back by a shaper can be rounded to > the nearest integral character cell, so your canvas can still remain > rectangular. Let's suppose a utility outputs these two lines of text: abcdefg| complex| where "abcdefg" are these English letters themselves, but "complex" is a word of some language requiring complex script rendering, taking up 7 logical cells (because that's what wcwidth() says). Also, "|" is the pipe symbol, or a vertical box drawing line, whatever. Now let's assume that harfbuzz tells you that the desired width for rendering this "complex" word is 5.3 times the width of the character cell. Or 8.6 times it. How to proceed? How will the "|" bars line up, so that mc's two-panel layout, tmux's vertical split etc. don't fall apart? In the latter case, when the width requested by harfbuzz is bigger than the designated width, what to do with characters that "fall off" at the right edge of the terminal? e. 
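To make the misalignment in the example above concrete, here is a rough sketch. Everything in it is an assumption made for illustration: logical_width() is a simplified stand-in for wcswidth() (0 for nonspacing and enclosing marks, 2 for East Asian Wide/Fullwidth, 1 otherwise), CELL_PX is an arbitrary cell width, and the shaped widths 5.3 and 8.6 are the hypothetical figures from the message, not output from HarfBuzz.

```python
import unicodedata

CELL_PX = 10  # hypothetical width of one character cell, in pixels


def logical_width(text):
    """Simplified stand-in for wcswidth(); not the real libc tables."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # nonspacing/enclosing marks occupy no cell of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def bar_position_px(shaped_width_cells):
    """Where the '|' would start if the word before it were laid out
    at the width requested by the shaper instead of on the grid."""
    return shaped_width_cells * CELL_PX


# On the grid, the bar after a 7-cell word always starts at the same place.
grid_bar_px = logical_width("abcdefg") * CELL_PX

for shaped in (7.0, 5.3, 8.6):
    drift = bar_position_px(shaped) - grid_bar_px
    print(f"shaped width {shaped} cells: bar drifts by {drift:+.1f} px")
```

With any shaped width other than 7 cells the two bars no longer share a column, which is the problem for column-aligned output described above.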
From unicode at unicode.org Sat Feb 9 13:12:03 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 9 Feb 2019 11:12:03 -0800 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Feb 9 13:13:29 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sat, 09 Feb 2019 21:13:29 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Egmont Koblinger on Sat, 9 Feb 2019 20:03:21 +0100) References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> Message-ID: <83pns0n5ba.fsf@gnu.org> > From: Egmont Koblinger > Date: Sat, 9 Feb 2019 20:03:21 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > Let's suppose a utility outputs these two lines of text: > abcdefg| > complex| > > whereas "abcdefg" are these English letters themselves, but "complex" > is a word of some language requiring complex script rendering, taking > up 7 logical cells (because that's what wcwidth() says). Also, "|" is > the pipe symbol, or a vertical box drawing line, whatever. > > Now let's assume that harfbuzz tells you that the desired width for > rendering this "complex" word is 5.3 times the width of the character > cell. Or 8.6 times it. How to proceed? 
How will the "|" bars align up, > and thus mc's two-panel layout, tmux's vertical split etc. not fall > apart? In the latter case, when the width requested by harfbuzz is > bigger than the designated width, what to with characters that "fall > off" at the right edge of the terminal? That's the application's problem, not the terminal's. An application that wants its column to line up _and_ wants to support complex text scripts will need to move cursor to certain coordinates, not to assume that 7 codepoints always take 7 columns on display. Or it will have to tell the users to use specific fonts, which are known to provide guarantees that this happens. How is this different from using variable-pitch fonts? From unicode at unicode.org Sat Feb 9 13:36:50 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 20:36:50 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83pns0n5ba.fsf@gnu.org> References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> <83pns0n5ba.fsf@gnu.org> Message-ID: On Sat, Feb 9, 2019 at 8:13 PM Eli Zaretskii wrote: > That's the application's problem, not the terminal's. An application > that wants its column to line up _and_ wants to support complex text > scripts will need to move cursor to certain coordinates, not to assume > that 7 codepoints always take 7 columns on display. In order to do that, an application needs to know how wide a text will appear, which depends on the font. How will it know it? Will it by some means know the font and the rendering engine the terminal uses (even across ssh) and will it have to measure it itself? 
Or will it be able to ask the terminal? If so, how? Maybe a new extension, an asynchronous escape sequence that responds back with the measured width? What about the latency caused by the bunch of asynchronous roundtrips, especially over ssh? What about the utter pain and intrinsic unreliability of handling asynchronous responses, as I've outlined in a section of https://gitlab.freedesktop.org/terminal-wg/specifications/issues/8 ? What if there's no font? What if there are multiple fonts at the same time? What if the font is changed later on, is it okay then for the display of existing stuff to fall apart and only newly printed stuff to appear correctly? How do you define the "width of the terminal in characters", get/set by ioctl(..., TIOC[GS]WINSZ, ...) that many apps rely on? If you define it by any means, what if placing the maximum number of "i"s in a row doesn't fill up the entire width? Will that area be inaccessible, then? Or despite having a definition of terminal width, will there be new cells beyond this width to write to? What if filling a row with all "w"s overflows? I take it that an app shouldn't print there, but what if it still does, will that piece of text just not be shown? How much more complicated do you think implementing something like "zip -h" would become? > How is this different from using variable-pitch fonts? Do you mean variable-pitch font where the terminal still places each glyph in its designated area? The font is the private business of the terminal emulator, then, it'll just appear ugly as in a screenshot I've already linked, but the emulation behavior wouldn't care. Or do you mean variable-pitch font where each letter is placed after each other, as you'd expect in document editors? That is, way more "i"s than "w"s fitting in a line? It's not different, it's practically the same. 
And this is something that none of the terminal emulators I'm aware of does; and having some clue about terminal emulators, I can't imagine how one could do it (see all the questions above for a start). This is why I'm saying: Sure you can take this path, but then we're talking about something new, not terminal emulators as we currently know them. You can take this path, but then you'll have to rebuild many of the already existing apps, and beware, they'll get way more complex. e. From unicode at unicode.org Sat Feb 9 13:48:20 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 20:48:20 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: Hi Asmus, > On quick reading this appears to be a strong argument why such emulators will > never be able to be used for certain scripts. Effectively, the model described works > well with any scripts where characters are laid out (or can be laid out) in fixed > width cells that are linearly adjacent. I'm wondering if you happen to know: Are there any (non-CJK) scripts for which a mechanical typewriter does not exist due to the complexity of the script? Are there any (non-CJK) scripts for which crossword puzzles don't exist? For scripts where these do exist, is it perhaps an acceptable tradeoff to keep their limitations in the terminal emulator world as well, to combine the terminal emulator's power with these scripts? Honestly, even with English, all I have to do is "cat some_text_file", and chances are that a word is split in half at some random place where it hits the right margin. 
Even with just English, a terminal emulator isn't something that gives me a grammatically and typographically super pleasing or correct environment. It gives me something that I personally find grammatically and typographically "good enough", and in the mean time a powerful tool to get my work done. Obviously the more complex the script, the more tradeoffs there will be. I think it's a call each user has to make whether they prefer a terminal emulator or a graphical app for a certain kind of task. And if terminal emulators have a lower usage rate in these scripts, that's not necessarily a problem. If we can improve by small incremental changes, sure, let's do. If we'd need to heavily redesign plenty of fundamentals in order to improve, it most likely won't happen. cheers, egmont From unicode at unicode.org Sat Feb 9 14:01:21 2019 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sat, 09 Feb 2019 22:01:21 +0200 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: (message from Egmont Koblinger on Sat, 9 Feb 2019 20:36:50 +0100) References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> <83pns0n5ba.fsf@gnu.org> Message-ID: <83mun4n33i.fsf@gnu.org> > From: Egmont Koblinger > Date: Sat, 9 Feb 2019 20:36:50 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > On Sat, Feb 9, 2019 at 8:13 PM Eli Zaretskii wrote: > > > That's the application's problem, not the terminal's. 
An application > > that wants its column to line up _and_ wants to support complex text > > scripts will need to move cursor to certain coordinates, not to assume > > that 7 codepoints always take 7 columns on display. > > In order to do that, an application needs to know how wide a text will > appear, which depends on the font. How will it know it? I don't know. Maybe it keeps a database of character combinations that need shaping, each one with the maximum width on display the result can occupy. Or maybe it does something else. If it cannot, and the terminal cannot either, then what you say is that some scripts can never be supported by text terminals. From unicode at unicode.org Sat Feb 9 14:07:46 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 21:07:46 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83mun4n33i.fsf@gnu.org> References: <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> <83pns0n5ba.fsf@gnu.org> <83mun4n33i.fsf@gnu.org> Message-ID: On Sat, Feb 9, 2019 at 9:01 PM Eli Zaretskii wrote: > then what you say is that some scripts > can never be supported by text terminals. I'm not familiar at all with all the scripts and their requirements, but yes, basically this is what I'm saying. I'm afraid some scripts can never be perfectly supported by text terminals. I hope though that all the scripts can be supported with more or less compromises, e.g. like it would appear in a crossword. But maybe not. Maybe one day some new, modern platform will arise with the goal of replacing terminal emulators, which I wouldn't necessarily mind. 
It's gonna take an enormous amount of work, though. cheers, egmont From unicode at unicode.org Sat Feb 9 14:22:06 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Sat, 9 Feb 2019 12:22:06 -0800 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: Egmont, On 2/9/2019 11:48 AM, Egmont Koblinger via Unicode wrote: > Are there any (non-CJK) scripts for which crossword puzzles don't exist? There are crossword puzzles for Hindi (in the Devanagari script). Just do an image search for "Hindi crossword puzzle". But the conventions for these break up words into syllables fitting into the boxes, and the rules for that are complex. You have to allow for the placement of dependent vowels, which may take up extra space left or right, as well as consonant clusters, which would be expressed often as conjuncts in Sanskrit, but which in Hindi are more commonly rendered as dead consonant sequences. So the "stuff in a box" is: 1. Inherently proportional width. 2. Inherently multi-character in content. (underlying 1 to 3 or more characters per cell) This is the kind of compromise you would have to have to make for almost any Indic script, to enable a rational approach to building crossword puzzles that make sense. And in a terminal context, you probably would not get acceptable behavior for Hindi if you tried to just take all the "stuff in a box" chunks and tried to lay them out directly in a line, as if the script behaved more like CJK. The existence proof of techniques to cut up text into syllables that enable crossword puzzle building, is not the same as a determination that the script, ipso facto, would work in a terminal context without dealing with additional complex script issues. 
At any rate, this is once again straying over into the issue of whether terminals can be adapted for the requirements of shaping rules for complex scripts -- rather than the nominal subject of the thread, which has to do with bidi text layout in terminals. --Ken From unicode at unicode.org Sat Feb 9 14:26:06 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 21:26:06 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: Hi Ken, > There are crossword puzzles for Hindi (in the Devanagari script). Just > do an image search for "Hindi crossword puzzle". It's easy to confirm the existence by an image search, it's hard to confirm the non-existence ;) > The existence proof of techniques to cut up text into syllables that > enable crossword puzzle building, is not the same as a determination > that the script, ipso facto, would work in a terminal context without > dealing with additional complex script issues. Thanks a lot for your detailed explanation; this possibility indeed didn't occur to me. 
cheers, egmont From unicode at unicode.org Sat Feb 9 15:02:55 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Sat, 9 Feb 2019 13:02:55 -0800 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: On 2/9/2019 11:48 AM, Egmont Koblinger wrote: > Hi Asmus, > >> On quick reading this appears to be a strong argument why such emulators will >> never be able to be used for certain scripts. Effectively, the model described works >> well with any scripts where characters are laid out (or can be laid out) in fixed >> width cells that are linearly adjacent. > I'm wondering if you happen to know: > > Are there any (non-CJK) scripts for which a mechanical typewriter does > not exist due to the complexity of the script? Egmont, are you excluding CJK because of the difficulty handling a large repertoire with mechanical means? However, see: https://en.wikipedia.org/wiki/Chinese_typewriter > > Are there any (non-CJK) scripts for which crossword puzzles don't exist? > > For scripts where these do exist, is it perhaps an acceptable tradeoff > to keep their limitations in the terminal emulator world as well, to > combine the terminal emulator's power with these scripts? I agree with you that crossword puzzles and scrabble have a similar limitation to the design that you sketched for us. However, take a script that is written in syllables (each composed of 1-5 characters, say). In a "crossword" I could write this script so that each syllable occupies a cell. It would be possible to read such a puzzle, but trying to use such a draconian technique for running text would be painful, to say the least. (We are not even talking about pretty, here). 
Here's an example for Hindi: https://vargapaheli.blogspot.com/2017/ I don't read Hindi, but 5 vertical in the top puzzle, cell 2, looks like it contains both a consonant and a vowel. To force Hindi crosswords mode you need to segment the string into syllables, each having a variable number of characters, and then assign a single display position to them. Now some syllables are wider than others, so you could use the single/double width paradigm. The result may be somewhat legible for Devanagari, but even some of the closely related scripts may not fit that well. Now there are some scripts where the same syllable can be written in more than one form; the forms differing by how the elements are fused (or sometimes not fused) into a single shape. Sometimes, these differences are more "stylistic", more like an 'fi' ligature in English, sometimes they really indicate different words, or one of the forms is simply not correct (like trying to spell lam-alif in Arabic using two separate letters). I'm sure there are scripts that work rather poorly (effectively not at all) in cross- word mode. The question then becomes one of goals. Are you defining as your goal to have some kind of "line by line" display that can survive any Unicode text thrown at it, or are you trying to extend a given design with rather specific limitations, so that it survives / can be used with, just a few more scripts than European + CJK? > > Honestly, even with English, all I have to do is "cat some_text_file", > and chances are that a word is split in half at some random place > where it hits the right margin. Even with just English, a terminal > emulator isn't something that gives me a grammatically and > typographically super pleasing or correct environment. It gives me > something that I personally find grammatically and typographically > "good enough", and in the mean time a powerful tool to get my work > done. 
The discrepancies would be more like throwing random blank spaces in the middle of every word, writing letters out of order, or overprinting. So, more fundamental, not just "not perfect". To give you an idea, here is an Arabic crossword. It uses the isolated shape of all letters and writes all words unconnected. That's two things that may be acceptable for a puzzle, but not for text output. http://www.everyday-arabic.com/2013/12/crossword1.html (try typing 3 vertical as a word to see the difference - it's 4x U+062A) > > Obviously the more complex the script, the more tradeoffs there will > be. I think it's a call each user has to make whether they prefer a > terminal emulator or a graphical app for a certain kind of task. And > if terminal emulators have a lower usage rate in these scripts, that's > not necessarily a problem. If we can improve by small incremental > changes, sure, let's do. If we'd need to heavily redesign plenty of > fundamentals in order to improve, it most likely won't happen. > You may begin to see the limitations and that they may well prevent you from reaching even your limited goal for speakers of at least three of the top ten languages worldwide. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Feb 9 15:08:53 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 9 Feb 2019 13:08:53 -0800 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> <83pns0n5ba.fsf@gnu.org> <83mun4n33i.fsf@gnu.org> Message-ID: <9caa3dee-fa21-9420-756d-2c1d5f9b652b@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Feb 9 15:29:31 2019 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Sat, 9 Feb 2019 22:29:31 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <83mun4n33i.fsf@gnu.org> References: <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> <83pns0n5ba.fsf@gnu.org> <83mun4n33i.fsf@gnu.org> Message-ID: <20190209212931.dvsutuqft7qsfht3@angband.pl> On Sat, Feb 09, 2019 at 10:01:21PM +0200, Eli Zaretskii via Unicode wrote: > > From: Egmont Koblinger > > Date: Sat, 9 Feb 2019 20:36:50 +0100 > > Cc: Richard Wordingham , > > unicode Unicode Discussion > > > > On Sat, Feb 9, 2019 at 8:13 PM Eli Zaretskii wrote: > > > > > That's the application's problem, not the terminal's. An application > > > that wants its column to line up _and_ wants to support complex text > > > scripts will need to move cursor to certain coordinates, not to assume > > > that 7 codepoints always take 7 columns on display. It must know that those particular 7 codepoints take, say, 5 columns when written together in a sequence. And it can't possibly ask the terminal, either -- it might be on a link that doesn't allow metadata to pass, it might be broadcasted, its output might be recorded many years prior to being displayed. A good part of the time the program is even run on a different distribution/release/OS. Obviously, a program running with system libraries might suffer misalignment and thus visual corruption if those libraries don't know beyond, say, Unicode 13 yet the terminal expects Unicode 17 -- but that's no different from any other property incompatibly changing. Property changes for established characters are pretty rare thus no significant loss of interoperability can be expected over time. > > In order to do that, an application needs to know how wide a text will > > appear, which depends on the font. How will it know it? > > I don't know. 
Maybe it keeps a database of character combinations > that need shaping, each one with the maximum width on display the > result can occupy. Or maybe it does something else. If it cannot, > and the terminal cannot either, then what you say is that some scripts > can never be supported by text terminals. That's doable even within the current rules, where every codepoint bears a wcwidth of 0, 1 or 2. A cluster made of codepoints a ' b c d " ^ (where a b c d have widths 1 while ' " ^ widths 0) needs to be rendered in exactly 4 cells. This may force stretching or condensing the shaped cluster compared to what usual typography would demand but that's in no way different from stretching Latin "i" or condensing "W". Meow! -- Remember, the S in "IoT" stands for Security, while P stands for Privacy. From unicode at unicode.org Sat Feb 9 15:31:37 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 22:31:37 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: Hi Asmus, On Sat, Feb 9, 2019 at 10:02 PM Asmus Freytag (c) wrote: > are you excluding CJK because of the difficulty handling a large > repertoire with mechanical means? No, I excluded CJK because they're pretty well solved in terminals, and nowhere near along the lines of how they work with typewriters. I should've probably said "letter based" scripts or whatever, I'm not familiar with the exact terminologies. > To force Hindi crosswords mode you need to segment the string into syllables, > each having a variable number of characters [...] Thanks a lot to you too for your detailed explanation! 
> Are you defining as your goal to have some kind of "line by line" display that
> can survive any Unicode text thrown at it, or are you trying to extend a given
> design with rather specific limitations, so that it survives / can be used with,
> just a few more scripts than European + CJK?

I don't have a clearly defined goal. I find fun in developing VTE (and slightly improving other terminal emulators too by spreading ideas, knowledge, comments etc.), addressing various kinds of goals, whatever happens to come next. At this point it's BiDi, with a bit of Devanagari improvement sneaking in the other day.

What is clear to me: I cannot redefine the basics of terminal emulation. I can only add incremental improvements to whatever it already is, and I have to make sure that the ecosystem built around it over decades (all the screen handling libraries and applications) doesn't break. I'm limited by these constraints.

> The discrepancies would be more like throwing random blank spaces in the
> middle of every word, writing letters out of order, or overprinting. So, more
> fundamental, not just "not perfect".

Let's take the Devanagari improvement of the other day. Until now, there were plenty of dotted circles shown, and combining spacing marks that should've been placed before the letter were placed after the letter, before a placeholder dotted circle. Now they are displayed as expected: the combining spacing mark shows up before the letter (if it's of that kind), and no dotted circle. The letter + spacing marks now show up correctly. The entire word still doesn't, e.g. there are often spaces between letters where the upper line connecting them should be continuous.

Eventually HarfBuzz could help, but it's just not yet clear how exactly. I cannot essentially change the underlying model of fixed width cells. On top of this model, though, we can experiment with various ideas about displaying.
For example, if a word occupies 7 columns in the model, then HarfBuzz renders it, and the rendered version occupies the width of 8.6 columns, maybe we can squeeze it using a trivial linear transformation? I'm not sure, but maybe it's an idea worth investigating. Won't look perfect, but probably will look better than what we do currently. We already have column spacing implemented, to pull the columns further apart from each other by a fixed amount (mostly for accessibility purposes), maybe a user can use this feature to make more room for a nicely rendered, non-squeezed Devanagari text.

> To give you an idea, here is an Arabic crossword. It uses the isolated shape of
> all letters and writes all words unconnected. That's two things that may be
> acceptable for a puzzle, but not for text output.

You can't get nice Arabic without first making sure the order of the letters is the correct one, not reversed. :-) That's what my current work is about.

As per Richard's feedback, I also see that shaping needs to be done differently than I had thought. Mind you, in my visual inspection, what the non-preferred shaping approach gave me and what a proper HarfBuzz rendering gave (for Arabic) were extremely close to each other, something that I'd probably consider "good enough" if I spoke the language and were aware of the terminal's constraints. Well, definitely a major improvement over what we have.

> You may begin to see the limitations and that they may well prevent you from
> reaching even your limited goal for speakers of at least three of the top ten languages
> worldwide.

If the goal is to have perfect rendering without compromises: sure I won't reach that. (It's not a goal for me. For perfect rendering, users should get away from terminals.)

If the goal is to have something reasonably good, better than what we have currently, I can't see why not.

cheers,
e.
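
The squeezing idea in the message above (a word that the model allots 7 columns, but whose shaped width is 8.6 columns) can be sketched in a few lines of Python. This is only an illustration, not VTE code: `squeeze_factor`, `cell_width` and `shaped_width` are hypothetical names, and capping the factor at 1.0 (squeeze overlong runs, never stretch short ones) is an assumption.

```python
def squeeze_factor(cells: int, cell_width: float, shaped_width: float) -> float:
    """Horizontal scale that makes a shaped run fit its allotted cells.

    All names are hypothetical; both widths use the same unit (e.g. pixels).
    """
    if cells <= 0 or cell_width <= 0 or shaped_width <= 0:
        raise ValueError("cell count and widths must be positive")
    allotted = cells * cell_width
    # Assumption: only squeeze overlong runs, never stretch short ones.
    return min(1.0, allotted / shaped_width)


# A word allotted 7 cells whose shaped width is 8.6 cells (cell width 10 units).
factor = squeeze_factor(cells=7, cell_width=10.0, shaped_width=86.0)
print(f"scale the run horizontally by {factor:.3f}")
```

Applying the factor to the glyph advances and outlines of the run is the "trivial linear transformation"; how legible the result is would still depend on the script and the font.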
From unicode at unicode.org Sat Feb 9 15:40:12 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Sat, 9 Feb 2019 22:40:12 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <9caa3dee-fa21-9420-756d-2c1d5f9b652b@ix.netcom.com> References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> <83pns0n5ba.fsf@gnu.org> <83mun4n33i.fsf@gnu.org> <9caa3dee-fa21-9420-756d-2c1d5f9b652b@ix.netcom.com> Message-ID: On Sat, Feb 9, 2019 at 10:10 PM Asmus Freytag via Unicode wrote: > > I hope though that all the scripts can be supported with more or less > > compromises, e.g. like it would appear in a crossword. But maybe not. > > See other messages: not. For the crossword analogy, I can see why it's not good. But this doesn't mean there aren't any other ideas we could experiment with. Or do you mean to say that because it can't be made perfect, there's no point at all in partially improving? I don't think I agree with that. e. From unicode at unicode.org Sat Feb 9 17:43:45 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Sat, 9 Feb 2019 15:43:45 -0800 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> <83pns0n5ba.fsf@gnu.org> <83mun4n33i.fsf@gnu.org> <9caa3dee-fa21-9420-756d-2c1d5f9b652b@ix.netcom.com> Message-ID: On 2/9/2019 1:40 PM, Egmont Koblinger wrote: > On Sat, Feb 9, 2019 at 10:10 PM Asmus Freytag via Unicode > wrote: > >>> I hope though that all the scripts can be supported with more or less >>> compromises, e.g. 
like it would appear in a crossword. But maybe not.
>> See other messages: not.
> For the crossword analogy, I can see why it's not good. But this
> doesn't mean there aren't any other ideas we could experiment with.

"all...scripts" is the issue. We know how to handle text for all scripts and what complexities one has to account for in order to do that. You can back off some corner cases or (slightly) degrade things, but even after you are done with that, there will be scripts where the "more or less compromises" forced by the design parameters you gave will mean an utterly unacceptable display.

That said, there are scripts that had "passable" typewriter implementations and it may be possible to tweak things to approach that level of support. Don't know for sure, it depends on the details for each script.

>
> Or do you mean to say that because it can't be made perfect, there's
> no point at all in partially improving? I don't think I agree with
> that.

It's more a question of being upfront with your goal. At this point I understand it as accepting some design parameters as fundamental and seeing whether there are some tweaks that allow more scripts to work with or to "survive" given the constraints.

That's not a totally useless effort, but it is a far cry from Unicode's universal support for ALL writing systems.

A./

PS: also we have been seriously hijacking a thread related to bidi

>
>
> e.
>
From unicode at unicode.org Sat Feb 9 17:46:38 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 9 Feb 2019 23:46:38 +0000
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To:
References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
 <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org>
 <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org>
 <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org>
Message-ID: <20190209234638.7b35d0c6@JRWUBU2>

On Sat, 9 Feb 2019 13:02:55 -0800
"Asmus Freytag \(c\) via Unicode" wrote:

> To force Hindi crosswords mode you need to segment the string into
> syllables, each having a variable number of characters, and then assign
> a single display position to them. Now some syllables are wider than
> others, so you could use the single/double width paradigm. The result
> may be somewhat legible for Devanagari, but even some of the closely
> related scripts may not fit that well.

It is also possible that whole syllables are used because there are vertical words.

> To give you an idea, here is an Arabic crossword. It uses the isolated
> shape of all letters and writes all words unconnected. That's two things
> that may be acceptable for a puzzle, but not for text output.
>
> http://www.everyday-arabic.com/2013/12/crossword1.html
>
> (try typing 3 vertical as a word to see the difference - it's 4x
> U+062A)

Crosswords suffer from the need to be read vertically as well as horizontally. Can Arabic naturally be written vertically?

In any case, Arabic typewriters exist and, so far as I understand, work. The problem rather seems to be one of standardising the Procrustean technique to be used. It seems from what Khaled Hosny wrote that monospace for letters is the usual solution already. The design difficulty for Arabic is rather that horizontal adjacency may sometimes need to be treated as accidental rather than as an invitation to cursively join.

Richard.
From unicode at unicode.org Sat Feb 9 17:50:49 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 9 Feb 2019 23:50:49 +0000
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To:
References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
 <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org>
 <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org>
 <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org>
Message-ID: <20190209235049.7cb45f59@JRWUBU2>

On Sat, 9 Feb 2019 22:31:37 +0100
Egmont Koblinger via Unicode wrote:

> Let's take the Devanagari improvement of the other day. Until now,
> there were plenty of dotted circles shown, and combining spacing marks
> that should've been placed before the letter were placed after the
> letter, before a placeholder dotted circle. Now they are displayed as
> expected: the combining spacing mark shows up before the letter (if
> it's of that kind), and no dotted circle. The letter + spacing marks
> now show up correctly. The entire word still doesn't, e.g. there are
> often spaces between letters where the upper line connecting them
> should be continuous.

This is an example of where one needs a font designed for terminal emulators.

Richard.

From unicode at unicode.org Sat Feb 9 17:59:46 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Sun, 10 Feb 2019 00:59:46 +0100
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To: <20190209235049.7cb45f59@JRWUBU2>
References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
 <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org>
 <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org>
 <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org>
 <20190209235049.7cb45f59@JRWUBU2>
Message-ID:

Hi,

On Sun, Feb 10, 2019 at 12:52 AM Richard Wordingham via Unicode wrote:

> This is an example of where one needs a font designed for terminal
> emulators.
Definitely, this is another approach I forgot to mention in my mail, rather than VTE switching to harfbuzz and figuring out all the issues. This approach would also make them usable in every decent terminal emulator at once, not just VTE. Is there such a monospace font obeying wcwidth (that is: double wide character for when a spacing mark is combined) for Devanagari? Is there a monospace font for Arabic, for Syriac, etc.? (How much do these questions make sense at all?) If there are such fonts, I'd be happy to use them for testing. e. From unicode at unicode.org Sat Feb 9 19:25:14 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 10 Feb 2019 01:25:14 +0000 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <20190209212931.dvsutuqft7qsfht3@angband.pl> References: <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <83r2cgn64h.fsf@gnu.org> <83pns0n5ba.fsf@gnu.org> <83mun4n33i.fsf@gnu.org> <20190209212931.dvsutuqft7qsfht3@angband.pl> Message-ID: <20190210012514.6678351d@JRWUBU2> On Sat, 9 Feb 2019 22:29:31 +0100 Adam Borowski via Unicode wrote: > On Sat, Feb 09, 2019 at 10:01:21PM +0200, Eli Zaretskii via Unicode > wrote: > > I don't know. Maybe it keeps a database of character combinations > > that need shaping, each one with the maximum width on display the > > result can occupy. Or maybe it does something else. If it cannot, > > and the terminal cannot either, then what you say is that some > > scripts can never be supported by text terminals. > > That's doable even within the current rules, where every codepoint > bears a wcwidth of 0, 1 or 2. A cluster made of codepoints a ' b c d > " ^ (where a b c d have widths 1 while ' " ^ widths 0) needs to be > rendered in exactly 4 cells. This may force stretching or condensing > the shaped cluster compared to what usual typography would demand but > that's in no way different from stretching Latin "i" or condensing > "W". 
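
The additive rule quoted above (each codepoint has a width of 0, 1 or 2, and a cluster gets the sum) can be sketched in Python. The standard library has no `wcwidth()`, so `approx_wcwidth` below is a rough stand-in built on `unicodedata`; it is an assumption for illustration, not the C library's table, and `cells_for` is a hypothetical helper name.

```python
import unicodedata


def approx_wcwidth(ch: str) -> int:
    """Rough per-codepoint width: 0, 1 or 2 cells (approximation only)."""
    # Non-spacing and enclosing marks, and format controls, take no cell.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def cells_for(text: str) -> int:
    """Additive rule: the width of a string is the sum over its codepoints."""
    return sum(approx_wcwidth(ch) for ch in text)


# Four base letters carrying three combining marks, like a ' b c d " ^ above.
cluster = "a\u0301bcd\u0308\u0302"
print(len(cluster), "codepoints,", cells_for(cluster), "cells")
```

The sketch deliberately ignores the cases raised in the reply below, where a purely additive formula breaks down (emoji sequences and Indic consonant stacks).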
It would be helpful if overlong shapings were condensed automatically. The general principle that functions work better on strings applies here. There are two obvious situations where the additive formulae break down. (a) Emoji should, should they not, occupy at least 2 cells. There are a few problem sequences, such as (or is wcwidth(0x20E3) equal to 1?). (b) Brahmi-like Indic scripts. In many of these, the combination of a virama or invisible stacker and a base consonant acts like a combining mark, either causing no advance or as a mark with a very slight width. Examples include Grantha, Myanmar, Tai Tham and Khmer. Stretching a stack of 3 or 4 consonants to occupy 3 or 4 cells instead of 1 would be worse than stretching 'i'. If you do it, you want fonts that adjust the glyphs accordingly, just as for 'i'. Richard. From unicode at unicode.org Sat Feb 9 19:49:05 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 10 Feb 2019 01:49:05 +0000 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> Message-ID: <20190210014905.352d8c99@JRWUBU2> On Sat, 9 Feb 2019 18:42:52 +0100 Egmont Koblinger via Unicode wrote: > The > problem that I don't know how to address is: What if harfbuzz tells us > that the overall width for rendering a particular grapheme cluster is > significantly different from its designated area (the number of > character cells [wcswidth()] multiplied by the width of each)? You have to reduce the width of the glyph used. The tricky bit is where the glyph deliberately overhangs or underlies a neighbouring glyph. 
A good example of this is almost U+0E33 THAI CHARACTER SARA AM, whose nikkhahit component can typically overhang the previous character; however, ink beyond the left limit should not be a problem for LTR scripts. Which side do you align RTL cells on?

Now, you might want to treat U+0E33 as interacting with its predecessor, because it does. The test word is น้ำ 'water'.

Richard.

From unicode at unicode.org Sat Feb 9 21:10:42 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 10 Feb 2019 03:10:42 +0000
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To:
References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
 <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org>
 <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org>
 <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org>
 <20190209235049.7cb45f59@JRWUBU2>
Message-ID: <20190210031042.72384bf9@JRWUBU2>

On Sun, 10 Feb 2019 00:59:46 +0100
Egmont Koblinger via Unicode wrote:

> Is there such a monospace font obeying wcwidth (that is: double wide
> character for when a spacing mark is combined) for Devanagari?

For CV, that would correspond to a Hindi typewriter, so the odds look good. The Remington keyboard layout is taken from the typewriter design. However, the typewriter had non-spacing keys for repha (roughly ) and vattu (), so you'll be out of luck for consonant clusters. On the other hand, is two key strokes - the cells would be for and ! There's an implementation of the keyboard in the M17N database - hi-remington.mim.

> Is there a monospace font for Arabic,

Apart from wcwidth("??") = ?2, Khaled has already said in this thread that there are such fonts.

> for Syriac, etc.? (How much do these questions make sense at all?)

Perfect sense.

Richard.
From unicode at unicode.org Sat Feb 9 21:45:38 2019
From: unicode at unicode.org (Martin J. Dürst via Unicode)
Date: Sun, 10 Feb 2019 03:45:38 +0000
Subject: Encoding italic
In-Reply-To: <20190209105805.3884e35f@JRWUBU2>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org>
 <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com>
 <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com>
 <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>
 <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com>
 <001701d4b942$ca834e50$5f89eaf0$@xencraft.com>
 <69f43412.412.168a368b74a.Webtop.72@btinternet.com>
 <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com>
 <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com>
 <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com>
 <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com>
 <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com>
 <20190209105805.3884e35f@JRWUBU2>
Message-ID: <8d8394ca-a753-227f-5526-f11d60854651@it.aoyama.ac.jp>

On 2019/02/09 19:58, Richard Wordingham via Unicode wrote:
> On Fri, 8 Feb 2019 18:08:34 -0800
> Asmus Freytag via Unicode wrote:

>> Under the implicit assumptions bandied about here, the VS approach
>> thus reveals itself as a true rich-text solution (font switching)
>> albeit realized with pseudo coding rather than markup, markdown or
>> escape sequences.
>
> Isn't that already the case if one uses variation sequences to choose
> between Chinese and Japanese glyphs?

Well, not necessarily. There's nothing prohibiting a font that includes both Chinese and Japanese glyph variants.

Regards, Martin.
From unicode at unicode.org Sat Feb 9 22:25:53 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 10 Feb 2019 04:25:53 +0000
Subject: Encoding italic
In-Reply-To: <8d8394ca-a753-227f-5526-f11d60854651@it.aoyama.ac.jp>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org>
 <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>
 <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com>
 <001701d4b942$ca834e50$5f89eaf0$@xencraft.com>
 <69f43412.412.168a368b74a.Webtop.72@btinternet.com>
 <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com>
 <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com>
 <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com>
 <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com>
 <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com>
 <20190209105805.3884e35f@JRWUBU2>
 <8d8394ca-a753-227f-5526-f11d60854651@it.aoyama.ac.jp>
Message-ID: <28ab3a43-624f-2563-d485-e7002e4b3b3b@gmail.com>

Martin J. Dürst wrote,

>> Isn't that already the case if one uses variation sequences to choose
>> between Chinese and Japanese glyphs?
>
> Well, not necessarily. There's nothing prohibiting a font that includes
> both Chinese and Japanese glyph variants.

Just as there's nothing prohibiting a single font file from including both roman and italic variants of Latin characters.

From unicode at unicode.org Sun Feb 10 04:35:56 2019
From: unicode at unicode.org (Egmont Koblinger via Unicode)
Date: Sun, 10 Feb 2019 11:35:56 +0100
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To: <20190210014905.352d8c99@JRWUBU2>
References: <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org>
 <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
 <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org>
 <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org>
 <20190209140648.12174bf4@JRWUBU2> <20190210014905.352d8c99@JRWUBU2>
Message-ID:

On Sun, Feb 10, 2019 at 2:57 AM Richard Wordingham via Unicode wrote:

> Which side do you align RTL cells on?
It's out of the scope of my docs. In the current work-in-progress implementation I align them to the left, but there's a TODO entry to align them to the right instead (or maybe center all the glyphs).

e.

From unicode at unicode.org Sun Feb 10 05:02:12 2019
From: unicode at unicode.org (Rebecca Bettencourt via Unicode)
Date: Sun, 10 Feb 2019 03:02:12 -0800
Subject: Encoding italic
In-Reply-To: <20190209142124.66d3edbb@JRWUBU2>
References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com>
 <20190209142124.66d3edbb@JRWUBU2>
Message-ID:

On Sat, Feb 9, 2019 at 6:23 AM Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> On Sat, 9 Feb 2019 04:52:30 -0800
> David Starner via Unicode wrote:
>
> > Note that this is actually the only thing that stands out to me in
> > Unicode not supporting older character sets; in PETSCII (Commodore
> > 64), the high-bit characters were the reverse (in this
> > sense) of the low-bit characters.
>
> Later ISCII has some styling codes, bold and italic amongst them.
>

Interesting. I found the 1991 ISCII spec:

http://varamozhi.sourceforge.net/iscii91.pdf

The styling codes are:

EF 30 - Bold
EF 31 - Italic
EF 32 - Underline
EF 33 - Double Width
EF 34 - Highlight
EF 35 - Outline
EF 36 - Shadow
EF 37 - Double Height, Top Half
EF 38 - Double Height, Bottom Half
EF 39 - Double Height & Double Width

There are also codes for switching scripts (Roman, Devanagari, Bengali, Tamil, Arabic, Persian, etc.) but these are not necessary since Unicode encodes these separately.

These take effect "till the end of a line, or till the same attribute [code is encountered]." In other words, these just toggle the attribute, and all the attributes are reset when a newline is encountered.
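
The toggle-and-reset behaviour summarised in the message above can be sketched in Python. This is a hedged illustration based only on that summary, not on a reading of the ISCII specification itself: `ATTRIBUTES` and `styled_bytes` are hypothetical names, the byte 0x0A is assumed to be the newline that resets all attributes, and decoding of the remaining bytes into characters is left out.

```python
# Attribute codes as listed in the message: 0xEF followed by 0x30..0x39.
ATTRIBUTES = {
    0x30: "bold",
    0x31: "italic",
    0x32: "underline",
    0x33: "double width",
    0x34: "highlight",
    0x35: "outline",
    0x36: "shadow",
    0x37: "double height, top half",
    0x38: "double height, bottom half",
    0x39: "double height & double width",
}


def styled_bytes(data: bytes) -> list[tuple[int, frozenset[str]]]:
    """Pair each payload byte with the set of attributes active at that point."""
    active: set[str] = set()
    result = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0xEF and i + 1 < len(data) and data[i + 1] in ATTRIBUTES:
            # The same code switches an attribute on, then off again.
            active ^= {ATTRIBUTES[data[i + 1]]}
            i += 2
            continue
        if byte == 0x0A:
            # Assumption: a newline resets every attribute.
            active.clear()
        result.append((byte, frozenset(active)))
        i += 1
    return result


sample = b"a\xef\x30b\xef\x31c\xef\x30d\ne"
for byte, attrs in styled_bytes(sample):
    print(repr(chr(byte)), sorted(attrs))
```

In the sample, "b" is bold, "c" is bold and italic, "d" is italic only (the second EF 30 toggles bold off), and "e" on the next line carries no attributes.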
From unicode at unicode.org Sun Feb 10 07:18:38 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 10 Feb 2019 14:18:38 +0100
Subject: Encoding italic
In-Reply-To: <28ab3a43-624f-2563-d485-e7002e4b3b3b@gmail.com>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org>
 <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>
 <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com>
 <001701d4b942$ca834e50$5f89eaf0$@xencraft.com>
 <69f43412.412.168a368b74a.Webtop.72@btinternet.com>
 <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com>
 <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com>
 <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com>
 <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com>
 <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com>
 <20190209105805.3884e35f@JRWUBU2>
 <8d8394ca-a753-227f-5526-f11d60854651@it.aoyama.ac.jp>
 <28ab3a43-624f-2563-d485-e7002e4b3b3b@gmail.com>
Message-ID:

On Sun, 10 Feb 2019 at 05:34, James Kass via Unicode wrote:

>
> Martin J. Dürst wrote,
>
> >> Isn't that already the case if one uses variation sequences to choose
> >> between Chinese and Japanese glyphs?
> >
> > Well, not necessarily. There's nothing prohibiting a font that includes
> > both Chinese and Japanese glyph variants.
>
> Just as there's nothing prohibiting a single font file from including
> both roman and italic variants of Latin characters.
>

Maybe, but such a font would not work as intended to display both styles distinctly with the common use of the italic style: it would have to make a default choice and you would then need either a special text encoding, or to enable an OpenType feature (if using the OpenType font format) to select the other style in a non-standard custom way.
The only case where it happens in real fonts is for the mapping of Mathematical Symbols which have a distinct encoding for some variants (only for a basic subset of the Latin alphabet, as well as some basic Greek and a few other letters from other scripts), and this is typically done only in symbol fonts containing other mathematical symbols, and only because of the specific encoding for such mathematical use. We also have the variants registered in Unicode for IPA usage (only lowercase letters, treated as symbols and not case-paired).

These were allowed in Unicode because of their specific contextual use as distinctive symbols from known standards, and not for general use in human languages (because these encodings are defective and don't have the necessary coverage, notably for the many diacritics, case mappings, and other linguistic, segmentation and layout properties).

The same can be said about superscript/subscript variants, bold variants, monospace variants: they have specific uses and are not made for general purpose texts in human languages with their common orthographic conventions. Latin is a large script and one of the most complex, and it's quite normal that there are some deviating usages for specific purposes, provided they are bounded in scope and use.

But what you would like is to extend the whole Latin script (and why not Greek, Cyrillic, and others) with multiple reencodings for a lot of stylistic variants, and each time a new character or diacritic is encoded it would have to be encoded multiple times (so you'd break the character encoding model, and would just complicate the implementation even more, and would also create new security issues with a lot of new confusables, that every user of Unicode would then have to take into account, and every application or library would then need to be updated, and have to include large data tables to handle them).
It would also create many conflicts if we used the "VARIATION SELECTOR n" characters, or we would need to permanently assign specific ones to specific styles; we would then rapidly run out of "VARIATION SELECTOR n" characters in Unicode: we only have 256 of them, and only one is more or less permanently dedicated.

[VS16 is almost completely reserved now for the distinction between normal/linguistic and emoji/colorful variants. The emoji subset in Unicode is an open set which could expand in the future to tens of thousands of symbols, and will likely cause a large work overhead in the CLDR project just to describe them. That is one reason why I think that emoji character data in CLDR should be separated into a distinct translation project, with its own versioning and milestones, and not maintained in sync with the rest of CLDR data, considering how emojis have flooded the CLDR survey discussions, when this subset has many known issues and inconsistencies and still no viable encoding model like the "character encoding model" to make it more consistent, and updatable separately from the rest of the Unicode UCD releases. In my opinion the emojis in Unicode are still an alpha project in development and it's too soon to describe them as a "standard" when there are many other possible ways to handle them; these emojis are there now only to remain as "legacy" mappings, but won't survive an expected new formal standard about them replacing the current mess they create.]
From unicode at unicode.org Sun Feb 10 07:54:39 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 10 Feb 2019 14:54:39 +0100
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To:
References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
 <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org>
 <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org>
 <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org>
Message-ID:

On Sat, 9 Feb 2019 at 20:55, Egmont Koblinger via Unicode <unicode at unicode.org> wrote:

> Hi Asmus,
>
> > On quick reading this appears to be a strong argument why such emulators
> > will never be able to be used for certain scripts. Effectively, the model
> > described works well with any scripts where characters are laid out (or
> > can be laid out) in fixed width cells that are linearly adjacent.
>
> I'm wondering if you happen to know:
>
> Are there any (non-CJK) scripts for which a mechanical typewriter does
> not exist due to the complexity of the script?
>

Look into South and Southeast Asian scripts (Lao, Khmer, Tibetan...) and large syllabaries (CANS, Ethiopian). Even Arabic is challenging and does not work very well (or is very ugly) with typewriters or monospaced fonts, except if we use "simplified" Arabic. Hebrew is a bit better but also has issues if you need to support all its diacritics. Finally even Latin is not easy to fit with its ligatures and multiple diacritics (some of them with complex layouts and applicable to pairs of letters, or sometimes larger groups).

The monospace restriction is a strong limitation, but then I don't see why a "terminal" could not handle fonts with variable metrics, and why it must be modeled only as a regular grid of rectangular cells (all of equal size) containing only one "character" (or cluster?). It is perfectly possible to have a terminal handling text as a collection of "logical lines", split (horizontally?)
as multiple spans covering one or more cells, each span containing one or more characters (or a full cluster) rendered correctly. But then you recreate the basic HTML standard: just discard the "document" and "body" levels, which would be implicit in a terminal, keep the "block" and "inline" elements, and flow the text (note that rendered lines could also have variable heights, depending on the height of their unbreakable spans and their vertical alignment...).

But then you need specific controls to make proper vertical alignments (basically you need a "tabulator" in the terminal with a way to define the start of a tabulator scope and its end, and then reference tabulations by id when defining them in the middle of the text; this tabulator would be more powerful than just the TAB control which only uses an implicit/predefined tabulator).

Then for editors in terminals you need a way to query the position of some items and make "logical" moves: the simple (line/column) coordinates on a grid are not usable. In HTML we would do that with form input elements (the form is flowed normally but is navigable and input elements will have their own editable areas).

So using controls, you would try to mimic again what HTML already provides you for free (and without complex specifications and redevelopment).
So my opinion is that all legacy terminal protocols will remain broken and it is more viable to work with the W3C to define a basic HTML profile suitable for terminals, one that will benefit from all the improvements made in HTML to support i18n, including required ones (BiDi, variable-width fonts needed for complex scripts, accessibility...), but without the extra elements that were added in HTML5 for semantic document structures (HTML5 still speaks about the "document" level, but there's little defined for documents that are infinite streams that you can start reading from a random position and that possibly never terminate). All we need is a subset of HTML5 with only a few block elements without terminator tags ("p" would be implicit) and the inline elements for all the rest, and this becomes a viable "terminal protocol" which would deprecate all the legacy VT-like protocols (and would put an end to the desire to add many new controls or duplicate reencodings in Unicode for specific styles).

The only block elements that would be useful on top of this are forms and form inputs, to create editable fields, and some attributes to allow or disallow editing. Scripting would be an option (only for local data validation or filtering some inputs that must not be sent to the server, or to allow accessibility features, input methods and orthographic helpers).

Then with that we are no longer blocked by the old terminal limitations (but it will still be possible for a terminal emulator to create a reasonable layout to map it to a grid-based terminal, and then offer some helper tools to show a selectable popup view for things that cannot be rendered on the basic grid).

From unicode at unicode.org Sun Feb 10 09:31:57 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 10 Feb 2019 15:31:57 +0000
Subject: Encoding italic
In-Reply-To:
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org>
 <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com>
 <001701d4b942$ca834e50$5f89eaf0$@xencraft.com>
 <69f43412.412.168a368b74a.Webtop.72@btinternet.com>
 <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com>
 <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com>
 <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com>
 <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com>
 <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com>
 <20190209105805.3884e35f@JRWUBU2>
 <8d8394ca-a753-227f-5526-f11d60854651@it.aoyama.ac.jp>
 <28ab3a43-624f-2563-d485-e7002e4b3b3b@gmail.com>
Message-ID: <48a28b67-2e6a-8d0d-8808-edcac2a4ee44@gmail.com>

Philippe Verdy wrote,

>> ...[one font file having both italic and roman]...
> The only case where it happens in real fonts is for the mapping of
> Mathematical Symbols which have a distinct encoding for some
> variants ...

William Overington made a proof-of-concept font using the VS14 character to access the italic glyphs which were, of course, in the same real font. Which means that the developer of a font such as Deja Vu Math TeX Gyre could set up an OpenType table mapping the Basic Latin in the font to the italic math letter glyphs in the same font using the VS14 characters. Such a font would work interoperably on modern systems. Such a font would display italic letters both if encoded as math alphanumerics or if encoded as ASCII plus VS14. Significantly, the display would be identical.

> ...[math alphanumerics]...
> These were allowed in Unicode because of their specific contextual
> use as distinctive symbols from known standards, and not for general
> use in human languages

They were encoded for interoperability and round-tripping because they existed in character sets such as STIX.
They remain Latin letter form variants. If they had been encoded as the variant forms which constitute their essential identity it would have broken the character vs. glyph encoding model of that era. Arguing that they must not be used other than for scientific purposes is just so much semantic quibbling in order to justify their encoding.

Suppose we started using the double struck ASCII variants on this list in order to note Unicode character numbers such as ??+???????? or ??+????????? Hexadecimal notation is certainly math and Unicode can be considered a science. Would that be "math abuse" if we did it? (Is linguistics not a science?)

> (because these encodings are defective and don't have the necessary
> coverage, notably for the many diacritics,

The combining diacritics would be used.

> case mappings,

Adjust them as needed.

> and other linguistic, segmentation and layout properties).
>
> The same can be said about superscript/subscript variants,
> ... : they have specific use and not made for general purpose texts ...

So people who used ISO-8859-1 were not allowed to use the superscript digits therein for marking footnotes? Those superscript digits were reserved by ISO-8859-1 only for use by math and science?

MATHEMATICAL ITALIC CAPITAL A
Decomposition mapping: U+0041
Binary properties: Math, Alphabetic, Uppercase, Grapheme Base, ...

SUPERSCRIPT TWO
Decomposition mapping: U+0032
Binary properties: Grapheme Base

MODIFIER LETTER SMALL C
Decomposition mapping: U+0063
Binary properties: Alphabetic, Lowercase, Grapheme Base, ...
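
The decomposition mappings listed for these three characters can be checked against the Unicode Character Database that ships with a programming language. What follows is only a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is hypothetical, and the data returned depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch):
    """Return (name, decomposition field, NFKC form) for a single character."""
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),      # e.g. a tag such as <font> plus a code point
        unicodedata.normalize("NFKC", ch),  # compatibility-normalized form
    )


# U+1D434, U+00B2 and U+1D9C are the three characters listed above.
for ch in ("\U0001D434", "\u00B2", "\u1D9C"):
    name, decomposition, nfkc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: decomposition={decomposition!r} NFKC={nfkc!r}")
```

All three carry a compatibility decomposition to a Basic Latin character, which is why compatibility normalization (NFKC or NFKD) folds them back to plain A, 2 and c.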

From unicode at unicode.org Sun Feb 10 13:30:41 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 10 Feb 2019 19:30:41 +0000
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To:
References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
 <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org>
 <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org>
 <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org>
Message-ID: <20190210193041.5add71ea@JRWUBU2>

On Sun, 10 Feb 2019 14:54:39 +0100
Philippe Verdy via Unicode wrote:

> On Sat, 9 Feb 2019 at 20:55, Egmont Koblinger via Unicode
> <unicode at unicode.org> wrote:
>
> > Hi Asmus,
> >
> > > On quick reading this appears to be a strong argument why such
> > > emulators will never be able to be used for certain scripts.
> > > Effectively, the model described works well with any scripts
> > > where characters are laid out (or can be laid out) in fixed
> > > width cells that are linearly adjacent.
> >
> > I'm wondering if you happen to know:
> >
> > Are there any (non-CJK) scripts for which a mechanical typewriter
> > does not exist due to the complexity of the script?
>
> Look into South Asian scripts (Lao, Khmer, Tibetan...) and...

The Khmer script is an interesting case - see http://onkhmertype.com/the-cambodian-typewriter. The problem there is that deep cells are needed. What's the VTE algorithm for the vertical extent of the cell?

The only problem I can see for Lao is that there can be two marks below a consonant. Otherwise, a straightforward adaptation of a Thai typewriter should suffice.

There's a Tai Tham typewriter in the National Museum in Bangkok. However, spelling may have been adapted to cope with any limitations.

>... large syllabaries (CANS, Ethiopian).

That's more a matter of extent than complexity.

Sesquidimensional Egyptian hieroglyphs could be tricky - they'll be like producing 2-D renderings of ideographic description sequences.

There could be a problem with standardising cuneiform character widths.

Richard.

From unicode at unicode.org Sun Feb 10 16:49:54 2019
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sun, 10 Feb 2019 15:49:54 -0700
Subject: Encoding italic
Message-ID: <006901d4c192$f0d9d260$d28d7720$@ewellic.org>

Egmont Koblinger wrote:

> There are a lot of problems with these escape sequences, and if you go
> for a potentially new standard, you might not want to carry these
> problems.

As others have pointed out, I am suggesting the use of some profile of ISO 6429 within plain text to implement these features about which there is disagreement whether they belong in plain text or not. I am very definitely NOT proposing that anything be added to Unicode or 10646, nor that an all-new standard be created.

> There is not a well-defined framework for escape sequences.

I thought ISO 6429 defined things rather clearly, if verbosely.

> In this particular case you might say it starts with ESC [ and ends
> with the letter 'm', but how do you know where to end the sequence if
> that letter 'm' just doesn't arrive?

Well, what do you do in HTML if the closing '>' never arrives?

If it's simply a matter of the text coming to an end before the 'm' arrives, then it doesn't matter. If the 'm' (or other final code unit for other commands) is dropped but the sequence goes on, like [3This is italicized[m, then gosh, I don't know offhand what the standard says. It might be worthwhile to try looking it up, or seeing what implementations do, or defining it clearly in the profile.

> Terminal emulators have extremely complex tables for parsing (and
> still many of them get plenty of things wrong).
> It's unreasonable for
> any random small utility processing Unicode text to go into this
> business of recognizing all the well-known escape sequences, not even
> to the extent to know where they end.

Perhaps interestingly, I wrote a random small utility many years ago that displayed ISO 6429 sequences on a Windows console, back in the dark ages between ANSI.SYS and Windows 10 support for 6429. It didn't cover the entire standard, nor could it, but a decent subset. It understood where sequences ended, even unknown ones, because that is all laid out in the standard.

> Whatever is designed should be much more easily parseable. Should you
> say "everything from ESC[ to m", you'll cause a whole bunch of
> problems when a different kind of escape sequence gets interpreted as
> Unicode.

I'm afraid I don't understand this statement.

> A parser, by the way, would also have to interpret combined sequences
> like ESC[3;0;1m or alike, for which I don't see a good reason as
> opposed to having separate sequences for each.

That's easy:

3 = turn on italics
0 = turn off all special styling, including italics
1 = turn on bold (or intense, whichever the output device supports)

It's a silly sequence, because why would you turn on an attribute and then immediately turn it off before using it? But silly though it may be, it's well-formed and very easy to parse. My random small utility had no problem with it.

> Also, it should be carefully evaluated what to do with C1 (U+009B)
> instead of the C0 ESC[ opening for an escape sequence - here terminal
> emulators vary. These just make everything even more cumbersome.

Why would they vary? CSI encoded as <1B 5B> or as <9B> is exactly the same. Again, this is very clear in the standard.

> ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity".
> It's only nowadays that most terminal emulators support 256 colors and
> some even support 16M true colors that some emulators try to push for
> this bit unambiguously meaning "bold" only, whereas in most emulators
> it means "both bold and increased intensity". [...]

Why would we expect every displayed and printed page to look identical? That's not going to happen no matter what encoding mechanism you use for "bold" and "intense" and the rest. Not all HTML pages look identical either.

> Should this scheme be extended for colors, too? What to do with the
> legacy 8/16 as well as the 256-color extensions wrt. the color
> palette?

Why not?

> Should Unicode go into the business

Nope. Unicode should do nothing about this.

> For 256-colors and truecolors, there are two or three syntaxes out
> there regarding whether the separator is a colon or a semicolon.
> ECMA-48 doesn't say anything about it, ITU T.416 does, although it's
> absolutely not clear. See e.g. the discussion at the comment section
> of https://gist.github.com/XVilka/8346728 , in Dec 2018, we just
> couldn't figure out which syntax exactly ITU T.416 wants to say.

That sounds like someone should send a question to ITU-T. Exegesis would perhaps be more productive than despair.

> Moreover, due to a common misinterpretation of the spec, one of the
> positional parameters is often omitted.

That's a decision designers and implementers are sometimes faced with: should we remain bug-compatible with other implementations, or follow the straight and narrow path? I remember browsers going through that era too.

> Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
> for curly underline. What to do with them?

Should we be extension-compatible with other implementations, or follow the straight and narrow path? Another decision that is not unique to ISO 6429.

> Where to draw the line what to add to Unicode and what not to?
> Will Unicode possibly be a bottleneck of further improvements in
> terminal emulators, because from now on every new mode we figure out
> we'd like to have in terminals should go through some Unicode
> committee?

I think you know the answer to this by now.

>> This mechanism [...] is already supported
>> as widely as any new Unicode-only convention will ever be.
>
> I truly doubt this, these escape sequences are specific to terminal
> emulation, an extremely narrow subset of where Unicode is used and
> rich text might be desired.

That's true. Probably next to nobody is using ISO 6429 sequences for plain text intended for print, just as next to nobody is using the proposed VS14 mechanism or Andrew West's Plane 14 mechanism. My suggestion was to document the ISO 6429 approach, run it up the flagpole, and see if anyone salutes.

> Or, if it wants to adopt some already existing technology, I find
> HTML/CSS a much better starting point.

Q: How can we represent italics in plain text?
A: Use rich text.

Kent Karlsson wrote:

>> - Underline on: ESC [4m
> (implies turning double underline off)
> Underline, double: ESC [21m
> (implies turning single underline off)

I deliberately left out single and double underlining, and many other features of ISO 6429 SGR (such as Fraktur). The email was not intended as a final proposal. I do think it would be strange for single and double underlining not to cancel each other out.

> Note that these do NOT nest (no stack...), just state changes for the
> relevant PART of the "graphic" (i.e. style) state. So the approach in
> that regard is quite different from the approach done in HTML/CSS.

I don't regard that as either a bug or a feature. I certainly don't expect that every such mechanism has to nest, simply because SGML and its descendants are designed that way.
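
The claims that a control sequence has a well-defined end, that a combined sequence such as ESC[3;0;1m is easy to interpret, and that CSI as <1B 5B> or <9B> can be treated alike, can be illustrated with a short sketch. This is not the utility mentioned in the message; the function name `parse_sgr` and the two-attribute style model are assumptions made purely for illustration, and only SGR parameters 0, 1, 3, 22 and 23 are interpreted.

```python
import re

# A control sequence: CSI (ESC [ or the single C1 control U+009B),
# parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F, one final byte 0x40-0x7E.
CSI_RE = re.compile(r"(?:\x1b\[|\x9b)([0-9:;<=>?]*)([ -/]*)([@-~])")


def parse_sgr(text):
    """Split text into (run, style) pairs; hypothetical helper, not a full ISO 6429 parser."""
    style = {"bold": False, "italic": False}
    runs, pos = [], 0
    for m in CSI_RE.finditer(text):
        if m.start() > pos:
            runs.append((text[pos:m.start()], dict(style)))
        pos = m.end()
        params, intermediates, final = m.groups()
        if final != "m" or intermediates:
            continue  # some other control function: skipped, not interpreted
        for code in params.split(";"):
            if code in ("", "0"):      # an empty parameter defaults to 0
                style = {"bold": False, "italic": False}
            elif code == "1":
                style["bold"] = True
            elif code == "3":
                style["italic"] = True
            elif code == "22":
                style["bold"] = False
            elif code == "23":
                style["italic"] = False
            # any other parameter is ignored in this sketch
    if pos < len(text):
        runs.append((text[pos:], dict(style)))
    return runs


print(parse_sgr("plain \x1b[3mitalic\x1b[0m plain"))
print(parse_sgr("\x1b[3;0;1mbold only"))
print(parse_sgr("\x9b3mitalic via C1 CSI"))
```

Under this purely syntactic reading, the dropped-'m' case discussed above does not run away: in ESC[3This the letter T falls in the final-byte range, so it is consumed as the end of an unrelated control function and the rest of the text is shown unstyled. Whether that is the desirable behaviour is exactly the kind of question a profile would have to settle.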

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Sun Feb 10 18:19:04 2019
From: unicode at unicode.org (Kent Karlsson via Unicode)
Date: Mon, 11 Feb 2019 01:19:04 +0100
Subject: Encoding italic
In-Reply-To: <48a28b67-2e6a-8d0d-8808-edcac2a4ee44@gmail.com>
Message-ID:

On 2019-02-10 16:31, "James Kass via Unicode" wrote:

>
> Philippe Verdy wrote,
>
>>> ...[one font file having both italic and roman]...

For OpenType fonts, there is a "design axis" called "ital". Value 0 on that axis would be roman (upright, normally), and value 1 on that axis would be italic. I don't know to what extent that is available in OpenType fonts in common use... (Instead of using two separate font files.)

[math chars]
> They were encoded for interoperability and round-tripping because they
> existed in character sets such as STIX.

They were basically requested "by" STIX, yes. Not sure about the round-tripping bit.

> They remain Latin letter form
> variants. If they had been encoded as the variant forms which
> constitute their essential identity it would have broken the character
> vs. glyph encoding model of that era. Arguing that they must not be
> used other than for scientific purposes

I don't think that particular argument was made, IIUC.

> is just so much semantic
> quibbling in order to justify their encoding.
>
> Suppose we started using the double struck ASCII variants on this list
> in order to note Unicode character numbers such as ??+???????? or
> ??+?????????

That particular example would be ok (even though outside of a conventional math formula). But we were talking about natural languages in their conventional orthography, using italics/bold.

/Kent K

From unicode at unicode.org Sun Feb 10 18:22:25 2019
From: unicode at unicode.org (Elias Mårtenson via Unicode)
Date: Mon, 11 Feb 2019 08:22:25 +0800
Subject: Bidi paragraph direction in terminal emulators
In-Reply-To:
References: <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org>
 <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org>
 <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org>
 <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org>
 <20190209140648.12174bf4@JRWUBU2> <20190210014905.352d8c99@JRWUBU2>
Message-ID:

On Sun, 10 Feb 2019, 18:39 Egmont Koblinger via Unicode wrote:

> On Sun, Feb 10, 2019 at 2:57 AM Richard Wordingham via Unicode
> wrote:
>
> > Which side do you align RTL cells on?
>
> It's out of the scope of my docs.
>
> In the current work-in-progress implementation I align them to the
> left, but there's a TODO entry to align them to the right instead (or
> maybe center all the glyphs).
>

For all the willingness to come up with ways to modernise the terminal, you've only spoken about trying to shoehorn RTL text into the basic VT102 terminal.

What I mean is that if you're willing to go as far as introducing new escape codes to allow applications to better control the behaviour of this one feature, why do you stop there? Why still limit yourself to the bonds of VT102?

Once you take that first step towards the new control codes, why not simply come up with a new scheme? Why not let me do:

TERM=newfancything

And then I'd have a system that supports everything I need: variable-width fonts, proper RTL text, pixel-precise character positioning, all the colours, inline graphics, etc.

There is nothing magic about the grid of cells, and once you introduce new escape sequences, you might as well truly modernise the terminal.

Regards,
Elias

From unicode at unicode.org Mon Feb 11 05:47:54 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 11 Feb 2019 12:47:54 +0100
Subject: Encoding italic
In-Reply-To: <48a28b67-2e6a-8d0d-8808-edcac2a4ee44@gmail.com>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org>
 <2cea843d-00f5-ed25-de11-69562b8be9b7@gmail.com>
 <001701d4b942$ca834e50$5f89eaf0$@xencraft.com>
 <69f43412.412.168a368b74a.Webtop.72@btinternet.com>
 <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com>
 <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com>
 <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com>
 <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com>
 <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com>
 <20190209105805.3884e35f@JRWUBU2>
 <8d8394ca-a753-227f-5526-f11d60854651@it.aoyama.ac.jp>
 <28ab3a43-624f-2563-d485-e7002e4b3b3b@gmail.com>
 <48a28b67-2e6a-8d0d-8808-edcac2a4ee44@gmail.com>
Message-ID:

On Sun, 10 Feb 2019 at 16:42, James Kass via Unicode wrote:

>
> Philippe Verdy wrote,
>
> >> ...[one font file having both italic and roman]...
> > The only case where it happens in real fonts is for the mapping of
> > Mathematical Symbols which have a distinct encoding for some
> > variants ...
>
> William Overington made a proof-of-concept font using the VS14 character
> to access the italic glyphs which were, of course, in the same real
> font. Which means that the developer of a font such as Deja Vu Math TeX
> Gyre could set up an OpenType table mapping the Basic Latin in the font
> to the italic math letter glyphs in the same font using the VS14
> characters. Such a font would work interoperably on modern systems.
> Such a font would display italic letters both if encoded as math
> alphanumerics or if encoded as ASCII plus VS14. Significantly, the
> display would be identical.
>
> > ...[math alphanumerics]...
> > These were allowed in Unicode because of their specific contextual
> > use as distinctive symbols from known standards, and not for general
> > use in human languages
>
> They were encoded for interoperability and round-tripping because they
> existed in character sets such as STIX. They remain Latin letter form
> variants. If they had been encoded as the variant forms which
> constitute their essential identity it would have broken the character
> vs. glyph encoding model of that era. Arguing that they must not be
> used other than for scientific purposes is just so much semantic
> quibbling in order to justify their encoding.
>
> Suppose we started using the double struck ASCII variants on this list
> in order to note Unicode character numbers such as ??+???????? or
> ??+????????? Hexadecimal notation is certainly math and Unicode can be
> considered a science. Would that be "math abuse" if we did it? (Is
> linguistics not a science?)
>
> > (because these encodings are defective and don't have the necessary
> > coverage, notably for the many diacritics,
>
> The combining diacritics would be used.
>

Not for the many precombined characters that are in Latin: do you intend to propose them to be reencoded with all the same variants encoded for maths? Or allow the maths symbols to have diacritics added on them (hint: this does not work correctly with the specific mathematical conventions on diacritics and their specific stacking rules: they are NOT reorderable through canonical equivalence, the order is significant in maths, so you would also need to use CGJ to fix the expected logical semantic and visual stacking order).

> > case mappings,
>
> Adjust them as needed.
>

Not so easy: case mappings cannot be fixed. They are stabilized in Unicode. You would need special casing rules under a specific "locale" for maths.

Really maths is a specific script even if it borrows some symbols from Latin, Greek or Hebrew but only in specific glyph variants.
These symbols should not even be considered part of the script they originate from (just as Latin A is not the same as Cyrillic A or Greek Alpha, which all have the same forms and the same origin).

I can argue the same thing about IPA notations: they are NOT the Latin script, and they also borrow some letter forms from Latin and Greek, but without any case mappings (only lowercase is used), and also with specific glyph variants.

Both examples are technical notations which do not obey the linguistic rules and normal processing of the script they originate from. They are specific "writing systems", unfortunately confused with "Unicode scripts", and then abused.

Note that some Latin letters have been borrowed from IPA too, for use in African languages, and case mappings were then needed: these should have been reencoded as a plain letter pair with a basic case mapping (not the special case mapping rules now needed for African languages, such as open o, which looks much like the mirrored C of Roman numerals, and open e, which was borrowed from the lowercase Greek epsilon but does not use the uppercase Greek Epsilon and instead uses another shape, meaning that the Latin open e should have been encoded as a plain letter pair, distinct from the Greek epsilon; but IPA already used the epsilon-like symbol...).

In the end these exceptions just cause many inconsistencies and complexities. Applications and libraries cannot adapt easily and are not downward compatible, because stable properties are immutable and specific tailorings are needed each time in applications: the more of these exceptions we add, the harder it is to adapt the standard and the more difficult it is to preserve compatibility.

In summary I don't like at all the dual encodings, or the encoding of additional letters that cannot use the normal stable properties (and this remark is also true for emojis: what a mess! full of exceptions and different incoherent encoding models!)

From unicode at unicode.org Mon Feb 11 06:19:53 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 11 Feb 2019 13:19:53 +0100
Subject: Encoding colour (from Re: Encoding italic)
In-Reply-To: <79e6f552.6d26.168d262e56c.Webtop.71@btinternet.com>
References: <20190208135327.665a7a7059d7ee80bb4d670165c8327d.4140a658bd.wbe@email03.godaddy.com>
 <642c3d7c.6bf3.168d1e557a8.Webtop.71@btinternet.com>
 <79e6f552.6d26.168d262e56c.Webtop.71@btinternet.com>
Message-ID:

On Sun, 10 Feb 2019 at 02:33, wjgo_10009 at btinternet.com via Unicode
<unicode at unicode.org> wrote:

> Previously I wrote:
>
> > A stateful method, though which might be useful for plain text streams
> > in some applications, would be to encode as characters some of the
> > glyphs for indicating colours and the digit characters to go with them
> > from page 5 and from page 3 of the following publication.
>
> > http://www.users.globalnet.co.uk/~ngo/locse027.pdf
>
> Thinking about this further, for this application copies of the glyphs
> could be redesigned so as to be square and could be emoji-style and the
> meanings of the characters specifying which colour component is to be
> set could be changed so that they refer to the number previously entered
> using one or more of the special digit characters. Thus the setting of
> colour components could be done in the same reverse notation way that
> the FORTH computer language works.
>

FORTH is not relevant to this discussion. Anyway, the usual order for Forth operators (Forth is a stack-based language, similar to PostScript, and works like calculators using reverse Polish notation) is to push the operands from left to right and then use the operator, which pops them in reverse order, from right to left, before pushing the result on the stack (so "a/b/c" becomes "/a get /b get div /c get div").
But colors are just an operator like "rgb(r,g,b)", and the natural order in stack-based languages should also be "/r get /g get /b get rgb".

Note that C/C++ (with C calling conventions) usually uses another order for its stack, pushing parameters from right to left (if they are not passed via dedicated registers in a fixed order, the first parameter from the right that fits a register being passed not on the stack but in the "main" accumulator register, possibly a pair of registers for long integers or long pointers, or a different register for floating-point values if floating-point registers are used).

There's no standard for the order of parameters in stack-based languages. It is arbitrary and specific to each language or to specific implementations of them. So if you want to create your own scripting language to support your non-standard extension, you can choose any order you want, but this will still not define a standard related to other languages, which have never been bound to a specific evaluation/encoding order.

Then don't pretend it will be part of the Unicode standard, which is not a scripting language and does not offer an "ABI" for stateful encodings with arbitrarily long contexts (Unicode has placed very low limits on the maximum length of lookahead needed to process text; your extension would not work within these reasonable limits, so it will have limited private use and cannot be part of TUS).

You may create your "proof of concept" (tested on limited configurations) but it will just be private [And so it should use PUA for full compatibility and not abuse the other standardized code points, as your extension would not be compatible with or conforming to the existing rules and limits without amending them, discussing at length how existing conforming applications can be adapted, and analyzing the effects if they are not updated.
Approving this extension is another matter. It will need to go through the standard process: be added to the proposals schedule, pass through the two technical committees, pass the alpha and beta phases, and then the prepublication. You'll also need to work on documentation and fix the many quirks found in it, and then you'll need supporters to pass the vote. If you're not a UTC member or an ISO member, you will never be able to vote for it yourself: you then need to convince the voters by listening to their remarks and refining your specification to match their wishes, and probably by splitting your proposal into several parts or limiting your initial goals, leaving the other problematic points for later. What remains "stable" in your proposal may not be usable in practice without the additional extensions still under discussion, and in fact this subset may remain in the encoding queue for years, until it reaches a point where it starts being usable for practical problems. Before that, you'll have to experiment with private use, and you should be ready to accept competing proposals that are not compatible with yours, and to learn from them in order to reach an acceptable consensus. Reaching that consensus is the longest step, but initially most voters will not decide for or against your proposal if they are not confident enough about the merit of each proposal, because they want to preserve reasonable compatibility across TUS versions and with existing applications without adding further problems, notably in terms of confusability/security.

But don't ask them to break the existing stability rules, which were even harder to formalize: these rules are the foundation that allowed TUS/ISO 10646 to become a successful worldwide standard, with lots of applications using it without much trouble and with much more benefit than the older, non-interoperable legacy encodings.]

From unicode at unicode.org Mon Feb 11 03:55:45 2019
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Mon, 11 Feb 2019 09:55:45 +0000 (GMT)
Subject: Encoding italic
In-Reply-To: <006901d4c192$f0d9d260$d28d7720$@ewellic.org>
References: <006901d4c192$f0d9d260$d28d7720$@ewellic.org>
Message-ID: <19cde956.265.168dbfbe52f.Webtop.73@btinternet.com>

Doug Ewell wrote:

> ..., just as next to nobody is using the proposed VS14 mechanism ...

Well, of course not because use of VS14 in a plain text document to record a request for an italic glyph version is not at the present time an official part of Unicode.

The next scheduled Unicode Technical Committee meeting is due to start on 30 April 2019.

Here is a link to the proposal document.

https://www.unicode.org/L2/L2019/19063-italic-vs.pdf

VS14 is used to indicate a request for an italic glyph version in my VS14 Maquette font but that is clearly just a maquette font for experimental use to test the concept and show that it works. An application program that supports OpenType and that has the liga table switched on is needed in order to use the VS14 Maquette font to demonstrate that the use of VS14 in this way works.

https://forum.high-logic.com/viewtopic.php?f=10&t=7831

William Overington

Monday 11 February 2019

From unicode at unicode.org Mon Feb 11 12:42:20 2019
From: unicode at unicode.org (Kent Karlsson via Unicode)
Date: Mon, 11 Feb 2019 19:42:20 +0100
Subject: Encoding italic
In-Reply-To: <19cde956.265.168dbfbe52f.Webtop.73@btinternet.com>
Message-ID:

On 2019-02-11 10:55, "wjgo_10009 at btinternet.com via Unicode" wrote:

> Doug Ewell wrote:
>
>> ..., just as next to nobody is using the proposed VS14 mechanism ...
>
> Well, of course not because use of VS14 in a plain text document to
> record a request for an italic glyph version is not at the present time
> an official part of Unicode.

Looking deeply into the crystal ball, swirling my hands over it...

...

...

Using a VS to get italics, or anything like that approach, will NEVER be a part of Unicode!

/Kent K

From unicode at unicode.org Mon Feb 11 16:46:32 2019
From: unicode at unicode.org (Kent Karlsson via Unicode)
Date: Mon, 11 Feb 2019 23:46:32 +0100
Subject: Encoding colour (from Re: Encoding italic)
In-Reply-To: <566149ed.87a.168dd5381ae.Webtop.71@btinternet.com>
Message-ID:

Continuing to look deep into the crystal ball, doing some more hand swirls...

...

...

The scheme quoted (far) below (from wjgo_10009), or anything like it, will NEVER be part of Unicode!

-----------

But I do like colour (and bold and italic) also for otherwise "plain" text. And having those stylings represented in a lightweight manner, in many cases. Not needing heavy-lifting with (say) HTML+CSS. More on that further below.

As we have noted already on this thread, we already have a standard for specifying background and foreground (the glyphs for the text) colour. As ESC (command) sequences. It even has (non-standard) "room" for an alpha channel (after the 6th ':', a parameter position otherwise unused for RGB; it is used for K of CMYK in the ITU T.416 standard).

Colour, RGB, with alpha channel T (0: opaque, 255: fully transparent; this way around since 0 is the default default value in these things), can be given with the detailed syntax below (it matches the overall syntax, so there is no overall syntax error for the detailed syntax). The brackets, except the single first one, indicate optional; strictly speaking everything after the "2" here is incrementally optional, but that is a nit; the i and the "a:s" are intended for different kinds of colour adjustments (at least the "i" one being implementation defined). But those are a bit too detailed to pick up here. The lowercase variables (but not the final m) are to be replaced by digits representing values 0 to 255. A syntax error would result in the command sequence being ignored. (If too long, longer than 35(?)
chars, the printable characters would be displayed, no interpretation as a command sequence.) The 2 means RGB (and, here, T) colour specification.

Foreground colour: ESC [38:2:i:r:g:b[:t[:a:s]]m
Background colour: ESC [48:2:i:r:g:b[:t[:a:s]]m

E.g. ESC [38:2:0:70:100:200:100m for a slightly transparent bluish foreground colour. Separator is (must be) colon, so as not to interfere with the permitted (but I would not recommend it) multiple style settings in a single SGR command sequence, using semicolon separator.

-----------

Now, colour for plain text? Well, lots of people are editing coloured plain text daily! Any decent modern IDE does automatic syntax colouring (and bold and italic). And that for program source text, which certainly does not have any HTML/CSS or any other higher-level (formatting) protocol applied to it.

Ok, the colouring/bold/italic is entirely internal. It is not saved in the files in any way, it is derived.

But it would be nice to sometimes keep the syntax colouring, when quoting a piece of program source code (from an IDE) into a chat conversation, for instance. Or pasting a piece of source code into a presentation slide or a document (in these cases any light-weight colouring/style would need to be converted to whatever representation is used for such things in those document formats, something more "heavy-weight").

And keep the formatting/colour in a light-weight manner, when copying/cutting (ctrl-c/ctrl-x) text from an IDE. One that is also easy to strip away (if pasting a perhaps modified version of it into a source file (via an IDE)). The "heavy-weight" ones are harder to strip away, and might not even be supported on the target platform. ESC/command sequences are easy to strip away, due to the starting control character and well-defined overall syntax, even though it is only the start character that is (otherwise) non-printable in the sequence. They were designed for being easy to parse out! And they are already standardised!
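
As a rough illustration of the two points above (the colon-separated colour syntax and the ease of stripping such sequences), here is a minimal sketch. The function names `sgr_rgb` and `strip_csi` are hypothetical; the space after ESC in the notation above is assumed to be purely notational and is not emitted; the optional trailing `a:s` fields are not modelled; and the transparency field `t` is the non-standard extension described in the message, not something terminals are known to honour.

```python
import re

ESC = "\x1b"

# CSI (ESC [ or U+009B), parameter bytes 0x30-0x3F, intermediates 0x20-0x2F, final 0x40-0x7E.
CSI_RE = re.compile(r"(?:\x1b\[|\x9b)[0-?]*[ -/]*[@-~]")


def sgr_rgb(r, g, b, t=None, *, i=0, background=False):
    """Build ESC[38:2:i:r:g:b[:t]m, or the 48 variant for background colour."""
    values = (i, r, g, b) if t is None else (i, r, g, b, t)
    if any(not 0 <= v <= 255 for v in values):
        raise ValueError("every field must be in the range 0..255")
    selector = "48" if background else "38"
    return ESC + "[" + ":".join([selector, "2", *map(str, values)]) + "m"


def strip_csi(text):
    """Remove every control sequence, leaving only the plain text."""
    return CSI_RE.sub("", text)


styled = sgr_rgb(70, 100, 200, 100) + "bluish" + ESC + "[m"
print(repr(styled))
print(strip_csi(styled))
```

Stripping needs no knowledge of what each sequence means, only of the overall syntax, which is the property the message relies on.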
Platform independently. And light-weight. Granted, they are, for now, only popular to implement in terminal emulators. But the styling command sequences are NOT specifically made for terminal (emulators). If you worry about actual ESC characters in source code (strings), those should be written as \e, or other more general escape sequence (a completely different, though somewhat related, sense of the term "escape sequence"), like \u001B. It is a REALLY bad idea to have a real escape character (U+001B) in a source code string literal. (Nit: The "predefined" colours in ECMA-48 are not useful for this. They are too stark. The IDEs (by default) use milder colours.) If you think that using styling on program source text is a new-fangled idea that came with the IDEs: No, it started already in the sixties. Algol-60 source text, when printed in books, had the keywords written in bold. For the *actual* programs, IIRC (at least for some compiler), one had to mark the keywords with underscore: _BEGIN_, _IF_, ... (No lowercase in computers then...) The keywords were initially not reserved, so one had to mark them. And... often stored as punched cards or punched paper tape... While possible, I do NOT propose to use command sequences to mark keywords (etc.) as bold (or colour) when input to a compiler. NOR do I propose to encode characters for punched hole patterns... (Have to draw the line somewhere. ;-) /Kent K Den 2019-02-11 17:11, skrev "wjgo_10009 at btinternet.com" : > Suppose that there are sixteen new characters, which are in plane 1 or > maybe plane 14, but which for this mailing list post I will express > using the digits 0 .. 9, Z, R, G, B, A, F. > > There would be a virtual machine to set the colour, that would have > registers h, r, g, b, a and a system service > Set_Foreground_Colour(r,g,b,a). 
> > Then the sixteen new characters would each have a default glyph, which > could be displayed emoji-style, and, in an application environment that > has the virtual machine available and switched on, would have the > following effects in the virtual machine and their glyphs would not then > be displayed. The virtual machine would be sandboxed. > > Z h:=0; > 0 h:=10*h ; > 1 h:=10*h + 1; > 2 h:=10*h + 2; > 3 h:=10*h + 3; > 4 h:=10*h + 4; > 5 h:=10*h + 5; > 6 h:=10*h + 6; > 7 h:=10*h + 7; > 8 h:=10*h + 8; > 9 h:=10*h + 9; > R r:=h; h:=0; > G g:=h; h:=0; > B b:=h; h:=0; > A a:=h; h:=0; > F Set_Foreground_Colour(r,g,b,a); > > Thus for example, remembering that these ordinary characters are just > being used here for explanation in this post, and that the actual > characters if encoded would probably be in plane 1 or plane 14: > > So the sequence Z128R160G248B255AF could be used to set the foreground > colour to an opaque blue colour. From unicode at unicode.org Mon Feb 11 19:57:26 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 12 Feb 2019 01:57:26 +0000 Subject: Encoding italic In-Reply-To: References: Message-ID: On 2019-02-11 6:42 PM, Kent Karlsson wrote: > Using a VS to get italics, or anything like that approach, will > NEVER be a part of Unicode! Maybe the crystal ball is jammed. This can happen, especially on the older models which use vacuum tubes. Wanting a second opinion, I asked the magic 8 ball: “Will VS14 italic be part of Unicode?” The answer was: “It is decidedly so.” From unicode at unicode.org Mon Feb 11 20:20:21 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 11 Feb 2019 21:20:21 -0500 Subject: Encoding colour (from Re: Encoding italic) In-Reply-To: References: Message-ID: On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote: > Continuing to look deep into the crystal ball, doing some more > hand swirls... > > ... > > ...
> > The scheme quoted (far) below (from wjgo_10009), or anything like it, > will NEVER be part of Unicode! Not in Unicode, but I have to say I'm intrigued by the idea of writing HTML with tag characters (not even necessarily "restricted" HTML: the whole deal). This does NOT make it possible to write "italics in plain text," since you aren't writing plain text. But what you can do is write rich text (HTML) that Just So Happens to look like plain text when rendered with a plain-text-renderer (and maybe there could be plain-text-renderers that straddle the line, maybe supporting some limited subset of HTML and doing boldface and italics or something). BUT, this would NOT be a Unicode feature/catastrophe at all. This would be purely the decision of the committee in charge of HTML/XML and related standards, to decide to accept Unicode tag characters as if they were ASCII for the purposes of writing XML tags/attributes &c. It's totally nothing to do with Unicode, unless the XML folks want Unicode to change some properties on the tag chars or something. I think it's a... fascinating idea, and probably has *disastrous* consequences lurking that I haven't tried to think of yet, but it's not a Unicode idea.
~mark From unicode at unicode.org Mon Feb 11 23:23:48 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 12 Feb 2019 05:23:48 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <001701d4b942$ca834e50$5f89eaf0$@xencraft.com> <69f43412.412.168a368b74a.Webtop.72@btinternet.com> <9d5a12a5-a1e0-7b39-4760-69533b6135c7@gmail.com> <7adb902b.3cb3.168bd1d21ed.Webtop.71@btinternet.com> <2893dab.6705.168ce1c0536.Webtop.71@btinternet.com> <99fff5a6-8918-d180-5bbe-b9268eaee96d@gmail.com> <3c40f3c0-3f84-546f-e955-9b15f5afee70@ix.netcom.com> <20190209105805.3884e35f@JRWUBU2> <8d8394ca-a753-227f-5526-f11d60854651@it.aoyama.ac.jp> <28ab3a43-624f-2563-d485-e7002e4b3b3b@gmail.com> <48a28b67-2e6a-8d0d-8808-edcac2a4ee44@gmail.com> Message-ID: Philippe Verdy wrote, >>> case mappings, >> >> Adjust them as needed. > > Not so easy: case mappings cannot be fixed. They are stabilized in Unicode. > You would need special casing rules under a specific "locale" for maths. In BabelPad, I can select a string of text and convert it to math italics. If upper case italics is desired, it would be necessary to select the text, convert it back to ASCII, convert it to upper case, and convert that upper case to math italics. Casing the math alphanumerics doesn’t seem to present any problem. Any program could make those interim steps invisible to the end user. (With VS14, BabelTags mark-up, or new control character(s), casing isn’t even an issue.)
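[Editorial sketch, not part of the original message.] The round trip described above (math italics back to ASCII, upper-case, then to math italics again) might look roughly like this in Python. The function names are hypothetical and this is not how BabelPad is implemented; it only assumes the layout of the Mathematical Italic letters, where the slot for small h is reserved and U+210E PLANCK CONSTANT is used instead.

```python
ITALIC_UPPER_A = 0x1D434  # MATHEMATICAL ITALIC CAPITAL A
ITALIC_LOWER_A = 0x1D44E  # MATHEMATICAL ITALIC SMALL A
PLANCK = "\u210e"         # stands in for italic small h (U+1D455 is reserved)


def to_math_italic(text):
    """Map ASCII letters to Mathematical Italic; leave everything else alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(ITALIC_UPPER_A + ord(ch) - ord("A")))
        elif ch == "h":
            out.append(PLANCK)
        elif "a" <= ch <= "z":
            out.append(chr(ITALIC_LOWER_A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def from_math_italic(text):
    """Inverse of to_math_italic: fold Mathematical Italic back to ASCII."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ITALIC_UPPER_A <= cp < ITALIC_UPPER_A + 26:
            out.append(chr(ord("A") + cp - ITALIC_UPPER_A))
        elif ch == PLANCK:
            out.append("h")
        elif ITALIC_LOWER_A <= cp < ITALIC_LOWER_A + 26:
            out.append(chr(ord("a") + cp - ITALIC_LOWER_A))
        else:
            out.append(ch)
    return "".join(out)


def upper_math_italic(text):
    """Upper-case italic text via the ASCII round trip described above."""
    return to_math_italic(from_math_italic(text).upper())
```

A plain `str.upper()` on the italic string would leave it unchanged, since the math alphanumerics carry no case mappings; the interim ASCII step is what makes the casing work.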
From unicode at unicode.org Mon Feb 11 23:53:03 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 12 Feb 2019 05:53:03 +0000 Subject: Vendor-assigned emoji (was: Encoding italic) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: <7955f6fa-b683-ca92-37ec-649d3a2afa3f@gmail.com> On 2019-01-24 Andrew West wrote, > The ESC and UTC do an appallingly bad job at regulating emoji, and I > would like to see the Emoji Subcommittee disbanded, and decisions on > new emoji taken away from the UTC, and handed over to a consortium or > committee of vendors who would be given a dedicated vendor-use emoji > plane to play with (kinda like a PUA plane with pre-assigned > characters with algorithmic names [VENDOR-ASSIGNED EMOJI XXXXX] which > the vendors can then associate with glyphs as they see fit; and as > emoji seem to evolve over time they would be free to modify and > reassign glyphs as they like because the Unicode Standard would not > define the meaning or glyph for any characters in this plane). Nobody disagreed and I think it’s a splendid suggestion. If anyone is discussing drafting a proposal to accomplish this, please include me in the “cc”.
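[Editorial sketch, not part of the original messages.] The register machine that William Overington describes in the "Encoding colour" thread (sixteen characters written here as 0 .. 9, Z, R, G, B, A, F) can be modelled in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, the ASCII stand-ins are the ones used in the posts rather than any encoded characters, and each F simply records the colour instead of calling a real Set_Foreground_Colour service.

```python
def run_colour_vm(sequence):
    """Interpret the stand-in characters and return one (r, g, b, a) per F."""
    h = 0
    regs = {"r": 0, "g": 0, "b": 0, "a": 0}
    calls = []
    for ch in sequence:
        if ch == "Z":
            h = 0                      # Z  h:=0;
        elif ch in "0123456789":
            h = 10 * h + int(ch)       # d  h:=10*h + d;
        elif ch in "RGBA":
            regs[ch.lower()] = h       # R  r:=h; h:=0;  (likewise G, B, A)
            h = 0
        elif ch == "F":
            # F  Set_Foreground_Colour(r,g,b,a); recorded here, not executed
            calls.append((regs["r"], regs["g"], regs["b"], regs["a"]))
        else:
            raise ValueError(f"not a colour-machine character: {ch!r}")
    return calls


colours = run_colour_vm("Z128R160G248B255AF")
```

For the example sequence from the post, `colours` holds the single tuple (128, 160, 248, 255).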
From unicode at unicode.org Tue Feb 12 06:50:00 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Tue, 12 Feb 2019 13:50:00 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: Hi Philippe, > The monospace restriction is a strong limitation: but then I don't see why a "terminal" could not handle fonts with variable metrics, and why it must be modeled only as a regular grid of rectangular cells (all of equal size) containing only one "character" (or cluster?). Because this is what a "terminal" currently is, this is one of the basic assumptions around which gazillions of libraries and applications were built. Just one example: A utility might query the width, let's say it's 80 columns. Then it can print either 81 "i"s, or 81 "w"s, and in both cases it can be sure that the last one will be aligned exactly below the first one. You can surely change this. But then you'll have to heavily adjust the behavior of all the screen drawing libraries and all the applications that use these libraries or do their own screen handling. It's out of the scope of my work to do anything like this. If you feel like it, I encourage you to go ahead, put your work in it, and present a proof of concept. > So using controls, you would try to mimic again what HTML already provides you for free (and without complex specifications and redevelopment). Show me that "without complex specifications and redevelopment" because all I see is the need to heavily rewrite plenty of libs and tools that were created and continuously developed during the last few decades. I don't really see this approach as feasible.
Feel free to prove me wrong by presenting software that works on top of the redefined terminal emulator concept, at least on a proof of concept level. For starters, I'd love to see a shell with interactive line editing (like bash, zsh), and one application that uses vertical alignment heavily, let's say "top" or anything similar, using proportional font in your newly created world. cheers, egmont From unicode at unicode.org Tue Feb 12 07:08:01 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Tue, 12 Feb 2019 14:08:01 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <20190210014905.352d8c99@JRWUBU2> Message-ID: Hi Elias, > For all the willingness to come up with ways to modernise the terminal, you've only spoken about trying to shoehorn rtl text into the vt102 basic terminal. Yes, addressing BiDi was the exact thing that I did now. What's wrong with that? I can't address all the imperfections at once. If you take a look at VTE's changelog, you'll see that I've done a lot more than this, and chances are this won't be my last improvement either. > What I mean is that if you're willing to go as far as introducing new escape codes to allow applications to better control the behaviour of this one feature, why do you stop there? Why still limit yourself to the bonds of vt102? Did I say I'll stop here? No, I presented one step, without saying anything about what might be the next one I tackle. (Okay, I drafted out some ideas for continuing this work, and I said things about what will definitely _not_ be the next step, as far as I'm concerned.) > Once you take that first step towards the new control codes, why not simply come up with a new scheme?
Why not let me do: > > TERM=newfancything > > And then I'd have a system that supports everything I need: variable width fonts, proper rtl text, pixel-precise character positioning, all the colours, inline graphics, etc. Because this would create a brand new world where practically every application has to be heavily adjusted, if not built up from scratch (e.g. for ncurses, I'd expect that a new replacement would have to be designed and created). Because this is not solely an engineering kind of task, but rather something that would need buy-in from a critical set of people (the maintainers of all these libs and apps, and the other popular terminals), which I find unlikely to get, given that for most of these apps the current platform is good enough, and something new would add a significant amount of extra burden for marginal benefits. Because, even if everyone supported the idea, the required amount of design and implementation work would be magnitudes bigger than for BiDi. Because I'm doing one thing at a time. And honestly, just because I came here to announce my work that addresses _one_ thing, I really don't find it a fair question to ask why I didn't suddenly address magnitudes more than that. Because I'm doing this as a hobby project, not as a paid job. If someone offers me a job to do this, we can discuss it. > There is nothing magic about the grid of cells, and once you introduce new escape sequences, you might as well truly modernise the terminal. The magic about the grid of cells is all the software that was built up with this assumption during the last couple of decades. cheers, egmont From unicode at unicode.org Tue Feb 12 11:05:55 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Tue, 12 Feb 2019 18:05:55 +0100 Subject: Encoding colour (from Re: Encoding italic) In-Reply-To: Message-ID: Den 2019-02-12 03:20, skrev "Mark E.
Shoulson via Unicode" : > On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote: >> Continuing to look deep into the crystal ball, doing some more >> hand swirls... >> >> ... >> >> ... >> >> The scheme quoted (far) below (from wjgo_10009), or anything like it, >> will NEVER be part of Unicode! > > Not in Unicode, but I have to say I'm intrigued by the idea of writing > HTML with tag characters (not even necessarily "restricted" HTML: the > whole deal). This does NOT make it possible to write "italics in plain > text," since you aren't writing plain text. But what you can do is > write rich text (HTML) that Just So Happens to look like plain text when > rendered with a plain-text-renderer (and maybe there could be > plain-text-renderers that straddle the line, maybe supporting some > limited subset of HTML and doing boldface and italics or something). And so would ESC/command sequences as such, if properly skipped for display. If some are interpreted, those would affect the display of other characters. Just like "HTML in tag characters" would. A show invisibles mode would display both ESC/command sequences as well as "HTML in tag characters" characters. > BUT, this would NOT be a Unicode feature/catastrophe at all. This would > be purely the decision of the committee in charge of HTML/XML and > related standards, to decide to accept Unicode tag characters as if they > were ASCII for the purposes of writing XML tags/attributes &c. It's I have no say on HTML/CSS, but I would venture to predict that those who do have a say, would not be keen on that idea. And XML tags in general need not be in ASCII. And... identifiers in CSS need not be in pure ASCII either... And attribute values, like filenames including those that refer to CSS files (CSS is preferably stored separately from the HTML/XML), certainly need not be pure ASCII. So, no, I'd say that that idea is completely dead.
/Kent K > totally nothing to do with Unicode, unless the XML folks want Unicode to > change some properties on the tag chars or something. I think it's a... > fascinating idea, and probably has *disastrous* consequences lurking > that I haven't tried to think of yet, but it's not a Unicode idea. > > ~mark > From unicode at unicode.org Tue Feb 12 11:06:10 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Tue, 12 Feb 2019 18:06:10 +0100 Subject: Encoding italic In-Reply-To: Message-ID: Oh, the crystal ball is pure solid state, no moving or hot parts. A magic 8-ball on the other hand can easily get jammed... (Now, enough of that...) /K Den 2019-02-12 02:57, skrev "James Kass via Unicode" : > > On 2019-02-11 6:42 PM, Kent Karlsson wrote: > >> Using a VS to get italics, or anything like that approach, will >> NEVER be a part of Unicode! > > Maybe the crystal ball is jammed. This can happen, especially on the > older models which use vacuum tubes. > > Wanting a second opinion, I asked the magic 8 ball: > “Will VS14 italic be part of Unicode?” > The answer was: > “It is decidedly so.” > From unicode at unicode.org Tue Feb 12 14:31:30 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 12 Feb 2019 20:31:30 +0000 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> Message-ID: <20190212203130.6cc47de4@JRWUBU2> On Tue, 12 Feb 2019 13:50:00 +0100 Egmont Koblinger via Unicode wrote: > For > starter, I'd love to see a shell with interactive line editing (like > bash, zsh),... Bash already seems to handle proportional fonts quite well when run under Emacs 'M-x shell', which is more than can be said for bash on Gnome-terminal or an Emacs terminal!
In the latter two, it cannot synchronise text display and cursor position. Richard. From unicode at unicode.org Wed Feb 13 02:53:20 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Wed, 13 Feb 2019 09:53:20 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: <20190212203130.6cc47de4@JRWUBU2> References: <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <83sgwwn8dx.fsf@gnu.org> <20190212203130.6cc47de4@JRWUBU2> Message-ID: On Tue, Feb 12, 2019 at 9:35 PM Richard Wordingham via Unicode wrote: > Bash already seems to handle proportional fonts quite well when run > under Emacs 'M-x shell', Having never used bash inside Emacs's shell, here's my experience after about a minute of trying it: Cursor keys allow you to walk back to the prompt, backspace allows to delete the prompt, typing letters lets you modify the prompt... Not something that I consider a sensible behavior. If I do so, I have no idea what the executed command will be. Coloring gives some clue, but isn't always reliable. My prompt is blue, the text I type after that is black. I type one letter and then press Ctrl-T to transpose the last two letters (the trailing space of my prompt, and the newly typed letter). The newly typed letter is black. I press Enter, this one-letter command isn't executed, and becomes blue. I feel magnitudes safer in standard bash where I know it doesn't allow me to walk back to the prompt, only allows me to edit whatever I'm trying to execute. I have not studied how this behavior is implemented, but as per [1] as well as the behavior I experience, it seems that lot of bash's behavior wrt. line editing is moved to Emacs itself. Pretty much none of my preferred shortcuts work as they do in native bash, something I'm not happy about either. 
I've no idea how this (external editing) would be expected to be the generic behavior when there's no Emacs (no external editor) in the game, plus a whole bunch of other utilities are expected to run (ones that fail big time in Emacs's M-x shell, or even refuse to start up). [1] https://www.gnu.org/software/emacs/manual/html_node/emacs/Shell-Mode.html From unicode at unicode.org Wed Feb 13 13:05:16 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Wed, 13 Feb 2019 19:05:16 +0000 (GMT) Subject: Vendor-assigned emoji (was: Encoding italic) In-Reply-To: <6f5fbda.16e5.168e27256a9.Webtop.72@btinternet.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <7955f6fa-b683-ca92-37ec-649d3a2afa3f@gmail.com> <6f5fbda.16e5.168e27256a9.Webtop.72@btinternet.com> Message-ID: <1b56845a.2568.168e83fb9e5.Webtop.71@btinternet.com> James Kass wrote: > Nobody disagreed and I think it?s a splendid suggestion.? If anyone is > discussing drafting a proposal to accomplish this, please include me > in the ?cc?. I too would like to receive copies of any discussions please. In relation to the proposal, I opine that the facility should not allow a glyph that has been assigned to be changed at a later date. Given that discussion is about a whole plane of code points being assigned, then even if the code points are assigned at fifty every month that would take over one hundred years to fill a whole plane. Certainly early months might have more than fifty allocations. It is important to have stability as otherwise archived messages could have their meaning retrospectively changed with no easy way to find out the original meaning. 
William Overington Tuesday 12 February 2019 From unicode at unicode.org Wed Feb 13 13:10:35 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Wed, 13 Feb 2019 19:10:35 +0000 (GMT) Subject: Encoding colour (from Re: Encoding italic) In-Reply-To: <566149ed.87a.168dd5381ae.Webtop.71@btinternet.com> References: <566149ed.87a.168dd5381ae.Webtop.71@btinternet.com> Message-ID: <3895f39a.2578.168e84496bb.Webtop.71@btinternet.com> Philippe Verdy replied to my post, including quoting me. WJGO >> Thinking about this further, for this application copies of the glyphs could be redesigned so as to be square and could be emoji-style and the meanings of the characters specifying which colour component is to be set could be changed so that they refer to the number previously entered using one or more of the special digit characters. Thus the setting of colour components could be done in the same reverse notation way that the FORTH computer language works. PV > FORTH is not relevant to this discussion. I just mentioned FORTH because of the way that numbers are entered before the operators that act upon them. I have no intention to use a stack-based system: what I have in mind at present is much simpler than such a format. Suppose that there are sixteen new characters, which are in plane 1 or maybe plane 14, but which for this mailing list post I will express using the digits 0 .. 9, Z, R, G, B, A, F. There would be a virtual machine to set the colour, that would have registers h, r, g, b, a and a system service Set_Foreground_Colour(r,g,b,a). Then the sixteen new characters would each have a default glyph, which could be displayed emoji-style, and, in an application environment that has the virtual machine available and switched on, would have the following effects in the virtual machine and their glyphs would not then be displayed. The virtual machine would be sandboxed. 
Z  h:=0;
0  h:=10*h;
1  h:=10*h + 1;
2  h:=10*h + 2;
3  h:=10*h + 3;
4  h:=10*h + 4;
5  h:=10*h + 5;
6  h:=10*h + 6;
7  h:=10*h + 7;
8  h:=10*h + 8;
9  h:=10*h + 9;
R  r:=h; h:=0;
G  g:=h; h:=0;
B  b:=h; h:=0;
A  a:=h; h:=0;
F  Set_Foreground_Colour(r,g,b,a);
Thus for example, remembering that these ordinary characters are just being used here for explanation in this post, and that the actual characters if encoded would probably be in plane 1 or plane 14: So the sequence Z128R160G248B255AF could be used to set the foreground colour to an opaque blue colour. It may be that, upon investigation, a feature of the system service Set_Foreground_Colour(r,g,b,a) could be specified such that "if a=0 then a:=255;" so that total opacity of the colour is presumed unless otherwise set. PV > You may create your "proof of concept" (tested on limited configurations) but it will just be private Yes. PV > [And so it should use PUA for full compatibility ... Yes, I have in mind to use U+EA60 through to U+EA69 for the digits, as U+EA60 is Alt 60000 so it makes it easier if some of the people who want to experiment want to enter characters using the Alt method. William Overington Monday 11 February 2019 -------------- next part -------------- A non-text attachment was scrubbed... Name: an_opaque_blue_colour.png Type: image/png Size: 3528 bytes Desc: not available URL: From unicode at unicode.org Wed Feb 13 19:19:41 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 13 Feb 2019 20:19:41 -0500 Subject: Encoding colour (from Re: Encoding italic) In-Reply-To: References: Message-ID: On 2/12/19 12:05 PM, Kent Karlsson via Unicode wrote: > Den 2019-02-12 03:20, skrev "Mark E. Shoulson via Unicode" > : > >> On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote: >>> Continuing to look deep into the crystal ball, doing some more >>> hand swirls... >>> >>> ... >>> >>> ... >>> >>> The scheme quoted (far) below (from wjgo_10009), or anything like it, >>> will NEVER be part of Unicode!
>> Not in Unicode, but I have to say I'm intrigued by the idea of writing >> HTML with tag characters (not even necessarily "restricted" HTML: the >> whole deal). This does NOT make it possible to write "italics in plain >> text," since you aren't writing plain text. But what you can do is >> write rich text (HTML) that Just So Happens to look like plain text when >> rendered with a plain-text-renderer (and maybe there could be >> plain-text-renderers that straddle the line, maybe supporting some >> limited subset of HTML and doing boldface and italics or something. > And so would ESC/command sequences as such, if properly skipped for display. > If some are interpreted, those would affect the display of other characters. > Just like "HTML in tag characters" would. A show invisibles mode would > display both ESC/command sequences as well as "HTML in tag characters" > characters. Very true. Maybe the explicitness of HTML appealed to me; escape sequences feel more like... you know, computer "codes" and all. (which of course is what all this is anyway! So what's wrong with that?) >> BUT, this would NOT be a Unicode feature/catastrophe at all. This would >> be purely the decision of the committee in charge of HTML/XML and >> related standards, to decide to accept Unicode tag characters as if they >> were ASCII for the purposes of writing XML tags/attributes &c. It's > I have no say on HTML/CSS, but I would venture to predict that those > who do have a say, would not be keen on that idea. And XML tags in > general need not be in ASCII. And... identifiers in CSS need not > be in pure ASCII either... And attribute values, like filenames > including those that refer to CSS files (CSS is preferably stored > separately from the HTML/XML), certainly need not be pure ASCII.) > > So, no, I'd say that that idea is completely dead.
You're probably right, and CSS is practically a different animal, and I guess at best one would have to settle for a stripped-down version of HTML (in which case, why bother?) And again, all this is before we even consider other issues; I can't shake the feeling that there are security nightmares lurking inside this idea. ~mark From unicode at unicode.org Wed Feb 13 23:27:51 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 13 Feb 2019 21:27:51 -0800 Subject: Encoding colour (from Re: Encoding italic) In-Reply-To: References: Message-ID: <177e99e6-b1d9-e436-55be-e0a511b43f8f@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 14 04:30:20 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 14 Feb 2019 11:30:20 +0100 Subject: Bidi paragraph direction in terminal emulators In-Reply-To: References: <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> <20190208064044.27f75709@JRWUBU2> <831s4ir5cq.fsf@gnu.org> <20190208215558.59fc19f5@JRWUBU2> <8336oyori9.fsf@gnu.org> <20190209001814.4813f28f@JRWUBU2> <83y36po1bi.fsf@gnu.org> <20190209140648.12174bf4@JRWUBU2> <20190210014905.352d8c99@JRWUBU2> Message-ID: Le mar. 12 févr. 2019 à 14:16, Egmont Koblinger via Unicode < unicode at unicode.org> a écrit : > > There is nothing magic about the grid of cells, and once you introduce > new escape sequences, you might as well truly modernise the terminal. > > The magic about the grid of cells is all the software that were built > up with this assumption during the last couple of decades. > The minimum to support (which is already used in VT* terminals) needs to include support for "dualspace" rendering (i.e. characters rendered in one or two cells), widely used for CJK (half-width and fullwidth characters). If the terminal has square cells only one variant is needed (i.e. a monospace cell), but common terminals today use rectangular cells.
Thanks Unicode has properties about that, allowing controls to select the appropriate variant (plus legacy encodings for parts of Latin/Greek/Cyrillic). But the extension would be needed for other scripts. And a control in the VT* protocol to select the variant (which would take effect in terminals configured in dualspace rendering mode which is normally the default mode in East Asia). This should apply to other South Asian scripts and most emojis, and adding some control would extend the dualspace rendering to cover the whole Unicode (without having to use the few compatibility characters specifically encoded at end of the BMP). Unfortunately Unicode still does not have any standard variant selector (or other format control) to control that at least at cluster level. This would mean adding some custom escape sequence to the VT* protocol (using the compatibility characters for half-width/fullwidth should be deprecated), which would be also more efficient than having to use variant selector or format controls after each character (this solution works for isolated characters) or having to configure the terminal in ugly monospace mode (with typically 40 cells by line instead of 80) which is only fine for CJK, or for output to old analog TV with very low vertical resolution (below ~400 pixels with cells about 8x8 pixels at most) such as old CGA, Teletext, and early 8-bit personal computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Feb 14 05:22:38 2019 From: unicode at unicode.org (Takao Fujiwara via Unicode) Date: Thu, 14 Feb 2019 20:22:38 +0900 Subject: CLDR emoji tagging in SVN Message-ID: <683c0dbd-6838-7f39-ed69-fbb35acd8eaa@redhat.com> Could you make a tag in the SVN or create a zip file for emoji 12.0? My understanding is Unicode emoji 12.0 has been released and I'd like to get the annotations and the translations. 
Previously I could get http://www.unicode.org/repos/cldr/tags/release-34 but now I don't know which revision corresponds to emoji 12.0. I'd like to get common/annotations and common/annotationsDerived only. Thanks, Fujiwara From unicode at unicode.org Fri Feb 15 18:54:25 2019 From: unicode at unicode.org (Andrés Sanhueza via Unicode) Date: Fri, 15 Feb 2019 21:54:25 -0300 Subject: Spiral symbol In-Reply-To: <641014857.20130122231052@acssoft.de> References: <641014857.20130122231052@acssoft.de> Message-ID: El mar., 22 ene. 2013 a las 19:11, Karl Pentzlin () escribió: > Am Dienstag, 22. Januar 2013 um 01:11 schrieb Andrés Sanhueza: > > AS> I have wondered if it may be a good idea to make a proposal to an > AS> "spiral" character, basically because I believe is the only major > AS> symbol recurrently used to represent "swearing" in comics that's > AS> missing from Unicode. > > In 2011, I produced a proposal in which I tried to complete the set of > Comic symbols which went into Unicode with the Emoji set, based on the > fact that a confined set of such symbols is found regularly as part of > plain text contained in the speech balloons of comics. > This document, not surprisingly, contains the spiral (in fact, two > variants of it). > Question: Why exactly are both a right and a left facing spiral needed? Isn't a single one (whose direction is just a glyph variant) enough? There was a previous thread that also suggested these very symbols, but otherwise I have found no evidence of the specific need for it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Feb 16 08:24:36 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sat, 16 Feb 2019 14:24:36 +0000 Subject: Spiral symbol In-Reply-To: References: <641014857.20130122231052@acssoft.de> Message-ID: <5ABBFBF3-C21F-49B8-A330-B99F88EB685E@evertype.com> > Question: Why exactly are both a right and a left facing spiral needed?
Isn't a single one (whose direction is just a glyph variant) enough? There was a previous thread that also suggested these very symbols, but otherwise I have found no evidence of the specific need for it. Clockwise and anticlockwise are not the same thing. Michael Everson From unicode at unicode.org Sun Feb 17 06:59:16 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 17 Feb 2019 13:59:16 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: On Fri, 8 Feb 2019 at 13:56, Egmont Koblinger wrote: > Philippe, I hate to say it, but at the risk of being impolite, I just > have to. > Resist this idea, I've not been impolite. I just want to show you that terminals are legacy environments that are far behind what is needed for proper internationalization. When I raised the problem of monospaced fonts and described the case of "dualspace" fonts, I was pointing to something already used in legacy terminals to solve practical problems (there is even data in the UCD about it): dualspace is an excellent solution that should be extended even outside CJK contexts (for example with emojis, and various other South Asian scripts).
From unicode at unicode.org Mon Feb 18 03:47:28 2019 From: unicode at unicode.org (Egmont Koblinger via Unicode) Date: Mon, 18 Feb 2019 10:47:28 +0100 Subject: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators) In-Reply-To: References: <83k1in30kh.fsf@gnu.org> <20190130220248.1a11f80e@JRWUBU2> <20190131231719.2b545f7f@JRWUBU2> <20190202040142.6647729f@JRWUBU2> <20190202205701.0b0a332d@JRWUBU2> <831s4ox388.fsf@gnu.org> <83womgvnhf.fsf@gnu.org> <83pns8vk05.fsf@gnu.org> <20190204011921.3411cc77@JRWUBU2> <83bm3rv6kd.fsf@gnu.org> <20190204194513.7377857e@JRWUBU2> <834l9juw44.fsf@gnu.org> Message-ID: On Sun, Feb 17, 2019 at 1:59 PM Philippe Verdy wrote: > Resist this idea, I've not been impolite. I didn't say a word about you being impolite. I said I might be impolite for not wishing to continue this discussion in that direction. > I just want to show you that terminals are legacy environments You might have missed the thread's opening mail where I mentioned that I've been developing a terminal emulator for five years. So I'm not sure what exactly you want to show me about what a legacy environment it is; I think I know it perfectly well. > that are far behind what is needed for proper internationalization For many languages (or should I say scripts) internationalization is pretty well solved in terminals. For others, requiring LTR complex rendering, so-so. For RTL scripts it's a straight disaster: an application can't even count on the letters of a word showing up in the expected order, no matter what it does. My work fixes the latter only, within(!) the limitations of this legacy environment. I don't find it feasible to get rid of this legacy (the concept of strict grid), and I find it a waste of time to ponder about it. Not sure why after about 200 mails on the topic, I still have a hard time getting this message through.
Seems to me that folks here on the Unicode list want everything to be perfect for all the scripts at once and not compromise to the slightest bit; and don't really appreciate work that only offers partial improvement due to a special context's constraints. This is something I didn't expect when I posted to this list. At this point I think I've gathered all the actionable positive feedback I could (two issues: one is that shaping needs to be done differently, and the other one is that the paragraph direction should be detected on larger chunks of data (at least optionally) - thanks again for them, I'll rework my spec accordingly). For all the rest, irrelevant and hopeless stuff, like switching to proportional fonts, IMO it's high time we let this thread end here. cheers, egmont From unicode at unicode.org Mon Feb 18 06:52:01 2019 From: unicode at unicode.org (Andrés Sanhueza via Unicode) Date: Mon, 18 Feb 2019 09:52:01 -0300 Subject: Spiral symbol In-Reply-To: <5ABBFBF3-C21F-49B8-A330-B99F88EB685E@evertype.com> References: <641014857.20130122231052@acssoft.de> <5ABBFBF3-C21F-49B8-A330-B99F88EB685E@evertype.com> Message-ID: I understand the difference. My question was why it was needed to have both spirals as different characters instead of a single one that can be either, as the proposal didn't specify an use case where there is a semantic difference between each one. On Sat, 16 Feb 2019 at 11:26, Michael Everson via Unicode wrote: > > Question: Why both a right and left facing spiral are exactly need? > Isn't a single one (whose direction is just a glyph variant) enough? There > was a previous thread that also suggested these very symbols, but otherwise > I have found no evidence of the specific need for it. > > Clockwise and anticlockwise are not the same thing. > > Michael Everson >
From unicode at unicode.org Mon Feb 18 08:38:28 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Mon, 18 Feb 2019 14:38:28 +0000 Subject: Spiral symbol In-Reply-To: References: <641014857.20130122231052@acssoft.de> <5ABBFBF3-C21F-49B8-A330-B99F88EB685E@evertype.com> Message-ID: <8286CA64-A753-442C-B315-C507E3163264@evertype.com> Emoji proposals aren't notable for their uniform quality. There are guidelines, but essentially the subcommittee approves things they like and don't approve things they don't like. > On 18 Feb 2019, at 12:52, Andrés Sanhueza via Unicode wrote: > > I understand the difference. My question was why it was needed to have both spirals as different characters instead of a single one that can be either, as the proposal didn't specify an use case where there is a semantic difference between each one. > > > On Sat, 16 Feb 2019 at 11:26, Michael Everson via Unicode wrote: > > Question: Why both a right and left facing spiral are exactly need? Isn't a single one (whose direction is just a glyph variant) enough? There was a previous thread that also suggested these very symbols, but otherwise I have found no evidence of the specific need for it. > > Clockwise and anticlockwise are not the same thing. > > Michael Everson From unicode at unicode.org Tue Feb 19 09:03:16 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 19 Feb 2019 15:03:16 +0000 (GMT) Subject: Spiral symbol Message-ID: <4d2acfc0.578d.16906484fec.Webtop.71@btinternet.com> I seem to remember from reading a book many years ago, maybe around fifty years ago, something about one of the early chemists (Lavoisier?) having used two symbols, each a spiral, mirror images of each other, for two different things, maybe oxidation and reduction, in his manuscript but he had had to abandon the idea because when he wanted his text printed the printer did not have any metal sorts of a spiral and so he had to use some other format.
Does that ring a bell with anyone please? I remember that where I read it there were two spiral motifs displayed in the text. At the time I was already into Private Press Printing using letterpress on a handpress using metal type and producing designs using single type border sorts, so that is perhaps why that has remained in my memory. William Overington Tuesday 19 February 2019 From unicode at unicode.org Thu Feb 21 03:06:34 2019 From: unicode at unicode.org (via Unicode) Date: Thu, 21 Feb 2019 10:06:34 +0100 Subject: Unihan variants information In-Reply-To: <20190128114931.665a7a7059d7ee80bb4d670165c8327d.ac38193053.wbe@email03.godaddy.com> References: <20190128114931.665a7a7059d7ee80bb4d670165c8327d.ac38193053.wbe@email03.godaddy.com> Message-ID: <4DE0D822-845E-4BF6-9BA8-E5FB44BEB7CB@ouvaton.org> > On 28 Jan 2019 at 19:49, Doug Ewell via Unicode wrote: > > Michel MARIANI wrote: > >> I've developed an open-source, multi-platform desktop application >> called Unicode Plus > > Before you get too heavily invested in this product name, you may want > to: > > 1. check out the page "Unicode® Copyright and Terms of Use" located at > http://www.unicode.org/copyright.html, and > > 2. send a quick note to the Consortium officers asking whether they are > OK with this use of the Unicode name. To be on the safe side, the application has been renamed Unicopedia Plus. BTW, I tried in the meanwhile to find a more specific Unihan-related mailing list, but there is apparently none... Thanks again. -- Michel MARIANI From unicode at unicode.org Fri Feb 22 03:07:06 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 22 Feb 2019 09:07:06 +0000 Subject: USE Indic Syllabic Category Message-ID: <20190222090706.16814c4e@JRWUBU2> Where can I find the InSc properties of characters as overridden for the USE of Windows?
I am trying to work out why on MS Edge I am now getting dotted circles before U+1A7A TAI THAM SIGN RA HAAM in all of: ??????? rank /sak/ , ?????????? giant fennel /ma ha? hi?/ and ??????? science /sa?t/ ? U+1A7A used to have InSC=Syllable_Modifier, for which these would all work (at the cost of ??????? to serve /s??p/ failing), which was then changed to InSC=Pure_Killer, which will work for all of them once the USE acknowledges that subjoined consonants may follow vowels (as in old-fashioned Khmer - see TUS) and that vowels below precede vowels above in Tai Tham (see Lanna/Tai Tham proposals). My best hypothesis (not thoroughly tested) is that Windows currently has InSc=Consonant_Killer, but can I look his up as opposed to effectively devising a test suite for USE on Office? Richard. From unicode at unicode.org Fri Feb 22 09:29:00 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 22 Feb 2019 15:29:00 +0000 Subject: USE Indic Syllabic Category In-Reply-To: <20190222090706.16814c4e@JRWUBU2> References: <20190222090706.16814c4e@JRWUBU2> Message-ID: <20190222152900.03f86c9f@JRWUBU2> On Fri, 22 Feb 2019 09:07:06 +0000 Richard Wordingham via Unicode wrote: > My best hypothesis (not thoroughly tested) is that Windows currently > has InSc=Consonant_Killer, but can I look his up as opposed to > effectively devising a test suite for USE on Office? That question's rather mangled. It should have said: My best hypothesis (not thoroughly tested) is that Windows currently has InSc=Consonant_Killer, but can where I look this up as opposed to effectively devising a test suite for USE on Windows? FWIW, HarfBuzz currently has VAbv 'vowel above', in accordance with the Unicode 11.0 properties. Richard. 
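For the Unicode side of the question above (the published InSc values, as opposed to whatever overrides a particular USE implementation applies internally), the lookup can be scripted. What follows is only a minimal sketch in Python, assuming input lines in the "code point(s) ; value # comment" layout used by the UCD's IndicSyllabicCategory.txt. The names parse_insc, lookup and SAMPLE are hypothetical, and the three sample rows merely restate values quoted in this thread; they are not an authoritative excerpt of the data file.

```python
from typing import Dict, Iterable


def parse_insc(lines: Iterable[str]) -> Dict[int, str]:
    """Map each listed code point to its Indic_Syllabic_Category value."""
    table: Dict[int, str] = {}
    for raw in lines:
        # Everything after '#' is a comment; blank lines carry no data.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        cp_field, value = (part.strip() for part in data.split(";", 1))
        # A field is either a single code point or a range "XXXX..YYYY".
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            table[cp] = value
    return table


def lookup(table: Dict[int, str], cp: int) -> str:
    """Return the category, falling back to the UCD default 'Other'."""
    return table.get(cp, "Other")


# Hypothetical sample: values as quoted in this thread, not copied from the UCD.
SAMPLE = """\
# illustrative rows only
1A55..1A56    ; Consonant_Medial  # medial RA, medial LA
1A7A          ; Pure_Killer       # TAI THAM SIGN RA HAAM
1A7C          ; Syllable_Modifier
"""

if __name__ == "__main__":
    insc = parse_insc(SAMPLE.splitlines())
    for cp in (0x1A55, 0x1A7A, 0x1A7C, 0x0041):
        print(f"U+{cp:04X} InSC={lookup(insc, cp)}")
```

Pointing parse_insc at a real copy of the data file (for example, open("IndicSyllabicCategory.txt", encoding="utf-8")) gives the values for a chosen Unicode version. It cannot show what Windows or HarfBuzz substitute on top of them, which still has to come from the shaper's own documentation or from testing.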
From unicode at unicode.org Fri Feb 22 14:04:30 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 22 Feb 2019 12:04:30 -0800 Subject: USE Indic Syllabic Category In-Reply-To: <20190222152900.03f86c9f@JRWUBU2> References: <20190222090706.16814c4e@JRWUBU2> <20190222152900.03f86c9f@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 22 19:47:36 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 23 Feb 2019 01:47:36 +0000 Subject: USE Indic Syllabic Category In-Reply-To: References: <20190222090706.16814c4e@JRWUBU2> <20190222152900.03f86c9f@JRWUBU2> Message-ID: <20190223014736.62564d44@JRWUBU2> On Fri, 22 Feb 2019 22:19:25 +0000 Andrew Glass wrote: > Thank you Richard for pointing out the issue with 0x1A7A > I've looked into this and found an error in our tooling that has this > mapped this to Halant. Based on the spec this should be VAbv. I've > filed a bug. Thanks. Will the correction be rolled out to all Microsoft Windows 10 customers at about the same time? I appreciate that corporate customers may impose their own extra, internal delays - my employer is still on Windows 7. In the meantime, I've updated my fonts (Da Lekh and Lamphun) to correct the problem. However, such corrections run the risk of wrongly deleting dotted circles that come from the backing store, and so are not Unicode-compliant. The sooner I can remove the corrections, the better. > > Where can I find the InSc properties of characters as overridden > > for the USE of Windows? > USE spec includes overrides to ISC and IPC: > https://docs.microsoft.com/en-gb/typography/script-development/use#overrides I had the impression there were more overrides than just those. > > once the USE acknowledges that subjoined consonants may follow > > vowels > I expect to update the USE spec to address this soon. That seems welcome news. I still don't know what the problem with supporting them has been. Richard. 
From unicode at unicode.org Sat Feb 23 00:46:27 2019 From: unicode at unicode.org (梁海 Liang Hai via Unicode) Date: Sat, 23 Feb 2019 14:46:27 +0800 Subject: USE Indic Syllabic Category In-Reply-To: <20190223014736.62564d44@JRWUBU2> References: <20190222090706.16814c4e@JRWUBU2> <20190222152900.03f86c9f@JRWUBU2> <20190223014736.62564d44@JRWUBU2> Message-ID: <8732C8D5-2B6B-42F5-8AAB-8D77E95A6615@gmail.com> >>> once the USE acknowledges that subjoined consonants may follow >>> vowels >> >> I expect to update the USE spec to address this soon. > > That seems welcome news. I still don't know what the problem with > supporting them has been. USE wasn't designed to allow such a syllable structure. Tai Tham's being supported by USE is kind of an oversight. And although it's appropriate to allow conjoined consonants to follow post-base-spacing vowel signs, it's not really a trivial debate whether USE should allow conjoined consonants to non-post-base-spacing (ie, pre-base, above-base, and below-base) vowel signs, considering the ambiguity. Best, 梁海 Liang Hai https://lianghai.github.io > On Feb 23, 2019, at 09:47, Richard Wordingham via Unicode wrote: > > On Fri, 22 Feb 2019 22:19:25 +0000 > Andrew Glass wrote: > >> Thank you Richard for pointing out the issue with 0x1A7A >> I've looked into this and found an error in our tooling that has this >> mapped this to Halant. Based on the spec this should be VAbv. I've >> filed a bug. > > Thanks. Will the correction be rolled out to all Microsoft > Windows 10 customers at about the same time? I appreciate that > corporate customers may impose their own extra, internal delays - my > employer is still on Windows 7. > > In the meantime, I've updated my fonts (Da Lekh and Lamphun) to > correct the problem. However, such corrections run the risk of wrongly > deleting dotted circles that come from the backing store, and so are > not Unicode-compliant. The sooner I can remove the corrections, the > better.
> >>> Where can I find the InSc properties of characters as overridden >>> for the USE of Windows? >> USE spec includes overrides to ISC and IPC: >> https://docs.microsoft.com/en-gb/typography/script-development/use#overrides > > I had the impression there were more overrides than just those. > >>> once the USE acknowledges that subjoined consonants may follow >>> vowels >> I expect to update the USE spec to address this soon. > > That seems welcome news. I still don't know what the problem with > supporting them has been. > > Richard. From unicode at unicode.org Sat Feb 23 05:39:44 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 23 Feb 2019 11:39:44 +0000 Subject: USE Indic Syllabic Category In-Reply-To: <8732C8D5-2B6B-42F5-8AAB-8D77E95A6615@gmail.com> References: <20190222090706.16814c4e@JRWUBU2> <20190222152900.03f86c9f@JRWUBU2> <20190223014736.62564d44@JRWUBU2> <8732C8D5-2B6B-42F5-8AAB-8D77E95A6615@gmail.com> Message-ID: <20190223113944.0a636d3e@JRWUBU2> On Sat, 23 Feb 2019 14:46:27 +0800 梁海 Liang Hai via Unicode wrote: > >>> once the USE acknowledges that subjoined consonants may follow > >>> vowels > >> > >> I expect to update the USE spec to address this soon. > > > > That seems welcome news. I still don't know what the problem with > > supporting them has been. > > USE wasn't designed to allow such a syllable structure. Tai Tham's > being supported by USE is kind of an oversight. And although it's > appropriate to allow conjoined consonants to follow post-base-spacing > vowel signs, it's not really a trivial debate whether USE should > allow conjoined consonants to non-post-base-spacing (ie, pre-base, > above-base, and below-base) vowel signs, considering the ambiguity. 1. "The goal of the clustering logic is to enable what is graphically consistent with a given script's rules, rather than enforcing particular orthographic or linguistic rules.
Such considerations should be applied at another layer, such as a spelling checker." - USE Specification. There are very few cases that cannot be resolved by a spell-checker once word boundaries are resolved. Pali and Tai phonology (but Lao is TBC) conspire to keep the numbers down. 2. The UTC membership had this discussion when discussing the proposals on the Unicore list. 3. Ambiguity is often font-dependent with above- and below-base vowels, and with tone marks. Marks above are frequently positioned relative to the phonetically preceding spacing consonant element - , , and are common coda ("sakot") consonants that are spacing. In Northern Thai, is frequently and can be written with the vowel largely to the left of the subscript consonant. Apart from , Northern Thai largely avoids , preferring the minor ambiguity of, for example, being either /hu?p/ or /lu? pa?/. (These two forms are a doublet.) 4. They're explicitly noted in the TUS for the Khmer script, and I suspect they're important for Tai languages in the Khmer ('Khom') script. 5. For visual proofing, one can use colour-coding - people are welcome to copy the relevant logic from my Da Lekh Si font. Word processor support for colour distinctions is limited, but it is in place in several browsers. Most of each akshara is in the foreground colour, so it works with syntax highlighting and similar existing uses of colour-coding. 6. The Sanskrit clusters grv- and gvr- are ambiguous in several Sanskrit-capable Indic scripts. (I haven't yet had the chance to study how Sanskrit is written in Tai Tham, though I do know of one inscription.) 7. The ambiguity of and was called out when was allowed as the usual subscript of U+1A37 TAI THAM LETTER BA. 8. The biggest ambiguity issue is the use of for U+1A6C TAI THAM VOWEL SIGN OA BELOW. The USE is powerless to deal with this. I wish someone would let me in on the evidence that they are actually distinct. 9. 
There is actually a problem with CVC aksharas being wrongly encoded paradoxically because of USE's poor support for Tai Tham. HarfBuzz allows an OpenType font to shape Tai Tham text even if it does not declare support for the script. Such fonts have to do Indic rearrangement themselves, and this is generally done by means of ligatures for . Consequently, a cluster gets encoded as , as there are scores of clusters and five preposed vowels. I know it is possible to do rearrangement properly given access to GSUB; I have a Tai Tham via ASCII mode in my Da Lekh fonts, and I have to do some rearrangement to clean up after the USE. There was a brief, happy period when HarfBuzz's SEA shaping engine was available for Tai Tham, but this was deleted in favour of an implementation of the USE. There are now two bunches of Tai Tham fonts which simply don't work on Microsoft browsers - Graphite fonts and the DIY OpenType Indic rearrangers. Richard. From unicode at unicode.org Sat Feb 23 08:07:53 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 23 Feb 2019 14:07:53 +0000 Subject: USE Indic Syllabic Category In-Reply-To: <8732C8D5-2B6B-42F5-8AAB-8D77E95A6615@gmail.com> References: <20190222090706.16814c4e@JRWUBU2> <20190222152900.03f86c9f@JRWUBU2> <20190223014736.62564d44@JRWUBU2> <8732C8D5-2B6B-42F5-8AAB-8D77E95A6615@gmail.com> Message-ID: <20190223140753.0558ee92@JRWUBU2> On Sat, 23 Feb 2019 14:46:27 +0800 梁海 Liang Hai via Unicode wrote: > >>> once the USE acknowledges that subjoined consonants may follow > >>> vowels > >> > >> I expect to update the USE spec to address this soon. > > > > That seems welcome news. I still don't know what the problem with > > supporting them has been. > > USE wasn't designed to allow such a syllable structure. Tai Tham's > being supported by USE is kind of an oversight.
And although it's > appropriate to allow conjoined consonants to follow post-base-spacing > vowel signs, it's not really a trivial debate whether USE should > allow conjoined consonants to non-post-base-spacing (ie, pre-base, > above-base, and below-base) vowel signs, considering the ambiguity. What are your thoughts on the handling of 'medial consonants'? My best surmise is that the Unicode classification is intended for subscript consonants that prototypically occur between a phonetically and orthographically syllable-initial consonant and the possibly implicit vowel. Significantly, clusters of medial consonants can occur. However, I am not sure why they should be treated any differently from subscript consonants. My best hypotheses are that: 1) They can lose any segmental significance in the pronunciation of a word, e.g. being reduced to encoding features, as in Burmese. 2) Their visual positioning in the onset cluster does not relate to the phonetic order; for example, medial RA may be written before the cluster without any anchor in the vertical stack. From the prototypical behaviour, the USE has deduced the rule that a medial consonant must be followed by a vowel, albeit implicit. An implicit vowel does not count if it is removed by a virama (as opposed to a pure killer). You have suggested that the Indic Syllabic Category should reflect the structure of strings in scripts more closely. Do you agree that this deduction goes beyond the implications of the Unicode categorisation as a medial consonant? Or do you think that the Unicode concept of 'medial consonant' should be changed? My feeling is that I should report to Microsoft that the characterisation of U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA and U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA, both with InSC=Consonant_Medial, as medial consonants, is wrong for the USE. There are three ways that these signs fail to correspond to the USE's model of a medial consonant: 1.
The Tai Tham sequences and can act as vowels in Tai Tham languages. 2. The implicit vowel following them can be silenced. Now normally this should not be a problem, for the vowel killers are categorised as 'pure_killer' (U+1A7A) and 'syllable_modifier' (U+1A7C). The potential issue revealed itself when U+1A7A was mistagged as 'halant', implying 'virama'. 3. MEDIAL RA can precede a resonant consonant, as in ?????? /t?an??m/ (MFL Rev 1 p269). Richard. From unicode at unicode.org Sun Feb 24 07:28:15 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 24 Feb 2019 13:28:15 +0000 Subject: USE Indic Syllabic Category In-Reply-To: <8732C8D5-2B6B-42F5-8AAB-8D77E95A6615@gmail.com> References: <20190222090706.16814c4e@JRWUBU2> <20190222152900.03f86c9f@JRWUBU2> <20190223014736.62564d44@JRWUBU2> <8732C8D5-2B6B-42F5-8AAB-8D77E95A6615@gmail.com> Message-ID: <20190224132815.00b3cb86@JRWUBU2> On Sat, 23 Feb 2019 14:46:27 +0800 梁海 Liang Hai via Unicode wrote: > USE wasn't designed to allow such a syllable structure. Tai Tham's > being supported by USE is kind of an oversight. And although it's > appropriate to allow conjoined consonants to follow post-base-spacing > vowel signs, There's a quick hack there. As U+1A63 TAI THAM VOWEL SIGN AA and U+1A64 TAI THAM VOWEL SIGN TALL AA start grapheme clusters, just promote them to BASE. It also solves the problem of tone mark placement. It does postpone the handling of the ligature to after the dissolution of syllable boundaries, which could force unwelcome changes in a Pali-only Tai Tham font, if such exist. At least one font has an extensive set of ligatures for the sequences . I have to handle the ligature after the dissolution because of the syllable boundary in . A quick hack for the likes of Tai Lü ?????? /p?i va?/ 'because' may be more troublesome even if one omits U+1A7B TAI THAM SIGN MAI SAM.
You probably won't like it anyway, because a good rendering looks more like the nonsense words /pv?i pa?/ or /pv?i pva?/. (I think the cluster /pv/ does not exist in any form in Tai Lü, and that would rule it out.) Richard. From unicode at unicode.org Thu Feb 28 15:03:21 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 28 Feb 2019 14:03:21 -0700 Subject: Unicode CLDR 35 alpha available for testing Message-ID: <20190228140321.665a7a7059d7ee80bb4d670165c8327d.eb8137361b.wbe@email03.godaddy.com> announcements at unicode.org wrote: > The alpha version of Unicode CLDR 35 > is available for > testing. No downloadable data files in the sense of released builds, correct? -- Doug Ewell | Thornton, CO, US | ewellic.org