From kenwhistler at sonic.net Thu Oct 1 09:27:11 2020 From: kenwhistler at sonic.net (Ken Whistler) Date: Thu, 1 Oct 2020 07:27:11 -0700 Subject: Please fix the trademark policy in regards to code In-Reply-To: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja> References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja> Message-ID: <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net> References to "unicode" in code and related files and libraries would generally be considered just a functional "fair use" reference, and nothing more. Accordingly, the Unicode Consortium would not take exception to such a use, nor would it require the use of the trademark symbol in code, for the reasons you state. We don't cover this in the Trademark Usage Policy because it is simply not necessary to do so - covering every possible manner of fair use reference would make the policy overly long and complex. --Ken Whistler, Technical Director, Unicode, Inc. On 9/29/2020 11:35 PM, Ellie via Unicode wrote: > if I am reading the trademark policy correctly I might be required to > rename my "unicode.c" source code file to "Unicode® implementation.c" or > some similar ugliness (in my humble opinion) to satisfy the "Trademark > Usage Policy", because it seems any sort of exception for source code > was left out. Not only does this not fit well with how I see many people > name their code files, but also special symbols can cause issues in > archives/tarballs when sharing the code. Furthermore, it seems like I > would need to add the ® into my variable names as well, even if the > language/compiler in question doesn't even support unicode characters, > and uppercase the U even if that doesn't fit with any of the coding style. From ra_hardy at hotmail.com Thu Oct 1 07:46:38 2020 From: ra_hardy at hotmail.com (ra_hardy at hotmail.com) Date: Thu, 01 Oct 2020 13:46:38 +0100 Subject: Teletext separated mosaic graphics Message-ID: An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Thu Oct 1 13:26:35 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 1 Oct 2020 19:26:35 +0100 Subject: Please fix the trademark policy in regards to code In-Reply-To: <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net> References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja> <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net> Message-ID: <20201001192635.2cccbb7c@JRWUBU2> On Thu, 1 Oct 2020 07:27:11 -0700 Ken Whistler via Unicode wrote: > References to "unicode" in code and related files and libraries would > generally be considered just a functional "fair use" reference, and > nothing more. Accordingly, the Unicode Consortium would not take > exception to such a use, nor would it require the use of the > trademark symbol in code, for the reasons you state. I don't see any evidence of the Unicode Consortium going after Wikipedia either, despite articles simply entitled "Unicode", or even after respellings in other scripts. As for https://la.wikipedia.org/wiki/Unicodex..., that even uses the genitive singular for "Signum Unicodicis" - isn't that required to be "Signum Unicodex" if "Unicodex" is pukka Latin? Richard. (lingua in bucca posita) From wjgo_10009 at btinternet.com Thu Oct 1 12:44:17 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 1 Oct 2020 18:44:17 +0100 (BST) Subject: Teletext separated mosaic graphics In-Reply-To: References: Message-ID: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> The 1976 Teletext Specification has three meanings for sixty-four of the character code points - lowercase letters and a few others, contiguous graphics, separated graphics. The Unicode Standard at present has the "lowercase letters and a few others" encoded and the "contiguous graphics" encoded separately, although, alas, all sixty-four contiguous graphic characters are not encoded as one block. 
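[Archive editor's note: the arithmetic behind that non-contiguous layout can be sketched in a few lines. This is only an illustration based on a reading of the Symbols for Legacy Computing code charts; the function name is invented here. Four of the sixty-four patterns reuse pre-existing block element characters, which is exactly why the contiguous graphics do not sit in one unbroken run of code points.]

```python
# Map a 6-bit mosaic pattern (bit 0 = upper-left cell ... bit 5 =
# lower-right cell) to the character that renders it. Four patterns
# are absent from the Symbols for Legacy Computing block because
# earlier characters already draw them, so the remaining sixty
# sextants need two gap corrections.

REUSED = {
    0b000000: ' ',       # no cells set: an ordinary space
    0b010101: '\u258C',  # left column only: LEFT HALF BLOCK
    0b101010: '\u2590',  # right column only: RIGHT HALF BLOCK
    0b111111: '\u2588',  # all six cells: FULL BLOCK
}

def sextant(pattern: int) -> str:
    """Return the character for a 6-bit contiguous mosaic pattern."""
    if pattern in REUSED:
        return REUSED[pattern]
    cp = 0x1FB00 + pattern - 1   # U+1FB00 is BLOCK SEXTANT-1
    cp -= pattern > 0b010101     # skip the LEFT HALF BLOCK gap
    cp -= pattern > 0b101010     # skip the RIGHT HALF BLOCK gap
    return chr(cp)
```

[All sixty-four patterns come out distinct under this mapping, with the last sextant landing on U+1FB3B.]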
My opinion is that that one-to-one directly mapped approach would have been preferable, but the situation is as it is. The twenty-seven teletext control characters have not been encoded at this time. I opine that these twenty-seven codes could be encoded within a block of thirty-two code points as characters that display as visual glyphs in most circumstances, yet are control codes in teletext apps. For example, Alphanumerics Green would have a visible glyph of an A above a G on a pale. That way, teletext pages from long ago and new designs could be recorded elegantly and conserved as the control codes in the teletext page would not conflict with the usual control codes of computing. If those twenty-seven teletext control characters were encoded separately, would that help in developing your app, or are you using a different approach? William Overington Thursday 1 October 2020 http://www.users.globalnet.co.uk/~ngo/ ------ Original Message ------ From: "Rob H via Unicode" To: unicode at unicode.org Sent: Thursday, 2020 Oct 1 At 13:46 Subject: Teletext separated mosaic graphics Hi, I've started to develop a teletext app and plan to use the recently added graphic mosaic characters in the legacy computing block (the sextets). I see that Unicode includes the contiguous mosaics characters and not the separated form of those characters. I'm wondering if that was intentional? On one hand, that matches the teletext spec, which has one set of byte codes for the graphics, and uses control codes to switch between contiguous or separated. On the other hand it means I'll need to use styling tricks or a different font or glyph variations to recreate the separated graphics. It also means a simple text-only file of just the characters won't recreate a screen as the control codes to switch between contiguous/separated won't work. 
A font I've found which maps these characters uses the new codepoints for the contiguous graphics and private codepoints for separated, which seems awkward to me. If having just the contiguous graphics was intentional, that's fine, but I just wanted to check. Regards, Rob. -------------- next part -------------- An HTML attachment was scrubbed... URL: From harjitmoe at outlook.com Thu Oct 1 15:17:04 2020 From: harjitmoe at outlook.com (Harriet Riddle) Date: Thu, 1 Oct 2020 20:17:04 +0000 Subject: Teletext separated mosaic graphics In-Reply-To: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> References: , <42717374.14f0.174e543c410.Webtop.49@btinternet.com> Message-ID: It's worth pointing out that the control codes for showing mosaic characters as separated are also used in at least some formats to switch alphabetical characters to underlined display. See for example the definitions for SPL and STL here: https://www.itscj.ipsj.or.jp/iso-ir/056.pdf (that document details the C1 control codes for Data Syntax 2 Serial Videotex - which would seem to be the Teletext set but as a C1 set, and as such with CSI rather than ESC). Essentially, the expectation seems to be that an emphasised variant of a font would display mosaic characters separated, while a regular variant of a font would display them connected. --Har. ________________________________ From: Unicode on behalf of William_J_G Overington via Unicode Sent: 01 October 2020 18:44 To: unicode at unicode.org Subject: Re: Teletext separated mosaic graphics The 1976 Teletext Specification has three meanings for sixty-four of the character code points - lowercase letters and a few others, contiguous graphics, separated graphics. The Unicode Standard at present has the "lowercase letters and a few others" encoded and the "contiguous graphics" encoded separately, although, alas, all sixty-four contiguous graphic characters are not encoded as one block. 
My opinion is that that one-to-one directly mapped approach would have been preferable, but the situation is as it is. The twenty-seven teletext control characters have not been encoded at this time. I opine that these twenty-seven codes could be encoded within a block of thirty-two code points as characters that display as visual glyphs in most circumstances, yet are control codes in teletext apps. For example, Alphanumerics Green would have a visible glyph of an A above a G on a pale. That way, teletext pages from long ago and new designs could be recorded elegantly and conserved as the control codes in the teletext page would not conflict with the usual control codes of computing. If those twenty-seven teletext control characters were encoded separately, would that help in developing your app, or are you using a different approach? William Overington Thursday 1 October 2020 http://www.users.globalnet.co.uk/~ngo/ ------ Original Message ------ From: "Rob H via Unicode" To: unicode at unicode.org Sent: Thursday, 2020 Oct 1 At 13:46 Subject: Teletext separated mosaic graphics Hi, I've started to develop a teletext app and plan to use the recently added graphic mosaic characters in the legacy computing block (the sextets). I see that Unicode includes the contiguous mosaics characters and not the separated form of those characters. I'm wondering if that was intentional? On one hand, that matches the teletext spec, which has one set of byte codes for the graphics, and uses control codes to switch between contiguous or separated. On the other hand it means I'll need to use styling tricks or a different font or glyph variations to recreate the separated graphics. It also means a simple text-only file of just the characters won't recreate a screen as the control codes to switch between contiguous/separated won't work. 
A font I've found which maps these characters uses the new codepoints for the contiguous graphics and a private codepoints for separated, which seems awkward to me. If having just the contiguous graphics was intentional, that's fine, but I just wanted to check. Regards, Rob. -------------- next part -------------- An HTML attachment was scrubbed... URL: From beckiergb at gmail.com Thu Oct 1 16:32:41 2020 From: beckiergb at gmail.com (Rebecca Bettencourt) Date: Thu, 1 Oct 2020 14:32:41 -0700 Subject: Teletext separated mosaic graphics In-Reply-To: References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> Message-ID: Separated mosaic graphics were intentionally not proposed in the first Symbols for Legacy Proposal because it was believed at the time that it would be possible for applications to support separated graphics using a higher-level protocol. Since then: 1.) we have received feedback such as yours suggesting that this is easier said than done 2.) we have found an existing private-use encoding that encodes contiguous and separated graphics separately (possibly the one used by the font you found) 3.) we have also found a legacy character set that encodes contiguous and separated *2x2* block graphics separately For these reasons we will be proposing the separated graphics in a second proposal, and hopefully these reasons are enough for the UTC to approve them. However it will be several years before they appear in the Standard, if approved. -- Rebecca Bettencourt On Thu, Oct 1, 2020 at 1:19 PM Harriet Riddle via Unicode < unicode at unicode.org> wrote: > It's worth pointing out that the control codes for showing mosaic > characters as separated are also used in at least some formats to switch > alphabetical characters to underlined display. 
> > See for example the definitions for SPL and STL here: > https://www.itscj.ipsj.or.jp/iso-ir/056.pdf (that document details the C1 > control codes for Data Syntax 2 Serial Videotex?which would seem to be the > Teletext set but as a C1 set, and as such with CSI rather than ESC). > > Essentially, the expectation seems to be that an emphasised variant of a > font would display mosaic characters separated, while a regular variant of > a font would display them connected. > > --Har. > > ------------------------------ > *From:* Unicode on behalf of William_J_G > Overington via Unicode > *Sent:* 01 October 2020 18:44 > *To:* unicode at unicode.org > *Subject:* Re: Teletext separated mosaic graphics > > > The 1976 Teletext Specification has three meanings for sixty-four of the > character code points - lowercase letters and a few others, contiguous > graphics, separated graphics. > > > The Unicode Standard at present has the "lowercase letters and a few > others" encoded and the "contiguous graphics" encoded separately, although, > alas, all sixty-four contiguous graphic characters are not encoded as one > block. My opinion is that that one-to-one directly mapped approach would > have been preferable, but the situation is as it is. > > The twenty-seven teletext control characters have not been encoded at this > time. > > > I opine that these twenty-seven codes could be encoded within a block of > thirty-two code points as characters that display as visual glyphs in most > circumstances, yet are control codes in teletext apps. > > > For example, Alphanumerics Green would have a visible glyph of an A above > a G on a pale. > > > That way, teletext pages from long ago and new designs could be recorded > elegantly and conserved as the control codes in the teletext page would not > conflict with the usual control codes of computing. 
> > > If those twenty-seven teletext control characters were encoded separately, > would that help in developing your app, or are you using a different > approach? > > > William Overington > > > Thursday 1 October 2020 > > > http://www.users.globalnet.co.uk/~ngo/ > > > > > > ------ Original Message ------ > From: "Rob H via Unicode" > To: unicode at unicode.org > Sent: Thursday, 2020 Oct 1 At 13:46 > Subject: Teletext separated mosaic graphics > > Hi, > > I've started to develop a teletext app and plan to use the recently added > graphic mosaic characters in the legacy computing block (the sextets). I > see that Unicode includes the contiguous mosaics characters and not the > separated form of those characters. I'm wondering if that was intentional? > On one hand, that matches the teletext spec, which has one set of byte > codes for the graphics, and uses control codes to switch between contiguous > or separated. On the other hand it means I'll need to use styling tricks or > a different font or glyph variations to recreate the separated graphics. It > also means a simple text-only file of just the characters won't recreate a > screen as the control codes to switch between contiguous/separated won't > work. > > A font I've found which maps these characters uses the new codepoints for > the contiguous graphics and a private codepoints for separated, which seems > awkward to me. > > If having just the contiguous graphics was intentional, that's fine, but I > just wanted to check. > > Regards, > Rob. > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Sat Oct 3 13:54:56 2020 From: doug at ewellic.org (Doug Ewell) Date: Sat, 3 Oct 2020 12:54:56 -0600 Subject: Teletext separated mosaic graphics In-Reply-To: References: , <42717374.14f0.174e543c410.Webtop.49@btinternet.com> Message-ID: <000001d699b6$b02f5d40$108e17c0$@ewellic.org> Harriet Riddle wrote: > It's worth pointing out that the control codes for showing mosaic > characters as separated are also used in at least some formats to > switch alphabetical characters to underlined display. > > See for example the definitions for SPL and STL here: > https://www.itscj.ipsj.or.jp/iso-ir/056.pdf (that document details the > C1 control codes for Data Syntax 2 Serial Videotex?which would seem to > be the Teletext set but as a C1 set, and as such with CSI rather than > ESC). Applications of any sort that are compliant with ISO/IEC 6429 (ECMA-48, ANSI X3.64) should understand ESC [ as a synonym for CSI. > Essentially, the expectation seems to be that an emphasised variant of > a font would display mosaic characters separated, while a regular > variant of a font would display them connected. We still haven't written the Technical Note for using the Legacy Symbols -- that's largely on me -- but as far as teletext is concerned, the recommended practice is to translate teletext control codes directly onto the Basic Latin space. For example: - "contiguous graphics" becomes U+0019 - "separated graphics" becomes U+001A - "double height" becomes U+000D - "end box" becomes U+000A There is no conflict with the normal meanings of U+000D and U+000A because teletext does not use these to separate lines. In general, a teletext application should treat control codes the way teletext would treat them, and should not try to mix C0 and teletext interpretations. This also means Rob's scenario: > It also means a simple text-only file of just the characters won't > recreate a screen as the control codes to switch between contiguous/ > separated won't work. 
may not be well-conceived; the file should probably not be "text" in the sense that its lines end with some combination of CR and/or LF, unless there is an intermediate translation step. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From kent.b.karlsson at bahnhof.se Sat Oct 3 19:25:30 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Sun, 4 Oct 2020 02:25:30 +0200 Subject: Teletext separated mosaic graphics In-Reply-To: <000001d699b6$b02f5d40$108e17c0$@ewellic.org> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> Message-ID: <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> > 3 okt. 2020 kl. 20:54 skrev Doug Ewell via Unicode : > > Harriet Riddle wrote: > >> It's worth pointing out that the control codes for showing mosaic >> characters as separated are also used in at least some formats to >> switch alphabetical characters to underlined display. >> >> See for example the definitions for SPL and STL here: >> https://www.itscj.ipsj.or.jp/iso-ir/056.pdf (that document details the >> C1 control codes for Data Syntax 2 Serial Videotex?which would seem to >> be the Teletext set but as a C1 set, and as such with CSI rather than >> ESC). > > Applications of any sort that are compliant with ISO/IEC 6429 (ECMA-48, ANSI X3.64) should understand ESC [ as a synonym for CSI. Teletext is not compliant with ECMA-48 (unless converted). >> Essentially, the expectation seems to be that an emphasised variant of >> a font would display mosaic characters separated, while a regular >> variant of a font would display them connected. > > We still haven't written the Technical Note for using the Legacy Symbols -- that's largely on me -- but as far as teletext is concerned, the recommended practice is to translate teletext control codes directly onto the Basic Latin space. 
For example: > > - "contiguous graphics" becomes U+0019 > - "separated graphics" becomes U+001A > - "double height" becomes U+000D > - "end box" becomes U+000A That would be an extremely bad idea (as well as being completely non-compliant with ECMA-48, if that is still the approach, as I think it should be). > There is no conflict with the normal meanings of U+000D and U+000A because teletext does not use these to separate lines. I don't know how Teletext is represented in DVB or IP-TV; but those digital representations of TV images do not use traditional "analog" representation of TV images, and hence cannot have the "analog" representation of "rows" (lines) of text in Teletext. (And yes, Teletext does work fine with IP-TV.) Note also that Teletext is rife with "code page switching". ESC toggles between a primary and a secondary charset (for text). In a control part of the Teletext protocol one sets the charsets for text (options include various "national variants" of ISO/IEC 646, as well as Greek, Hebrew and Arabic (visual order, preshaped)). Toggling between separated and contiguous "mosaics" is also best seen as a switch between charsets. Regarding it as a styling is odd, since this particular styling would only apply to a few very rarely used characters, and the change is not one that is recognized as styling elsewhere. In addition, you have already encoded separated and contiguous other but similar "mosaics" characters as separate characters. Even the colour controls in Teletext switch between text and mosaics (and in addition are usually displayed as a space, as is the norm in Teletext for "control" characters). Part of the Teletext protocol specifies how to set/unset bold/italic/underline. But that is not inline in the text, it is "out-of-line" elsewhere in the protocol (in a control part). But colouring, certain sizing, blink, conceal, and "boxing" (used for (optional) subtitling and news flash messages) are inline. 
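[Archive editor's note: as it happens, the inline colour part of such a conversion is nearly mechanical, because the teletext colour order (red, green, yellow, blue, magenta, cyan, white) coincides with the ECMA-48 SGR order. A minimal sketch follows; the function name is invented, and it deliberately ignores everything except foreground colour.]

```python
# Sketch: translate a teletext inline colour control byte into an
# ECMA-48 SGR (Select Graphic Rendition) control sequence.
# Alphanumeric colours sit at 0x01-0x07 and mosaic colours at
# 0x11-0x17; both share their low three bits with SGR parameters
# 30-37, so the conversion is pure arithmetic.

CSI = '\x1b['  # 7-bit form of CONTROL SEQUENCE INTRODUCER (ESC [)

def colour_to_sgr(code: int) -> str:
    """Return the SGR sequence for a teletext colour control byte."""
    if not (0x00 <= code <= 0x07 or 0x10 <= code <= 0x17):
        raise ValueError(f'not a teletext colour control: {code:#04x}')
    return f'{CSI}{30 + (code & 0x07)}m'
```

[This drops the text/mosaic mode switch and the displayed space that Kent mentions; a real converter would have to track both.]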
Note that Teletext is still often used for subtitling. Most of Teletext styling can be converted to ECMA-48 styling as is. Some others will need an extension of ECMA-48 to be representable in that framework. Teletext these days is often displayed in things that are not analog (or even digital) TVs; you can find web pages displaying Teletext texts, as well as mobile phone (or tablet) apps that display Teletext texts. They need not all convert those pages to an image before displaying them in the web page/app... (Though one may want to have a partial conversion to HTML rather than to ECMA-48; but for HTML that would not handle "box" (at all) nor blink (since that is deprecated in HTML), ...) /Kent K > In general, a teletext application should treat control codes the way teletext would treat them, and should not try to mix C0 and teletext interpretations. This also means Rob's scenario: > >> It also means a simple text-only file of just the characters won't >> recreate a screen as the control codes to switch between contiguous/ >> separated won't work. > > may not be well-conceived; the file should probably not be "text" in the sense that its lines end with some combination of CR and/or LF, unless there is an intermediate translation step. > > -- > Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Sun Oct 4 19:07:26 2020 From: doug at ewellic.org (Doug Ewell) Date: Sun, 4 Oct 2020 18:07:26 -0600 Subject: Teletext separated mosaic graphics In-Reply-To: <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> Message-ID: <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> Kent Karlsson wrote: >>> See for example the definitions for SPL and STL here: >>> https://www.itscj.ipsj.or.jp/iso-ir/056.pdf (that document details >>> the C1 control codes for Data Syntax 2 Serial Videotex - which would >>> seem to be the Teletext set but as a C1 set, and as such with CSI >>> rather than ESC). >> >> Applications of any sort that are compliant with ISO/IEC 6429 >> (ECMA-48, ANSI X3.64) should understand ESC [ as a synonym for CSI. > > Teletext is not compliant with ECMA-48 (unless converted). You're right, and I had sort of said that farther down. I didn't read the definitions or Harriet's synopsis carefully enough, and misinterpreted the reference to "CSI rather than ESC." The UK Videotex control codes are single bytes in the ECMA-35 C1 space, and can be adapted for 7-bit systems to ESC plus a corresponding value in the G0 space; but that does not make the system compliant with ECMA-48, and indeed it is not. >> - "contiguous graphics" becomes U+0019 >> - "separated graphics" becomes U+001A >> - "double height" becomes U+000D >> - "end box" becomes U+000A > > That would be an extremely bad idea (as well as being completely non- > compliant with ECMA-48, if that is still the approach, as I think it > should be). As you just said, correctly, teletext is not compliant with ECMA-48. UTC has confirmed it will not add more control characters for backward compatibility purposes like this. 
(I don't think there is a promise not to encode more completely novel control characters, such as for hieroglyphics, but that is not the question here.) We all know there is no such thing in Unicode as a "hybrid" character that is sometimes a control character and sometimes a graphic character in normal use. We know that Unicode has defined fixed meanings for a subset of the C0 control characters, including CR and LF. But a teletext application for a modern computer is not "normal use." It is reasonable for a non-standard application like this to interpret characters from U+0000 to U+001F as the corresponding ISO 646 characters would be in teletext. It is, frankly, the only choice. > I don?t know how Teletext is represented in DVB or IP-TV; but those > digital representations of TV images do not use traditional ?analog? > representation of TV images, and hence cannot have the ?analog? > representation of ?rows? (lines) of text in Teletext. (And yes, > Teletext does work fine with IP-TV.) Rows in teletext are defined in a completely different way from the now-standard model of a continuous stream of characters that are delimited by a sequence of one or more "end-of-line" control characters. The teletext row model is more akin to the fixed-length model from the punch-card and tape era. > Note also that Teletext is rife with ?code page switching?. ESC > toggles between a primary and a secondary charset (for text). In a > control part of the Teletext protocol one sets the charsets for text > (options include various ?national variants? of ISO/IEC 646, as well > as Greek, Hebrew and Arabic (visual order, preshaped). A teletext application would probably be expected to implement that as well. > Toggling between separated and contiguous ?mosaics? is also best seen > as a switch between charsets. Which is why we did not propose the separated mosaics in Round 1, and Script Ad-Hoc and UTC agreed. 
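[Archive editor's note: a minimal sketch of that interpretation, purely illustrative. Only the alphanumeric/mosaic colour switch is modelled, the byte values follow a reading of the Level 1 teletext spec, and the helper names are invented. It shows how a dedicated decoder can give 0x00-0x1F their teletext semantics while emitting ordinary Unicode.]

```python
# Decode one teletext row into displayable text, giving bytes
# 0x00-0x1F their teletext meanings rather than their C0 meanings.
# Controls render as a space (the teletext norm); colour controls
# also switch between alphanumeric and mosaic mode; mosaic bytes
# become sextants from the Symbols for Legacy Computing block.

def sextant(pattern: int) -> str:
    """6-bit cell pattern -> sextant (or reused block element)."""
    reused = {0: ' ', 0b010101: '\u258C',
              0b101010: '\u2590', 0b111111: '\u2588'}
    if pattern in reused:
        return reused[pattern]
    return chr(0x1FB00 + pattern - 1
               - (pattern > 0b010101) - (pattern > 0b101010))

def decode_row(row: bytes) -> str:
    out = []
    graphics = False
    for b in row:
        if b < 0x20:                      # teletext control byte
            if 0x01 <= b <= 0x07:         # alphanumeric colour
                graphics = False
            elif 0x11 <= b <= 0x17:       # mosaic colour
                graphics = True
            out.append(' ')               # controls render as a space
        elif graphics and (0x20 <= b <= 0x3F or 0x60 <= b <= 0x7F):
            # mosaic byte: bit 6 of the byte carries sextant bit 5
            out.append(sextant((b & 0x1F) | ((b & 0x40) >> 1)))
        else:
            # text passes through; capitals "blast through" even in
            # mosaic mode, matching teletext behaviour
            out.append(chr(b))
    return ''.join(out)
```

[Contiguous/separated switching would be one more state flag here, which is exactly the higher-level-protocol question under discussion.]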
> Regarding it as a styling is odd, since this particular styling would > only apply to a few very rarely used characters, and the change is not > one that is recognized as styling elsewhere. In addition, you have > already encoded separated and contiguous other but similar ?mosaics? > characters as separate characters. We tried to be as consistent as possible with the Legacy Symbols proposal, and to propose things separately only where some legacy platform encoded them separately, not just with a mode shift or by masking the code point with 0x80. There may be imperfections in the model, based on what SAH did and did not approve. > Even the colour controls in Teletext switch between text and mosaics > (and in addition are usually displayed as a space, as is the norm in > Teletext for ?control? characters). That is certainly behavior that a teletext application should emulate. > Part of the Teletext protocol specifies how to set/unset bold/italic/ > underline. But that is not inline in the text, it is ?out-of-line? > elsewhere in the protocol (in a control part). But colouring, certain > sizing, blink, conceal, and ?boxing? (used for (optional) subtitling > and news flash messages) are inline. Note that Teletext is still often > used for subtitling. Another reason why it is probably not appropriate to try to represent teletext in a plain-text file. You can certainly convert it to a plain-text file, with ECMA-48 sequences for styling and lines ending in CR and/or LF, but then it is no longer "teletext data" but a conversion. > Most of Teletext styling can be converted to ECMA-48 styling as is. > Some others will need an extension of ECMA-48 to be representable in > that framework. I read with interest your proposal last year to update ECMA-48. I think the proposed extensions and clarifications had a better chance of adoption than the suggestions to change existing functionality outright. 
I am curious about the current status of that proposal; was it submitted anywhere? -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From ra_hardy at hotmail.com Mon Oct 5 07:24:59 2020 From: ra_hardy at hotmail.com (Rob Hardy) Date: Mon, 5 Oct 2020 12:24:59 +0000 Subject: Teletext separated mosaic graphics In-Reply-To: <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se>, <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> Message-ID: Thanks for the replies. When I mentioned the text file scenario, I was just thinking about a copy and paste from my application into some other app, so it's not really the main scenario. I'm actually using SVG/CSS to draw the page, which includes flashing and a 'press reveal' function for concealed characters that a text file obviously wouldn't recreate. As mentioned, the font I found (UNSCII) uses private use characters for the separated mosaics, so I'll use those for now. (Another option would have been to draw a mask over the contiguous mosaics, but then that's not far removed from drawing the mosaics as SVG shapes instead of text). Rob. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwidion at gmail.com Mon Oct 5 09:05:47 2020 From: gwidion at gmail.com (Joao S. O. 
Bueno) Date: Mon, 5 Oct 2020 11:05:47 -0300 Subject: Teletext separated mosaic graphics In-Reply-To: References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> Message-ID: I know I am going off your topic here, and I apologize: But for backwards bug-for-bug compatibility, I wonder how these characters can be of any use as separate characters: their usefulness lies exactly in being able to use 1/6 character blocks as pixels in a contiguous image laid-out with characters. I mean, that even for an application that should behave like a legacy application, would not its visuals be improved by contiguous mosaics? Do you have any example were separate mosaics looks better, or do you need it just to achieve the same look and feel? I emphasize I am just asking this out of curiosity. Regards, js -><- Do you have On Mon, 5 Oct 2020 at 09:27, Rob Hardy via Unicode wrote: > Thanks for the replies. > > When I mentioned the text file scenario, I was just thinking about a copy > and paste from my application into some other app, so it's not really the > main scenario. I'm actually using SVG/CSS to draw the page, which includes > flashing and a 'press reveal' function for concealed characters that a text > file obviously wouldn't recreate. > > As mentioned, the font I found (UNSCII) uses private use characters for > the separated mosaics, so I'll use those for now. (Another option would > have been to draw a mask over the contiguous mosaics, but then that's not > far removed from drawing the mosaics as SVG shapes instead of text). > > Rob. > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ra_hardy at hotmail.com Mon Oct 5 09:24:26 2020 From: ra_hardy at hotmail.com (ra_hardy at hotmail.com) Date: Mon, 05 Oct 2020 15:24:26 +0100 Subject: Teletext separated mosaic graphics In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Oct 5 07:25:56 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 5 Oct 2020 13:25:56 +0100 (BST) Subject: Teletext separated mosaic graphics In-Reply-To: <000001d699b6$b02f5d40$108e17c0$@ewellic.org> References: , <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> Message-ID: Doug Ewell wrote: > UTC has confirmed it will not add more control characters for backward > compatibility purposes like this. (I don't think there is a promise > not to encode more completely novel control characters, such as for > hieroglyphics, but that is not the question here.) Well, at one stage the Unicode Technical Committee decided not to encode emoji then later changed its mind, so changes can be made if later consideration is assessed as justifying that change. That does not mean that the Unicode Technical Committee will necessarily change its mind, it just means that it could change its mind if it so chooses. > We all know there is no such thing in Unicode as a "hybrid" character > that is sometimes a control character and sometimes a graphic > character in normal use. We know that Unicode has defined fixed > meanings for a subset of the C0 control characters, including CR and > LF. But a teletext application for a modern computer is not "normal > use." It is reasonable for a non-standard application like this to > interpret characters from U+0000 to U+001F as the corresponding ISO > 646 characters would be in teletext. It is, frankly, the only choice. There is a choice. 
If the teletext control codes are encoded as if ordinary displayable characters with a note saying that they may, but need not, be interpreted as teletext control characters then it would be possible to have teletext-aware web browsers and teletext-aware PDF readers and so on where a teletext page could be included within a document using a plain text representation. William Overington Monday 5 October 2020 -------------- next part -------------- An HTML attachment was scrubbed... URL: From billposer2 at gmail.com Tue Oct 6 18:11:07 2020 From: billposer2 at gmail.com (Bill Poser) Date: Tue, 6 Oct 2020 16:11:07 -0700 Subject: Please fix the trademark policy in regards to code In-Reply-To: <20201001192635.2cccbb7c@JRWUBU2> References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja> <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net> <20201001192635.2cccbb7c@JRWUBU2> Message-ID: It is not the case that any use of a trademark constitutes infringement or requires permission. Trademark owners sometimes try to persuade people that this is so and issue rules as to how their trademark should be used, but they're blowing smoke. A trademark is only infringed when there is a risk of confusion. I can't name my beverage "Coca-Cola" or something similar like "Cokacola", because there is a risk that consumers will confuse it with the original product. I can write an article about Coca-Cola, and refer to it by that name as much as I like, because I am using the trademark to identify the product with which the owner has associated it. I am not creating any risk of confusion in the eye of the consumer. Similarly, the Unicode Consortium has the right to block the use of "Unicode" in reference to some competing encoding standard, but it has no right to block the use of the term in the title or text of articles about the Unicode standard. 
On Thu, Oct 1, 2020 at 11:28 AM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Thu, 1 Oct 2020 07:27:11 -0700 > Ken Whistler via Unicode wrote: > > > References to "unicode" in code and related files and libraries would > > generally be considered just a functional "fair use" reference, and > > nothing more. Accordingly, the Unicode Consortium would not take > > exception to such a use, nor would it require the use of the > > trademark symbol in code, for the reasons you state. > > I don't see any evidence of the Unicode Consortium going after > Wikipedia either, despite articles simply entitled "Unicode", or even > after respellings in other scripts. As for > https://la.wikipedia.org/wiki/Unicodex..., that even uses the genitive > singular for "Signum Unicodicis" - isn't that required to be "Signum > Unicodex" if "Unicodex" is pukka Latin? > > Richard. (lingua in bucca posita) > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kent.b.karlsson at bahnhof.se Tue Oct 6 18:11:56 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 7 Oct 2020 01:11:56 +0200 Subject: Teletext separated mosaic graphics In-Reply-To: <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> Message-ID: <3AC7C2C9-9B55-4E9B-81EA-D15D7C0D021E@bahnhof.se> > On 5 Oct 2020, at 02:07, Doug Ewell via Unicode wrote: > > Kent Karlsson wrote: > >>>> See for example the definitions for SPL and STL here: >>>> https://www.itscj.ipsj.or.jp/iso-ir/056.pdf (that document details >>>> the C1 control codes for Data Syntax 2 Serial Videotex, which would >>>> seem to be the Teletext set but as a C1 set, and as such with CSI >>>> rather than ESC). 
>>> >>> Applications of any sort that are compliant with ISO/IEC 6429 >>> (ECMA-48, ANSI X3.64) should understand ESC [ as a synonym for CSI. >> >> Teletext is not compliant with ECMA-48 (unless converted). > > You're right, and I had sort of said that farther down. I didn't read the definitions or Harriet's synopsis carefully enough, and misinterpreted the reference to "CSI rather than ESC." > > The UK Videotex And I'm talking about the current ETSI EN 300 706 V1.2.1 (2003-04), Enhanced Teletext specification, https://www.etsi.org/deliver/etsi_en/300700_300799/300706/01.02.01_60/en_300706v010201p.pdf. That seems to be the latest version, and is, AFAICT, implemented in all(?) TV sets and "TV boxes", sold, I would think, worldwide. I also just found "Digital Video Broadcasting (DVB); Specification for conveying ITU-R System B Teletext in DVB bitstreams" (ETSI EN 300 472 V1.4.1 (2017-04), https://www.etsi.org/deliver/etsi_en/300400_300499/300472/01.04.01_60/en_300472v010401p.pdf). (I haven't scanned through it yet.) > control codes are single bytes in the ECMA-35 C1 space, and can be adapted for 7-bit systems to ESC plus a corresponding value in the G0 space; but that does not make the system compliant with ECMA-48, and indeed it is not. > >>> - "contiguous graphics" becomes U+0019 >>> - "separated graphics" becomes U+001A >>> - "double height" becomes U+000D >>> - "end box" becomes U+000A >> >> That would be an extremely bad idea (as well as being completely non- >> compliant with ECMA-48, if that is still the approach, as I think it >> should be). > > As you just said, correctly, teletext is not compliant with ECMA-48. > > UTC has confirmed it will not add more control characters for backward compatibility purposes like this. And these controls are not good anyway. They do three things in one go (i.e. per "control" code): 1. Change charset (most of them) 2. Change color (most of them) 3. Display as a SPACE (or as a "mosaic character", if "hold mosaics" 
is active) I wouldn't even think of proposing, or even perpetuating, this kind of thing. They are horrendous! In addition, all of them can be overridden by formatting (and character substitutions) in control "objects" given in the Teletext protocol. In addition, the Teletext protocol allows for "user defined" fonts (called DRCS in the Teletext specification). Converting those (and their use) is a different headache... > (I don't think there is a promise not to encode more completely novel control characters, such as for hieroglyphics, but that is not the question here.) > > We all know there is no such thing in Unicode as a "hybrid" character that is sometimes a control character and sometimes a graphic character in normal use. We know that Unicode has defined fixed meanings for a subset of the C0 control characters, including CR and LF. But a teletext application for a modern computer is not "normal use." Sure it is. Teletext pages are already displayed in HTML pages (and they don't convert the Teletext pages to images before display; they could, but it is not necessary, and they don't). Teletext pages are also displayed in mobile phone (tablet) apps. Try out the web site "texttv.nu" (also available as an iOS app under the same name); it displays the current(!!!) Teletext pages from SVT (it may have some minutes of delay, if there is a change, and the app does notify of changes). Perfectly normal web pages (with text, not images), perfectly normal mobile app. There are several other web pages and apps that do similar display of Teletext pages, also for other TV channels. (I listed a few more in another email a few months ago.) (SVT do their own web pages for their Teletext content, but those pages are less faithful to the TV rendering: https://www.svt.se/svttext/webu/pages/100.html.) > It is reasonable for a non-standard application like this to interpret characters from U+0000 to U+001F as the corresponding ISO 646 characters would be in teletext. 
It is, frankly, the only choice. Quite the contrary, that is a definite NON-option. All of these Teletext "controls" can be converted to HTML/CSS (including charset switching before conversion to Unicode and including styling and character "object overrides" in Teletext). They include underline, bold, italics, proportional spacing, more colors and character replacements [the latter would be part of character conversion, not part of styling]. It is not that hard to figure out extensions to ECMA-48 to cover also the more odd bits (except "user defined" fonts), like "boxing". What is missing (currently) is the "separated mosaics graphic" characters. > >> I don't know how Teletext is represented in DVB or IP-TV; but those >> digital representations of TV images do not use traditional "analog" >> representation of TV images, and hence cannot have the "analog" >> representation of "rows" (lines) of text in Teletext. (And yes, >> Teletext does work fine with IP-TV.) > > Rows in teletext are defined in a completely different way from the now-standard model of a continuous stream of characters that are delimited by a sequence of one or more "end-of-line" control characters. The teletext row model is more akin to the fixed-length model from the punch-card and tape era. Yes, but that does in no way prevent conversion to using "normal" line breaking characters instead of the "row" concept. >> Note also that Teletext is rife with "code page switching". ESC >> toggles between a primary and a secondary charset (for text). In a >> control part of the Teletext protocol one sets the charsets for text >> (options include various "national variants" of ISO/IEC 646, as well >> as Greek, Hebrew and Arabic (visual order, preshaped). > > A teletext application would probably be expected to implement that as well. Yes, one that is "general purpose". (I cannot vouch that current converters to HTML are that complete.) > >> Toggling between separated and contiguous "mosaics" 
is also best seen >> as a switch between charsets. > > Which is why we did not propose the separated mosaics in Round 1, and Script Ad-Hoc and UTC agreed. ?? That seems to contradict what I said. > >> Regarding it as a styling is odd, since this particular styling would >> only apply to a few very rarely used characters, and the change is not >> one that is recognized as styling elsewhere. In addition, you have >> already encoded separated and contiguous other but similar "mosaics" >> characters as separate characters. > > We tried to be as consistent as possible with the Legacy Symbols proposal, Teletext is not legacy (yet). > and to propose things separately only where some legacy platform encoded them separately, not just with a mode shift or by masking the code point with 0x80. ???? > There may be imperfections in the model, based on what SAH did and did not approve. ???? > >> Even the colour controls in Teletext switch between text and mosaics >> (and in addition are usually displayed as a space, as is the norm in >> Teletext for "control" characters). > > That is certainly behavior that a teletext application should emulate. Part of character encoding conversion, not of styling. > >> Part of the Teletext protocol specifies how to set/unset bold/italic/ >> underline. But that is not inline in the text, it is "out-of-line" >> elsewhere in the protocol (in a control part). But colouring, certain >> sizing, blink, conceal, and "boxing" (used for (optional) subtitling >> and news flash messages) are inline. Note that Teletext is still often >> used for subtitling. > > Another reason why it is probably not appropriate to try to represent teletext in a plain-text file. You will need the styling, either as HTML/CSS (as is already done, though the conversion might not be complete), or using an extension of ECMA-48 for that. But there is no reason to perpetuate the arcane "Teletext controls" and (also arcane) "Teletext objects". 
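The kind of conversion Kent describes can be sketched in a few lines. The alpha-colour spacing attributes 0x00-0x07 of ETSI EN 300 706 list colours in the same black, red, green, yellow, blue, magenta, cyan, white order as the ECMA-48 SGR foreground parameters 30-37, so the colour mapping is a simple offset; everything else below (the function name, and the blanket treatment of all other control codes as plain spaces) is a simplifying assumption, not a full converter:

```python
# Sketch: convert one teletext row to a string with ECMA-48 SGR colouring.
# An alpha-colour spacing attribute also occupies a character cell, so it is
# emitted as a coloured space. All other controls (charset switching, mosaics,
# hold mosaics, double height, boxing, ...) are reduced to plain spaces here.

ALPHA_COLOURS = range(0x00, 0x08)  # black..white, same order as SGR 30-37

def teletext_row_to_sgr(row: bytes) -> str:
    out = []
    for b in row:
        if b in ALPHA_COLOURS:
            out.append(f"\x1b[{30 + b}m ")   # set foreground colour; cell shows as space
        elif 0x20 <= b <= 0x7E:
            out.append(chr(b))               # printable ASCII passes through
        else:
            out.append(" ")                  # unsupported control: displayed as space
    return "".join(out) + "\x1b[0m"          # reset attributes at end of row
```

A real converter would of course also have to cover the charset switching, mosaics and boxing behaviour discussed in this thread; the point here is only that the spacing-attribute-to-SGR part is mechanical.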
Otherwise it is perfectly reasonable to represent Teletext pages as HTML/CSS files (and that is done already, often including a navigation section to navigate more comfortably between pages, and converting triple-digits to links to other pages), or as (extended) ECMA-48 files. Perfectly normal files, with linefeed or HTML markup for representing lines/"rows". > You can certainly convert it to a plain-text file, with ECMA-48 sequences for styling and lines ending in CR and/or LF, but then it is no longer "teletext data" but a conversion. So? If you convert Teletext text (skipping over styling and such for the moment) to Unicode, it is no longer Teletext, since Teletext has nothing in Unicode? But you do want certain characters in Unicode just for the purpose of such a conversion? I think one needs to distinguish between the Teletext protocol (the synch scan line representation is already obsolete; but Teletext does still exist in DVB and IP-TV; the low level representation there I do not know, but see reference above to an ETSI standard about just that) and Teletext pages (the content). Teletext content is still being produced and presented via DVB/IP-TV as well as web pages and apps. The latter two obviously do not use the Teletext protocol; I don't know how, and in what format, they get the base page data from the TV channels. > >> Most of Teletext styling can be converted to ECMA-48 styling as is. >> Some others will need an extension of ECMA-48 to be representable in >> that framework. > > I read with interest your proposal last year to update ECMA-48. > I think the proposed extensions and clarifications had a better chance of adoption than the suggestions to change existing functionality outright. Some things have just diverged for absolutely no benefit. Some other things have been outright wrong in some implementations, and cannot be carried forward. > I am curious about the current status of that proposal; was it submitted anywhere? 
I'm still editing it; the very last changes (I have to stop tinkering ...). I hope it will be a UTN, I have proposed it as such. I think it would fit very well as a UTN. "Control functions", whether as singular codes or as escape sequences or as control sequences, have traditionally been seen as in the character encoding realm, and my proposal has several suggestions pertaining directly to Unicode in an ECMA-48 control sequence context. I'm not proposing that Unicode TC take over ECMA-48, but I have no hope of "reviving" in some way an ECMA-48 committee. But ECMA-48 control sequences are still very much part of our "digital text ecosystem", even though they are currently used almost exclusively in terminal emulators. HTML/CSS is not at all all-encompassing. So I think ECMA-48 needs an update for Unicode, as well as for other functionality. /Kent K > > -- > Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kittens at wobble.ninja Wed Oct 7 08:02:38 2020 From: kittens at wobble.ninja (Ellie) Date: Wed, 7 Oct 2020 15:02:38 +0200 Subject: Please fix the trademark policy in regards to code In-Reply-To: References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja> <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net> <20201001192635.2cccbb7c@JRWUBU2> Message-ID: Thanks for your responses, they have been very helpful! I do wonder, would it maybe be possible and useful to add a note for this anyway, just for informational purposes? E.g. the Linux foundation puts the following: "There are also some basic rights that everyone has to use any trademark, which are often referred to as "fair use," and The Linux Foundation does not intend to restrict those rights. You may make fair use of word marks to make true factual statements. But fair use does not permit you to state or imply that the owner of a mark produces, endorses, or supports your company, products, or services. 
Even when making fair use of a trademark, you should acknowledge the owner of the trademark with a trademark notice, such as the notice displayed on The Linux Foundation project websites." I would find such a remark helpful, although the last sentence kind of makes it again sound like they expect me to put (R) into the source code which I find a bit unfortunate. Some qualifier like "you should, +where that is practical to do, acknowledge ..." might help alleviate this, however. In general, unless you only expect people to care about the guidelines as soon as you send them angry legal letters, I feel like it would be helpful to try to be more explanatory for uneducated people like me regarding such practical questions. After all my intent reading the guidelines as a non-lawyer was to see if I could comply in reasonable ways anyway, which I would assume would be in your interest. Regards, Ellie On 10/7/20 1:11 AM, Bill Poser via Unicode wrote: > It is not the case that any use of a trademark constitutes infringement > or requires permission. Trademark owners sometimes try to persuade > people that this is so and issue rules as to how their trademark should > be used, but they're blowing smoke. A trademark is only infringed when > there is a risk of confusion. I can't name my beverage "Coca-Cola" or > something similar like "Cokacola", because there is a risk that > consumers will confuse it with the original product. I can write an > article about Coca-Cola, and refer to it by that name as much as I like, > because I am using the trademark to identify the product with which the > owner has associated it. I am not creating any risk of confusion in the > eye of the consumer. Similarly, the Unicode Consortium has the right to > block the use of "Unicode" in reference to some competing encoding > standard, but it has no right to block the use of the term in the title > or text of articles about the Unicode standard. 
> > On Thu, Oct 1, 2020 at 11:28 AM Richard Wordingham via Unicode > > wrote: > > On Thu, 1 Oct 2020 07:27:11 -0700 > Ken Whistler via Unicode > wrote: > > > References to "unicode" in code and related files and libraries would > > generally be considered just a functional "fair use" reference, and > > nothing more. Accordingly, the Unicode Consortium would not take > > exception to such a use, nor would it require the use of the > > trademark symbol in code, for the reasons you state. > > I don't see any evidence of the Unicode Consortium going after > Wikipedia either, despite articles simply entitled "Unicode", or even > after respellings in other scripts. As for > https://la.wikipedia.org/wiki/Unicodex..., that even uses the genitive > singular for "Signum Unicodicis" - isn't that required to be "Signum > Unicodex" if "Unicodex" is pukka Latin? > > Richard. (lingua in bucca posita) > From richard.wordingham at ntlworld.com Wed Oct 7 08:41:15 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 7 Oct 2020 14:41:15 +0100 Subject: Please fix the trademark policy in regards to code In-Reply-To: References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja> <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net> <20201001192635.2cccbb7c@JRWUBU2> Message-ID: <20201007144115.576c459c@JRWUBU2> On Wed, 7 Oct 2020 15:02:38 +0200 Ellie via Unicode wrote: > E.g. the Linux foundation puts the following: > > "... Even when making fair use of a trademark, you should acknowledge > the owner of the trademark with a trademark notice, such as the notice > displayed on The Linux Foundation project websites." > > I would find such a remark helpful, although the last sentence kind of > makes it again sound like they expect me to put (R) into the source > code which I find a bit unfortunate. Some qualifier like "you should, > +where that is practical to do, acknowledge ..." might help alleviate > this, however. 
In the type of specifications I normally encounter, the auxiliary "should" means that one doesn't have to if there is good reason not to. (Some compliance reviewers simply ignore any requirement modified by "should"!) Richard. From sosipiuk at gmail.com Wed Oct 7 10:31:28 2020 From: sosipiuk at gmail.com (Sławomir Osipiuk) Date: Wed, 7 Oct 2020 11:31:28 -0400 Subject: Please fix the trademark policy in regards to code In-Reply-To: References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja> <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net> <20201001192635.2cccbb7c@JRWUBU2> Message-ID: On Wed, Oct 7, 2020 at 9:04 AM Ellie via Unicode wrote: > > I would find such a remark helpful, although the last sentence kind of > makes it again sound like they expect me to put (R) into the source code > which I find a bit unfortunate. Some qualifier like "you should, +where > that is practical to do, acknowledge ..." might help alleviate this, > however. I believe rather than attaching a symbol to every instance of a trademark, a single comment at the beginning of the file would suffice, e.g. "Within this file, the word 'Unicode', and its variants, refers to the Unicode(R) Standard. Unicode(R) is a registered trademark of the Unicode Consortium". Or some similar legal boilerplate. That said, I think a note regarding source code and filenames can be added to the "Special Situations" section of the page you originally linked and would be helpful. 
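The single-notice approach suggested here costs nothing at the language level. A sketch of what such a file might look like (the notice wording and the helper functions are illustrative only, not official boilerplate):

```python
# Within this file, "unicode" refers to the Unicode(R) Standard.
# Unicode(R) is a registered trademark of Unicode, Inc.
# (One notice at the top; no symbols needed in identifiers below.)

def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value (in range, not a surrogate)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def utf8_length(cp: int) -> int:
    """Number of bytes needed to encode scalar value cp in UTF-8."""
    if not is_scalar_value(cp):
        raise ValueError(f"not a scalar value: {cp:#x}")
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4
```

The function and file names stay plain lowercase "unicode"-style identifiers; the comment block carries the acknowledgement once.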
Sławomir Osipiuk From richard.wordingham at ntlworld.com Wed Oct 7 15:21:43 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 7 Oct 2020 21:21:43 +0100 Subject: Teletext separated mosaic graphics In-Reply-To: <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> Message-ID: <20201007212143.2d4f534a@JRWUBU2> On Sun, 4 Oct 2020 18:07:26 -0600 Doug Ewell via Unicode wrote: > We all know there is no such thing in Unicode as a "hybrid" character > that is sometimes a control character and sometimes a graphic > character in normal use. That strikes me as a very good description of most of the 27 (as at Version 12) characters with an Indic syllabic category of virama. Richard. From doug at ewellic.org Wed Oct 7 15:54:29 2020 From: doug at ewellic.org (Doug Ewell) Date: Wed, 7 Oct 2020 14:54:29 -0600 Subject: Teletext separated mosaic graphics In-Reply-To: <20201007212143.2d4f534a@JRWUBU2> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> <20201007212143.2d4f534a@JRWUBU2> Message-ID: <000001d69cec$0d4b1e50$27e15af0$@ewellic.org> Richard Wordingham wrote: > Doug Ewell via Unicode wrote: > >> We all know there is no such thing in Unicode as a "hybrid" character >> that is sometimes a control character and sometimes a graphic >> character in normal use. > > That strikes me as a very good description of most of the 27 (as at > Version 12) characters with an Indic syllabic category of virama. A non-spacing mark (Mn) is not a control character (Cc). Whether it is rendered as a separate glyph or by modifying the glyph of a neighboring character is not the issue. 
There is no such thing in Unicode as a character which has more than one General_Category value. Either a character is a control character, or it is not. Of course, I can create a program or a protocol that takes ordinary graphic characters such as < and >, and handles them in some special way, but then I am creating a new layer on top of plain text. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From harjitmoe at outlook.com Wed Oct 7 17:25:08 2020 From: harjitmoe at outlook.com (Harriet Riddle) Date: Wed, 7 Oct 2020 23:25:08 +0100 Subject: Teletext separated mosaic graphics In-Reply-To: <000001d69cec$0d4b1e50$27e15af0$@ewellic.org> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> <20201007212143.2d4f534a@JRWUBU2> <000001d69cec$0d4b1e50$27e15af0$@ewellic.org> Message-ID: Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > >> [...] >> That strikes me as a very good description of most of the 27 (as at >> Version 12) characters with an Indic syllabic category of virama. > A non-spacing mark (Mn) is not a control character (Cc). Whether it is rendered as a separate glyph or by modifying the glyph of a neighboring character is not the issue. > > There is no such thing in Unicode as a character which has more than one General_Category value. Either a character is a control character, or it is not. > > Of course, I can create a program or a protocol that takes ordinary graphic characters such as < and >, and handles them in some special way, but then I am creating a new layer on top of plain text. 
> > -- > Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org > --- Some comparisons of type-Cc and non-type-Cc characters with comparable, although not necessarily identical, behaviours (provided that the type-Cc characters are interpreted in accordance with ECMA-48, as I shall come to later): * CR (U+000D), LF (U+000A) and NEL (U+0085) are all Cc, versus LS/LSEP (U+2028), which is Zl. * VT (U+000B) and FF (U+000C) are Cc, whereas PS/PSEP (U+2029) is Zp. * BPH (U+0082) is Cc, whereas SHY (U+00AD) and ZWSP (U+200B) are both Cf. * NBH (U+0083) is Cc, whereas WJ (U+2060) and ZWNBSP/BOM (U+FEFF) are both Cf. * PLU (U+008C) to start a superscript is Cc, whereas IAS (U+FFFA) to start a furigana section is Cf. * SSA (U+0086) and its terminator ESA (U+0087) are Cc, whereas for example RLO (U+202E), which similarly affects all following characters until further notice, is Cf. That being said, not everything which is appropriate for a Cc character is appropriate elsewhere: it would clearly be inappropriate for (say) DC1 or BEL, both of which issue instructions to something very much outside of the sandbox (so to speak) of the text renderer, to be anything other than Cc characters. However, format effector functions (such as the above), i.e. those which constitute instructions to the text renderer and/or layout engine specifically, evidently do not have to be possessed by Cc characters. Indeed, this is the entire purpose of the Cf (format) category. It is perhaps helpful to draw a distinction, in fine, between a control code in the vernacular sense (non-printing but does something) versus in the much more restricted sense of a category Cc character. The former may have functions defined by Unicode itself, whereas the latter are the domain of a control code standard such as ECMA-48. 
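These category contrasts can be checked mechanically with Python's standard unicodedata module (note that U+2028 LINE SEPARATOR reports as Zl, the line-separator category); a minimal look-up:

```python
import unicodedata

# General_Category of a sample of the characters discussed: each character has
# exactly one category, so "control" (Cc) and "format" (Cf) never overlap.
samples = {
    0x000D: "CR",    # Cc
    0x0085: "NEL",   # Cc
    0x2028: "LS",    # Zl (line separator)
    0x2029: "PS",    # Zp (paragraph separator)
    0x00AD: "SHY",   # Cf
    0x200B: "ZWSP",  # Cf
    0xFFFA: "IAS",   # Cf
    0x202E: "RLO",   # Cf
}
for cp, name in samples.items():
    print(f"U+{cp:04X} {name}: {unicodedata.category(chr(cp))}")
```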
Anyway, regarding ECMA-48 versus not ECMA-48: Interpretation of Cc characters seems to be treated as a higher-level protocol, per chapter 23.1 of the Unicode core specification, which names ISO 6429 (i.e. ECMA-48) as /one possible/ such protocol but not the only one, while only listing semantics for HT, LF, VT, FF, CR, FS, GS, RS, US and NEL (i.e. the format effectors and information separators) and describing the basic concept of an ESC sequence without fully specifying their higher-level syntax, expressly leaving escape sequences and interpretation of most control codes to higher level protocols. ISO 10646 similarly names ISO 6429 (i.e. ECMA-48) in section 11, but qualifies this with "or similarly structured standards". Section 12.4 specifies the escape sequences to indicate use of ECMA-48 within UCS, but then (on the next page) specifies the general sequences to indicate use of other ISO-IR control code sets within UCS. Confusingly, this specification of how an ECMA-35 control code set designation is to be represented in UCS (i.e. padded to the word size of the encoding, a moot point in UTF-8) comes after section 11's statement of ISO 2022 (i.e. ECMA-35) designation escapes being forbidden in UCS. I personally understand this apparent contradiction in the standard as meaning that designation escapes for /graphical sets/ are forbidden per section 11 (UCS being a monolithic graphical set in itself, they would be ambiguous and nonsensical in meaning were they used), but that those for /control code sets/ may be used with appropriate padding if required by higher level protocols per section 12.4, since the semantics of category Cc characters are left more open to higher protocols. I understand the sum of this to be that, while use of ECMA-48 for interpreting category Cc characters is recommended, this can be overridden by prior agreement on another higher level standard protocol. 
However: although MARC 21, the standard defining character encodings for Library of Congress records, uses a subset of ISO 6630 with some extensions (in positions not used by ISO 6630) as its C1 set within MARC-8 (its 8-bit, somewhat ECMA-35-based encoding), it however uses ECMA-48 as its C1 within Unicode, which means that it resorts to using SOS and ST instead of NSB and NSE (marking up a range of characters to be ignored during collation but nonetheless displayed). Notably, MARC-8's extensions to the ISO 6630 C1 set are ZWJ and ZWNJ, which are included in Unicode as non-Cc characters (U+200D and U+200C, both Cf). So there is some precedent to considering it inappropriate to just copy C0 and C1 codes from non-ECMA-48 sets into Unicode streams. However: EBCDIC mappings (both UTF-EBCDIC and the Microsoft-supplied ones on Unicode.org) conventionally map the EBCDIC control codes to Unicode in a specific manner (well, two specific manners, differing only in LF→LF and NL→NEL versus NL→LF and LF→NEL) but, apart from aligning either LF or NL up with NEL, these make no attempt at any sort of partial compatibility with the ECMA-48 C1 set (e.g. putting SBS at U+0098 and SPS at U+008D, as opposed to aligning them with PLD and PLU at U+008B and U+008C respectively, which do the same thing). They do, however, match ASCII/ECMA-48 with their C0 mappings. So using C1 control mappings which pay little or no regard to ECMA-48 is not without precedent either. Final note: I previously linked the ISO-IR document for the Videotex Data Syntax 2 (ITU T.101 Annex C) "Serial" variant C1 controls, otherwise known as the "Attribute Control Set for UK Videotex". This is registered with ISO-IR, and hence does also have an escape sequence to declare it as stipulated in section 12.4 of ISO 10646 (the bit on page 20, specifically). The teletext set, by contrast, is not. 
However, the Data Syntax 2 Serial Videotex C1 controls are basically the same as the ETS Teletext control set but with ESC removed, CSI added in its place, and encoding them over the C1 range rather than the C0 range as in Teletext. Since Teletext's unusual use of ESC for code switching would presumably be handled in the process of transcoding to Unicode, this would be one way of marshalling Teletext control data through Unicode with a higher level protocol, provided that interoperation with something using ECMA-48 codes besides CSI or its sequences is not needed (e.g. DCS in terminals or OSC in terminal emulators). -- Har. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Thu Oct 8 07:40:12 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 8 Oct 2020 13:40:12 +0100 (BST) Subject: Teletext separated mosaic graphics In-Reply-To: <000001d69cec$0d4b1e50$27e15af0$@ewellic.org> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> <20201007212143.2d4f534a@JRWUBU2> <000001d69cec$0d4b1e50$27e15af0$@ewellic.org> Message-ID: <8e0cabc.e4a.1750839e2aa.Webtop.45@btinternet.com> Doug Ewell wrote: > Of course, I can create a program or a protocol that takes ordinary > graphic characters such as < and >, and handles them in some special > way, but then I am creating a new layer on top of plain text. So could the twenty-seven control characters in the 1976 teletext specification be encoded as ordinary displayable characters in plane 14 such that they may, but need not, be used as control characters in such a program or protocol please? Other codes used in later teletext formats and videotext formats could also be encoded if so desired. By the way, I once had a page on viewdata. 
I saw an article on viewdata by Mr Fedida in an issue of the magazine Wireless World, and I wrote to him, enclosing a design for a page, and he kindly arranged for it to be keyed in on page 786. I saw it on a viewdata television in September 1977 and I wonder if it survives in an archive somewhere. I suppose that it is now part of the historic graphic art from that era. I produced my design on a sheet of paper from one of those quadrille-ruled notebooks intended for arithmetic. I used black ink for the text, one character per square, and red ink for the control codes, providing a key below the page diagram. G for Alphanumerics Green, g for Graphics Green, and so on, H for Hold Graphics. In fact, this design, which I called Colour Check, used only the seven graphic colour codes and the Hold Graphics code, and the graphic character corresponding to lowercase e. There was a large red filled rectangle at the left, a large green filled rectangle at the right, and a large blue filled rectangle centred and lower down the page. The top line of the green filled rectangle was not on the same line as the top line of the red filled rectangle: I am not sure whether it was higher or lower. The three filled rectangles overlapped and where the overlaps occurred the contiguous graphic characters were in yellow, white, magenta, or cyan as appropriate. A line typically started with a graphic colour code (green, red, or blue according to which line), followed by a space and then a Hold Graphics code. So where the control codes to change colour within the image occurred there was not a space displayed, as the Hold Graphics facility replaced the space with a copy of the previous graphic character. The background was black, and the fact that I used the graphic equivalent of a letter e produced the effect that the coloured areas were not solid filled but had a teletext-format look. The graphic was centred, with black to the left, to the right, above, and below.
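The Hold Graphics behaviour described above can be sketched as a toy model. This is based only on the behaviour described in this message, not on the full teletext specification; the code values assumed (mosaic colours at 0x11-0x17, Hold Graphics at 0x1E) follow the usual teletext assignments, and real attribute handling is more involved:

```python
# Toy model of teletext "Hold Graphics": a control code normally
# displays as a space, but while Hold Graphics is active the most
# recently displayed graphic character is repeated in its place.
MOSAIC_COLOURS = set(range(0x11, 0x18))  # graphics red .. graphics white
HOLD_GRAPHICS = 0x1E

def render_row(codes):
    out, held, hold = [], " ", False
    for c in codes:
        if c >= 0x20:               # a displayable character cell
            out.append(chr(c))
            held = chr(c)           # remember it for Hold Graphics
        else:                       # a control code cell
            if c == HOLD_GRAPHICS:
                hold = True
            out.append(held if hold else " ")
    return "".join(out)
```

With a row like colour-code, space, Hold Graphics, graphic, colour-code, graphic, the second colour code displays a copy of the preceding graphic rather than a gap, which is the effect described above.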
Some early teletext pages have been recovered from super-VHS tapes that had recorded television programs. I am wondering whether the early ITV Oracle graphic of the Blue Lady has been recovered. William Overington Thursday 8 October 2020 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Thu Oct 8 17:10:14 2020 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 8 Oct 2020 18:10:14 -0400 Subject: Teletext separated mosaic graphics In-Reply-To: <8e0cabc.e4a.1750839e2aa.Webtop.45@btinternet.com> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> <20201007212143.2d4f534a@JRWUBU2> <000001d69cec$0d4b1e50$27e15af0$@ewellic.org> <8e0cabc.e4a.1750839e2aa.Webtop.45@btinternet.com> Message-ID: An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri Oct 9 10:00:31 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 9 Oct 2020 16:00:31 +0100 (BST) Subject: Teletext separated mosaic graphics Message-ID: <15b543e.5f7.1750de0b5d7.Webtop.42@btinternet.com> Mark E. Shoulson wrote: > Isn't that kind of what the Control Pictures block (U+2400) is? > So if you're willing to do that, go ahead and create a program or > protocol that takes the ordinary graphic characters U+2400 through > U+2426 and handles them in some special way, creating a new layer on > top of plain text. Thank you for the suggestion. One could indeed use twenty-seven of the characters in the range U+2400 .. U+241F in that manner. As a short-term solution it is, in my opinion, a bit better than using a Private Use Area solution, much better than using twenty-seven codes from the range U+0080 .. U+009F and very much better than using twenty-seven codes from the range U+0000 .. U+001F.
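The Control Pictures substitution discussed here can be sketched as a trivial, lossless mapping between the 7-bit teletext control range and U+2400..U+241F. This is an illustration only (whether such a markup layer is a good idea is exactly what is being debated in this thread), and the function names are mine:

```python
# Map bytes 0x00..0x1F to the corresponding Control Pictures characters
# U+2400..U+241F, and back. The pictures are ordinary graphic characters,
# so this is a markup layer on top of plain text, not a control-character
# encoding; other bytes pass through as their ASCII characters.
def to_pictures(data: bytes) -> str:
    return "".join(chr(0x2400 + b) if b < 0x20 else chr(b) for b in data)

def from_pictures(text: str) -> bytes:
    return bytes(cp - 0x2400 if 0x2400 <= (cp := ord(ch)) < 0x2420 else cp
                 for ch in text)
```

The round trip is exact, so a page archived this way could be fed back to software that treats the pictures as teletext controls.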
However, for long-term storage and archiving of teletext pages within documents that contain notes about them, all of those solutions have problems. They are all essentially markup solutions. I have had similar issues with one of my inventions, where encoding into regular Unicode has thus far not been achieved as such encoding has been declared out of scope. Thus I have used various markup solutions in order to make progress. One is to use an integral sign followed by a sequence of circled digits. Another markup solution that I am using is an exclamation mark followed by ordinary digits. Both are effective, but are not regular Unicode plain text solutions. Maybe one day a regular Unicode plain text encoding will be possible. The proposal for the invention to become an international standard is with ISO. There had been a good chance of a slide show that I produced being presented at a plenary conference in June 2020, but the conference was cancelled due to the COVID-19 situation and a virtual meeting replacement has been difficult to hold because of time zone issues. If ISO decides to standardize the invention as an international standard, it may then be possible to have a rigorous regular Unicode encoding of the codes, thus providing a plain-text, unambiguous, non-proprietary, interoperable format. I opine that the elegant long-term solution for the teletext control characters is to encode twenty-seven codes from a block of thirty-two code points in plane 14, keeping one-to-one correspondence with the final five bits of the original teletext control code encoding. They could be encoded as displayable characters so as to provide a graceful, helpful fallback if specialist software to act upon the characters as if they are teletext control characters is not available. William Overington Friday 9 October 2020 -------------- next part -------------- An HTML attachment was scrubbed...
URL: From mark at kli.org Fri Oct 9 14:37:52 2020 From: mark at kli.org (Mark E. Shoulson) Date: Fri, 9 Oct 2020 15:37:52 -0400 Subject: Teletext separated mosaic graphics In-Reply-To: <15b543e.5f7.1750de0b5d7.Webtop.42@btinternet.com> References: <15b543e.5f7.1750de0b5d7.Webtop.42@btinternet.com> Message-ID: <572057df-7cae-5114-cde6-fa8877ff16ae@kli.org> An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Sat Oct 10 05:40:35 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 10 Oct 2020 11:40:35 +0100 (BST) Subject: Teletext separated mosaic graphics In-Reply-To: <73694d96.bc1.175121648be.Webtop.226@btinternet.com> References: <73694d96.bc1.175121648be.Webtop.226@btinternet.com> Message-ID: <71025709.bc7.175121915bf.Webtop.226@btinternet.com> Mark E. Shoulson wrote: > I assumed, since you were responding to that, that you were drawing > some sort of parallel, that it should be possible to have a layer on > top of plain text, i.e. a markup layer, wherein certain printable > characters could represent other meanings. Your assumption was correct. > It sounds like your request didn't actually relate to what Doug Ewell > was saying, since you then reject markup solutions. Well, I suppose that there may well quite justifiably seem to be a logical inconsistency in me not being enthusiastic about some particular markup solutions and then suggesting what is another markup solution. In my mind there is a big difference in that my suggestion does not involve overloading the original meaning of the character with another distinctly different meaning. So in my mind my request for encoding the teletext control characters as displayable characters in plane 14 with the specific and documented intention that they may, but need not, be used in a markup manner did, and still does, seem to build upon what Doug wrote. Yet maybe I am pushing the envelope too far, though maybe not. 
I opine that, notwithstanding any possible logical inconsistency with longstanding custom and practice, encoding the teletext control characters as displayable characters in plane 14 as I suggest is the best long term solution. However, I am open to other suggestions if anyone can think of a better way to proceed. > I drew the wrong conclusion; ... No, you were correct. > ... I guess you meant to post your request/suggestion in a new thread > and not as a reply. I was building on what Doug wrote. William Overington Saturday 10 October 2020 -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at honermann.net Sat Oct 10 13:54:56 2020 From: tom at honermann.net (Tom Honermann) Date: Sat, 10 Oct 2020 14:54:56 -0400 Subject: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature Message-ID: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text.? This is follow up to discussion on the Unicode mailing list back in June. Feedback is welcome.? I plan to submit this to the UTC in a week or so pending review feedback. Tom. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Unicode-BOM-guidance.pdf Type: application/pdf Size: 67891 bytes Desc: not available URL: From kent.b.karlsson at bahnhof.se Sat Oct 10 17:02:35 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Sun, 11 Oct 2020 00:02:35 +0200 Subject: Teletext separated mosaic graphics References: Message-ID: Here are a few more web sites showing Teletext pages from various European TV channels. THE LIST IS SURELY FAR FROM COMPLETE, it is just a sample. But it does show that Teletext is commonly displayed as web pages, not just via TV channels (whether "analog" or DVB). 
I haven't seen these combined with web versions of TV channels, but it would surely be possible to combine them. That would be especially useful for optional subtitling, where Teletext is still much used, as a useful accessibility feature. I have no prediction of how long any channels will continue to produce Teletext content. But optional subtitling seems to "survive" longer. I do not know what source format(s) may be used, but it is surely neither HTML *nor* close to the Teletext protocol. But see the Teletext page edit tool referenced below.

Spain, RTVE: https://www.rtve.es/television/teletexto/100/
Sweden, SVT: https://texttv.nu/ (also as iOS app, same name) https://www.svt.se/svttext/web/pages/100.html
Iceland, RÚV: http://textavarp.is/sida/100
Denmark, DR: https://www.dr.dk/cgi-bin/fttx1.exe/100
Norway, NRK: https://www.nrk.no/tekst-tv/100/
Finland, YLE: https://yle.fi/aihe/tekstitv
Switzerland, SRF: https://www.teletext.ch/
Croatia, HRT: https://teletekst.hrt.hr/
Greece: https://www.greektvidents.com/Teletext_ERTEXT.shtml

And more; I have not done a complete survey! There are also several apps for iOS and for Android that display Teletext content from various (TV channel) providers. What the source format is for the Teletext pages as produced today, I don't know. But I would guess that it is likely "plain text" files, with Teletext-specific markup, that is then converted to 1) Teletext analog format, 2) Teletext DVB format, 3) HTML. But that is just my guess. Again note that Teletext is still commonly used for optional subtitles. (DVB subtitles, a "bitmapped" format (i.e. the subtitles are sent as images, not text), do not seem to be used much. At least, I haven't seen them.) This requires timing, which is not part of the Teletext protocol, but must be in the source in order to control when a subtitle is output as Teletext for optional display. So "But a teletext application for a modern computer is not "normal use."
It is reasonable for a non-standard application like this to interpret characters from U+0000 to U+001F as the corresponding ISO 646 characters would be in teletext." is very false. Further, the "object" overrides in the Teletext *protocol* (at several levels, with "objects" prioritized depending on "implementation level") can specify: 1) Bold, italic, underline, proportional font. 2) More colours (but only 16 levels per red/green/blue, no transparency though). 3) Character substitutions (likely replacing spaces) to be able to display characters from "G3". These cannot be handled by "retaining" the ill-designed control codes of Teletext anyway. Have an urge to edit your own Teletext pages? Here's the web page for doing just that: https://zxnet.co.uk/teletext/editor You can save your page in a handful of formats (plus as image). I haven't analyzed these formats, but presumably they are storage formats actually used for "real" Teletext pages that are converted to be transmitted ("analog" (outdated) or DVB) or given as web pages (HTML, but no "separated mosaic" characters, since they are not yet allocated in Unicode; could use small images though...). /Kent K -------------- next part -------------- An HTML attachment was scrubbed...
URL: From doug at ewellic.org Sun Oct 11 15:52:15 2020 From: doug at ewellic.org (Doug Ewell) Date: Sun, 11 Oct 2020 14:52:15 -0600 Subject: Teletext separated mosaic graphics In-Reply-To: <15b543e.5f7.1750de0b5d7.Webtop.42@btinternet.com> References: <15b543e.5f7.1750de0b5d7.Webtop.42@btinternet.com> Message-ID: <002001d6a010$67536d90$35fa48b0$@ewellic.org> William_J_G Overington wrote: > I opine that the elegant long-term solution for the teletext control > characters is to encode twenty-seven codes from a block of thirty-two > code points in plane 14, keeping one-to-one correspondence with the > final five bits of the original teletext control code encoding. They > could be encoded as displayable characters so as to provide a > graceful, helpful, fallback if specialist software to act upon the > characters as if they are teletext control characters is not > available. Anyone may write a proposal for whatever characters they wish. I will not be participating in such a proposal, because those of us who worked on the Symbols for Legacy Computing proposal asked two or three years ago about encoding control characters disguised as graphic characters, and the answer from Script Ad Hoc was that they were outside the scope of what UTC would encode; and I do not believe UTC's attention span, in terms of fundamentally changing what they will and will not encode (as suggested by the earlier reference to emoji), is as short as two or three years. Members of Script Ad Hoc and UTC recommended to us that code points in the UCS range from U+0000 to U+001F be used to implement teletext functions originally encoded in the 7-bit range from 0x00 to 0x1F. I suggest that those who disagree with this approach take it up with those who recommended it. 
-- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From tom at honermann.net Sun Oct 11 22:22:46 2020 From: tom at honermann.net (Tom Honermann) Date: Sun, 11 Oct 2020 23:22:46 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> Message-ID: On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: > One concern I have, that might lead into rationale for the current > discouragement, > is that I would hate to see a best practice that pushes a BOM into > ASCII files. > One of the nice properties of UTF-8 is that a valid ASCII file (still > very common) is > also a valid UTF-8 file. Changing best practice would encourage > updating those > files to be no longer ASCII. Thanks, Alisdair. I think that concern is implicitly addressed by the suggested resolutions, but perhaps that can be made more clear. One possibility would be to modify the "protocol designer" guidelines to address the case where a protocol's default encoding is ASCII based and to specify that a BOM is only required for UTF-8 text that contains non-ASCII characters. Would that be helpful? Tom. > > AlisdairM > >> On Oct 10, 2020, at 14:54, Tom Honermann via SG16 >> > wrote: >> >> Attached is a draft proposal for the Unicode standard that intends to >> clarify the current recommendation regarding use of a BOM in UTF-8 >> text. This is follow-up to discussion on the Unicode mailing list >> >> back in June. >> >> Feedback is welcome. I plan to submit >> this to the UTC in a >> week or so pending review feedback. >> >> Tom. >> >> -- >> SG16 mailing list >> SG16 at lists.isocpp.org >> https://lists.isocpp.org/mailman/listinfo.cgi/sg16 > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From tom at honermann.net Sun Oct 11 22:37:04 2020 From: tom at honermann.net (Tom Honermann) Date: Sun, 11 Oct 2020 23:37:04 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> Message-ID: <6615002f-7f7d-c1c6-fc25-4418f7af06ee@honermann.net> On 10/11/20 11:32 PM, JF Bastien wrote: > It's a bit odd: if you assume the default is ascii then you don't need > this. If you assume the default is utf8 then you don't need this... so > when do you need the BOM? It seems like making bad prior choices more > acceptable... even though they were bad choices. I'm not sure it's a > good idea. A BOM would be needed when: 1. The default encoding is ASCII based (ISO-8859-1, Windows-1252, etc...) and the UTF-8 text to be produced contains non-ASCII characters. Or, 2. The default encoding is not ASCII based (e.g., EBCDIC). Both of these cases presume that the default encoding can't be made UTF-8 for backward compatibility reasons. Tom. > > On Sun, Oct 11, 2020 at 8:22 PM Tom Honermann via SG16 > > wrote: > > On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: >> One concern I have, that might lead into rationale for the >> current discouragement, >> is that I would hate to see a best practice that pushes a BOM >> into ASCII files. >> One of the nice properties of UTF-8 is that a valid ASCII file >> (still very common) is >> also a valid UTF-8 file. Changing best practice would encourage >> updating those >> files to be no longer ASCII. > > Thanks, Alisdair. I think that concern is implicitly addressed by > the suggested resolutions, but perhaps that can be made more > clear.
One possibility would be to modify the "protocol designer" > guidelines to address the case where a protocol's default encoding > is ASCII based and to specify that a BOM is only required for > UTF-8 text that contains non-ASCII characters. Would that be helpful? > > > Tom. > >> >> AlisdairM >> >>> On Oct 10, 2020, at 14:54, Tom Honermann via SG16 >>> > wrote: >>> >>> Attached is a draft proposal for the Unicode standard that >>> intends to clarify the current recommendation regarding use of a >>> BOM in UTF-8 text. This is follow-up to discussion on the >>> Unicode mailing list >>> >>> back in June. >>> >>> Feedback is welcome. I plan to submit >>> this to the UTC >>> in a week or so pending review feedback. >>> >>> Tom. >>> >>> -- >>> SG16 mailing list >>> SG16 at lists.isocpp.org >>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16 >> >> > > -- > SG16 mailing list > SG16 at lists.isocpp.org > https://lists.isocpp.org/mailman/listinfo.cgi/sg16 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cxx at jfbastien.com Sun Oct 11 22:32:39 2020 From: cxx at jfbastien.com (JF Bastien) Date: Sun, 11 Oct 2020 20:32:39 -0700 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> Message-ID: It's a bit odd: if you assume the default is ascii then you don't need this. If you assume the default is utf8 then you don't need this... so when do you need the BOM? It seems like making bad prior choices more acceptable... even though they were bad choices. I'm not sure it's a good idea.
On Sun, Oct 11, 2020 at 8:22 PM Tom Honermann via SG16 < sg16 at lists.isocpp.org> wrote: > On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: > > One concern I have, that might lead into rationale for the current > discouragement, > is that I would hate to see a best practice that pushes a BOM into ASCII > files. > One of the nice properties of UTF-8 is that a valid ASCII file (still very > common) is > also a valid UTF-8 file. Changing best practice would encourage updating > those > files to be no longer ASCII. > > Thanks, Alisdair. I think that concern is implicitly addressed by the > suggested resolutions, but perhaps that can be made more clear. One > possibility would be to modify the "protocol designer" guidelines to > address the case where a protocol's default encoding is ASCII based and to > specify that a BOM is only required for UTF-8 text that contains non-ASCII > characters. Would that be helpful? > > > Tom. > > > AlisdairM > > On Oct 10, 2020, at 14:54, Tom Honermann via SG16 > wrote: > > Attached is a draft proposal for the Unicode standard that intends to > clarify the current recommendation regarding use of a BOM in UTF-8 text. > This is follow up to discussion on the Unicode mailing list > back > in June. > > Feedback is welcome. I plan to submit > this to the UTC in a > week or so pending review feedback. > > Tom. > -- > SG16 mailing list > SG16 at lists.isocpp.org > https://lists.isocpp.org/mailman/listinfo.cgi/sg16 > > > > > -- > SG16 mailing list > SG16 at lists.isocpp.org > https://lists.isocpp.org/mailman/listinfo.cgi/sg16 > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jameskass at code2001.com Sun Oct 11 23:28:53 2020 From: jameskass at code2001.com (James Kass) Date: Mon, 12 Oct 2020 04:28:53 +0000 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <6615002f-7f7d-c1c6-fc25-4418f7af06ee@honermann.net> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <6615002f-7f7d-c1c6-fc25-4418f7af06ee@honermann.net> Message-ID: <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> On 2020-10-12 3:37 AM, Tom Honermann via Unicode wrote: > On 10/11/20 11:32 PM, JF Bastien wrote: >> It's a bit odd: if you assume the default is ascii then you don't >> need this. If you assume the default is utf8 then you don't need >> this... so when do you need the BOM? It seems like making bad prior >> choices more acceptable... even though they were bad choices. I'm not >> sure it's a good idea. > > A BOM would be needed when: > > 1. The default encoding is ASCII based (ISO-8859-1, Windows-1252, >    etc...) and the UTF-8 text to be produced contains non-ASCII >    characters. Or, > 2. The default encoding is not ASCII based (e.g., EBCDIC). > > Both of these cases presume that the default encoding can't be made > UTF-8 for backward compatibility reasons. > > Tom. 1. UTF-8 text consists only of ASCII characters. Even if some ASCII strings reference non-ASCII characters. It's the same idea as HTML numeric character references which point to non-ASCII characters while being composed of ASCII characters. It shouldn't matter whether a string of ASCII digits forms the character number or a string of UTF-8 hex bytes forms that number. A Unicode-aware application will display the string as a special character while legacy applications will show the string as mojibake. Either way, UTF-8 remains an ASCII-preserving encoding format. 2. Files using non-standard encodings should be converted to Unicode.
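The behaviour being discussed in this thread can be sketched in a few lines: accept and discard a leading UTF-8 BOM on input, and only consider adding one when the text actually contains non-ASCII characters. The function names are mine, not taken from Tom's draft proposal:

```python
# U+FEFF encoded in UTF-8: the three bytes used as a UTF-8 "BOM"
# (encoding signature).
UTF8_BOM = b"\xef\xbb\xbf"

def strip_bom(data: bytes) -> bytes:
    """Accept and discard a leading UTF-8 BOM, if present."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data

def needs_signature(text: str) -> bool:
    """True only when the text contains non-ASCII characters, so that a
    pure-ASCII file is never turned into a non-ASCII file by a BOM."""
    return not text.isascii()
```

This matches the concern raised earlier in the thread: prepending the three signature bytes to an all-ASCII file is precisely what makes it stop being a valid ASCII file.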
Any plain-text file should be presumed to be UTF-8 unless marked otherwise. Years ago, the UTF-8 signature was sometimes considered helpful. Nowadays it seems to be more of an anachronism. From tom at honermann.net Mon Oct 12 08:35:23 2020 From: tom at honermann.net (Tom Honermann) Date: Mon, 12 Oct 2020 09:35:23 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <6615002f-7f7d-c1c6-fc25-4418f7af06ee@honermann.net> <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> Message-ID: <7b7fabe4-5cc8-6f72-0dfc-02d53591da65@honermann.net> On 10/12/20 12:28 AM, James Kass via Unicode wrote: > > > On 2020-10-12 3:37 AM, Tom Honermann via Unicode wrote: >> On 10/11/20 11:32 PM, JF Bastien wrote: >>> It's a bit odd: if you assume the default is ascii then you don't >>> need this. If you assume the default is utf8 then you don't need >>> this... so when do you need the BOM? It seems like making bad prior >>> choices more acceptable... even though they were bad choices. I'm >>> not sure it's a good idea. >> >> A BOM would be needed when: >> >> 1. The default encoding is ASCII based (ISO-8859-1, Windows-1252, >>    etc...) and the UTF-8 text to be produced contains non-ASCII >>    characters. Or, >> 2. The default encoding is not ASCII based (e.g., EBCDIC). >> >> Both of these cases presume that the default encoding can't be made >> UTF-8 for backward compatibility reasons. >> >> Tom. > > > 1. UTF-8 text consists only of ASCII characters. Even if some ASCII > strings reference non-ASCII characters. It's the same idea as HTML > numeric character references which point to non-ASCII characters while > being composed of ASCII characters.
It shouldn't matter whether a > string of ASCII digits forms the character number or a string of UTF-8 > hex bytes forms that number. A Unicode-aware application will display > the string as a special character while legacy applications will show > the string as mojibake. Either way, UTF-8 remains an ASCII-preserving > encoding format. I don't understand this response. UTF-8 lead and trailing bytes are not ASCII characters. Perhaps you are using "ASCII" to refer to the set of 8-bit ASCII-based encodings? ASCII is a 7-bit encoding. > > 2. Files using non-standard encodings should be converted to Unicode. > > Any plain-text file should be presumed to be UTF-8 unless marked > otherwise. That doesn't match existing practice on Windows, where most applications assume the encoding of the Active Code Page (e.g., Windows-1252). > > Years ago, the UTF-8 signature was sometimes considered helpful. > Nowadays it seems to be more of an anachronism. I think that is true in some contexts; e.g., on the web and on most POSIX systems. I don't think it is true in general though. Tom. From tom at honermann.net Mon Oct 12 09:02:49 2020 From: tom at honermann.net (Tom Honermann) Date: Mon, 12 Oct 2020 10:02:49 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> Message-ID: <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> Great, here is the change I'm making to address this: Protocol designers:

* If possible, mandate use of UTF-8 without a BOM; diagnose the presence of a BOM in consumed text as an error, and produce text without a BOM.
* Otherwise, if possible, mandate use of UTF-8 with or without a BOM; accept and discard a BOM in consumed text, and produce text without a BOM.
* Otherwise, if possible, use UTF-8 as the default encoding with use of other encodings negotiated using information other than a BOM; accept and discard a BOM in consumed text, and produce text without a BOM.
* Otherwise, require the presence of a BOM to differentiate UTF-8 encoded text in both consumed and produced text *unless the absence of a BOM would result in the text being interpreted as an ASCII-based encoding and the UTF-8 text contains no non-ASCII characters (the exception is intended to avoid the addition of a BOM to ASCII text thus rendering such text as non-ASCII)*. This approach should be reserved for scenarios in which UTF-8 cannot be adopted as a default due to backward compatibility concerns.

Tom. On 10/12/20 8:40 AM, Alisdair Meredith wrote: > That addresses my main concern. Essentially, best practice (for > UTF-8) would be no BOM unless the document contains code points that > require multiple code units to express. > > AlisdairM > >> On Oct 11, 2020, at 23:22, Tom Honermann > > wrote: >> >> On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: >>> One concern I have, that might lead into rationale for the current >>> discouragement, >>> is that I would hate to see a best practice that pushes a BOM into >>> ASCII files. >>> One of the nice properties of UTF-8 is that a valid ASCII file >>> (still very common) is >>> also a valid UTF-8 file. Changing best practice would encourage >>> updating those >>> files to be no longer ASCII. >> >> Thanks, Alisdair. I think that concern is implicitly addressed by >> the suggested resolutions, but perhaps that can be made more clear. >> One possibility would be to modify the "protocol designer" guidelines >> to address the case where a protocol's default encoding is ASCII >> based and to specify that a BOM is only required for UTF-8 text that >> contains non-ASCII characters. Would that be helpful? >> >> Tom.
>> >>> >>> AlisdairM >>> >>>> On Oct 10, 2020, at 14:54, Tom Honermann via SG16 >>>> > wrote: >>>> >>>> Attached is a draft proposal for the Unicode standard that intends >>>> to clarify the current recommendation regarding use of a BOM in >>>> UTF-8 text. This is follow up to discussion on the Unicode mailing >>>> list >>>> >>>> back in June. >>>> >>>> Feedback is welcome.? I plan to submit >>>> this to the UTC in >>>> a week or so pending review feedback. >>>> >>>> Tom. >>>> >>>> -- >>>> SG16 mailing list >>>> SG16 at lists.isocpp.org >>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16 >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Oct 12 11:15:06 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 12 Oct 2020 09:15:06 -0700 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <6615002f-7f7d-c1c6-fc25-4418f7af06ee@honermann.net> <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> Message-ID: <17c8d0e3-0186-1e36-96f2-988973743629@ix.netcom.com> An HTML attachment was scrubbed... URL: From mark at kli.org Mon Oct 12 15:25:13 2020 From: mark at kli.org (Mark E. Shoulson) Date: Mon, 12 Oct 2020 16:25:13 -0400 Subject: Teletext separated mosaic graphics In-Reply-To: <71025709.bc7.175121915bf.Webtop.226@btinternet.com> References: <73694d96.bc1.175121648be.Webtop.226@btinternet.com> <71025709.bc7.175121915bf.Webtop.226@btinternet.com> Message-ID: An HTML attachment was scrubbed... 
URL: From Shawn.Steele at microsoft.com Mon Oct 12 15:54:53 2020 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 12 Oct 2020 20:54:53 +0000 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> Message-ID: I'm having trouble with the attempt to be this prescriptive. These make sense: "Use Unicode!"

* If possible, mandate use of UTF-8 without a BOM; diagnose the presence of a BOM in consumed text as an error, and produce text without a BOM.
* Alternatively, swallow the BOM if present.

After that the situation is clearly hopeless. Applications should Use Unicode, e.g. UTF-8, and clearly there are cases happening where that isn't happening. Trying to prescribe that negotiation should therefore happen, or that BOMs should be interpreted or whatever, is fairly meaningless at that point. Given that the higher-order guidance of "Use Unicode" has already been ignored, at this point it's garbage-in, garbage-out. Clearly the app/whatever is ignoring the "use unicode" guidance for some legacy reason. If they could adapt, it should be to use UTF-8. It *might* be helpful to say something about a BOM likely indicating UTF-8 text in otherwise unspecified data, but prescriptive stuff is pointless; it's legacy stuff that behaves in a legacy fashion for a reason, and saying they should have done it differently 20 years ago isn't going to help.
-Shawn From: Unicode On Behalf Of Tom Honermann via Unicode Sent: Monday, October 12, 2020 7:03 AM To: Alisdair Meredith Cc: sg16 at lists.isocpp.org; Unicode List Subject: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature Great, here is the change I'm making to address this: Protocol designers: * If possible, mandate use of UTF-8 without a BOM; diagnose the presence of a BOM in consumed text as an error, and produce text without a BOM. * Otherwise, if possible, mandate use of UTF-8 with or without a BOM; accept and discard a BOM in consumed text, and produce text without a BOM. * Otherwise, if possible, use UTF-8 as the default encoding with use of other encodings negotiated using information other than a BOM; accept and discard a BOM in consumed text, and produce text without a BOM. * Otherwise, require the presence of a BOM to differentiate UTF-8 encoded text in both consumed and produced text unless the absence of a BOM would result in the text being interpreted as an ASCII-based encoding and the UTF-8 text contains no non-ASCII characters (the exception is intended to avoid the addition of a BOM to ASCII text thus rendering such text as non-ASCII). This approach should be reserved for scenarios in which UTF-8 cannot be adopted as a default due to backward compatibility concerns. Tom. On 10/12/20 8:40 AM, Alisdair Meredith wrote: That addresses my main concern. Essentially, best practice (for UTF-8) would be no BOM unless the document contains code points that require multiple code units to express. AlisdairM On Oct 11, 2020, at 23:22, Tom Honermann > wrote: On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: One concern I have, that might lead into rationale for the current discouragement, is that I would hate to see a best practice that pushes a BOM into ASCII files. One of the nice properties of UTF-8 is that a valid ASCII file (still very common) is also a valid UTF-8 file. 
Changing best practice would encourage updating those files to be no longer ASCII. Thanks, Alisdair. I think that concern is implicitly addressed by the suggested resolutions, but perhaps that can be made more clear. One possibility would be to modify the "protocol designer" guidelines to address the case where a protocol's default encoding is ASCII based and to specify that a BOM is only required for UTF-8 text that contains non-ASCII characters. Would that be helpful? Tom. AlisdairM On Oct 10, 2020, at 14:54, Tom Honermann via SG16 > wrote: Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text. This is follow up to discussion on the Unicode mailing list back in June. Feedback is welcome. I plan to submit this to the UTC in a week or so pending review feedback. Tom. -- SG16 mailing list SG16 at lists.isocpp.org https://lists.isocpp.org/mailman/listinfo.cgi/sg16 -------------- next part -------------- An HTML attachment was scrubbed... URL: From kent.b.karlsson at bahnhof.se Mon Oct 12 17:38:30 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Tue, 13 Oct 2020 00:38:30 +0200 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <6615002f-7f7d-c1c6-fc25-4418f7af06ee@honermann.net> <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> Message-ID: <2940C362-AE6D-4959-A3BC-6C76E51DDA89@bahnhof.se> > 12 okt. 2020 kl. 06:28 skrev James Kass via Unicode : > 1. UTF-8 text consists only of ASCII characters. ???? > Even if some ASCII strings reference non-ASCII characters. It's the same idea as HTML numeric character references which point to non-ASCII characters while being composed of ASCII characters. 
It shouldn't matter whether a string of ASCII digits form the charcter number or a string of UTF-8 hex bytes form that number. That is a HUGE difference. If you are using character references, you rely upon a conversion of those references to the ?actual? characters in a target encoding. And it matters if you ?work on? (e.g. view, or let a program process the characters) the source (with character references) or the target (where the character references have been replaced by the characters they represent, if any). Further there are several different, and not freely mixable, ways of doing character references. HTML has its way, C++ (and many other programming languages) have another way (and they may differ slightly). So it depend on context how, and if, supposed character references are interpreted as the character referenced (if any). C++ style character references are not interpreted in HTML, and HTML style character references are not interpreted in C++ (as such). Without character references (or where they are not interpreted), you have one character encoding. With (possible) character references, you have a source character encoding and a target character encoding, that need not be the same; in addition to which syntax is used for the character references (and there are several different syntaxes). Any ?charset? declaration of a string (or file) would be for the source encoding of that string/file. > A Unicode-aware application will display the string as a special character while legacy applications will show the string as mojibake. Either way, UTF-8 remains an ASCII-preserving encoding format. What is a ?special character?? Any ?non-ASCII? one?? That could be seen as offensive... 
/Kent K From kent.b.karlsson at bahnhof.se Mon Oct 12 17:38:38 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Tue, 13 Oct 2020 00:38:38 +0200 Subject: Teletext separated mosaic graphics In-Reply-To: <1a68ffdb-1b40-e01a-d689-ff3fa20cf75c@ix.netcom.com> References: <1a68ffdb-1b40-e01a-d689-ff3fa20cf75c@ix.netcom.com> Message-ID: > On 11 Oct 2020, at 08:12, Asmus Freytag via Unicode wrote: > > The simple solution would be to try to contact the people behind some of these websites and simply asking how they do it. That might provide useful answers to whether any encoding solution will be taken up by implementers. As for the production of Teletext pages, my guess is that there will be no change in tooling or encoding for as long as Teletext pages are going to be produced (except for updates to use newly allocated Unicode characters). The production is, admittedly, waning, and may stop in a few years. At least for "news" pages; subtitling may go on much longer. What may be of more interest is archiving. Digital archives saving "old" data have faced, and are still facing, at least two issues. One is the physical media themselves. Tape (formats), diskette (formats), and tape readers, diskette readers, and now also more "modern" storage media are getting outdated; and keeping "old" data needs storage media transfer. Another is encoding; for text there has been a need to convert from various "old" encodings to "modern" ones; now often converting to a Unicode encoding. I don't know if anyone is actively trying to archive (in a future retrievable manner) Teletext pages. But if there are, they face a text encoding issue. Not just for the ordinary text, but also for the styling of the text. The storage formats (likely vendor specific) will likely go outdated; the "broadcast formats/protocols" (that we are discussing) are already quaint and incompatible with modern computers.
Saving the pages as HTML (including the linkage between pages) may be sufficient for quite some time. It is possible to convert Teletext pages to use an ECMA-48-based format (using some extensions); that would make the pages directly displayable on terminal emulators (assuming that the terminal emulator implements the extensions...). Teletext pages do, after all, have a "look" that is close to the "look" of terminal emulators... Or be displayable in text editors that are ECMA-48 enabled? This would be closer in concept to Teletext for the styling controls than HTML is. I have a suggestion for extensions to ECMA-48 styling that covers Teletext styling capabilities. But those are just suggestions from me; a proof of concept, and not widespread implementations. /Kent K > A./ > > On 10/10/2020 3:02 PM, Kent Karlsson via Unicode wrote: >> >> Here are a few more web sites showing Teletext pages from various European TV channels. >> THE LIST IS SURELY FAR FROM COMPLETE, it is just a sample. But it does show that Teletext >> is commonly displayed as web pages, not just via TV channels (whether "analog" or DVB). >> >> I haven't seen these combined with web versions of TV channels, but that would surely >> be possible to combine. That would be especially useful for optional subtitling, where >> Teletext is still much used, as a useful accessibility feature. >> >> I have no prediction of how long any channels will continue to produce Teletext content. >> But optional subtitling seems to "survive" longer. >> >> I do not know what source format(s) may be used, but it is surely not HTML *nor* close >> to the Teletext protocol. But see the Teletext page edit tool referenced below.
>> >> >> Spain, RTVE: >> https://www.rtve.es/television/teletexto/100/ >> >> Sweden, SVT: >> https://texttv.nu/ (also as iOS app, same name) >> https://www.svt.se/svttext/web/pages/100.html >> >> Iceland, RÚV: >> http://textavarp.is/sida/100 >> >> Denmark, DR: >> https://www.dr.dk/cgi-bin/fttx1.exe/100 >> >> Norway, NRK: >> https://www.nrk.no/tekst-tv/100/ >> >> Finland, YLE: >> https://yle.fi/aihe/tekstitv >> >> Switzerland, SRF: >> https://www.teletext.ch/ >> >> Croatia, HRT: >> https://teletekst.hrt.hr/ >> >> Greece: >> https://www.greektvidents.com/Teletext_ERTEXT.shtml >> >> And more; I have not done a complete survey! >> >> There are also several apps for iOS and for Android that display Teletext content from >> various (TV channel) providers. >> >> What the source format is for the Teletext pages as produced today, I don't know. But I would >> guess that it is likely "plain text" files, with Teletext specific markup, that is then converted to >> 1) Teletext analog format, 2) Teletext DVB format, 3) HTML. But that is just my guess. >> >> Again note that Teletext is still commonly used for optional subtitles. (DVB subtitles, a "bitmapped" >> format (i.e. the subtitles are sent as images, not text) does not seem to be used much. At least, I >> haven't seen it.) This requires timing, which is not part of the Teletext protocol, but must be in >> the source in order to control when a subtitle is output as Teletext for optional display. >> >> >> So >> "But a teletext application for a modern computer is not "normal use." It is reasonable >> for a non-standard application like this to interpret characters from U+0000 to U+001F >> as the corresponding ISO 646 characters would be in teletext." >> is very false. >> >> Further, the "object" overrides in the Teletext *protocol*, in several levels, "objects" >> prioritized depending on "implementation level", can specify: >> 1) Bold, italic, underline, proportional font.
>> 2) More colours (but only 16 levels per red/green/blue, no transparency though). >> 3) Character substitutions (likely replacing spaces) to be able to display characters from "G3". >> These cannot be handled by "retaining" the ill-designed control codes of Teletext anyway. >> >> >> Have an urge to edit your own Teletext pages? Here's the web page for doing just that: >> >> https://zxnet.co.uk/teletext/editor >> >> You can save your page in a handful of formats (plus as image). I haven't analyzed these formats, >> but presumably they are storage formats actually used for "real" Teletext pages that are converted >> to be transmitted ("analog" (outdated) or DVB) or given as web pages (HTML, but no "separated >> mosaic" characters, since they are not yet allocated in Unicode; could use small images though...). >> >> /Kent K >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kent.b.karlsson at bahnhof.se Mon Oct 12 17:40:05 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Tue, 13 Oct 2020 00:40:05 +0200 Subject: Teletext separated mosaic graphics In-Reply-To: References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <89A76137-A63E-4C91-835D-3FBB8126268F@bahnhof.se> <001e01d69aab$82a19140$87e4b3c0$@ewellic.org> <20201007212143.2d4f534a@JRWUBU2> <000001d69cec$0d4b1e50$27e15af0$@ewellic.org> Message-ID: > On 8 Oct 2020, at 00:25, Harriet Riddle via Unicode wrote: ... > However: although MARC 21, the standard defining character encodings for Library of Congress records, uses a subset of ISO 6630 with some extensions (in positions not used by ISO 6630) as its C1 set within MARC-8 (its 8-bit, somewhat ECMA-35-based encoding), it however uses ECMA-48 as its C1 within Unicode, which means that it resorts to using SOS and ST instead of NSB and NSE (marking up a range of characters to be ignored during collation but nonetheless displayed).
ECMA-48 on SOS and ST: "A control string is a string of bit combinations which may occur in the data stream as a logical entity for control purposes." What is implicit there, and not sufficiently clearly stated, is that the control string as such is not supposed to be displayed. It may be commands that do something, but is not intended to be text to be displayed as-is. /Kent K From d3ck0r at gmail.com Mon Oct 12 19:09:56 2020 From: d3ck0r at gmail.com (J Decker) Date: Mon, 12 Oct 2020 17:09:56 -0700 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> Message-ID: On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode < unicode at unicode.org> wrote: > On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: > > One concern I have, that might lead into rationale for the current > discouragement, > is that I would hate to see a best practice that pushes a BOM into ASCII > files. > One of the nice properties of UTF-8 is that a valid ASCII file (still very > common) is > also a valid UTF-8 file. Changing best practice would encourage updating > those > files to be no longer ASCII. > > Thanks, Alisdair. I think that concern is implicitly addressed by the > suggested resolutions, but perhaps that can be made more clear. One > possibility would be to modify the "protocol designer" guidelines to > address the case where a protocol's default encoding is ASCII based and to > specify that a BOM is only required for UTF-8 text that contains non-ASCII > characters. Would that be helpful? > 'and to specify that a BOM is only required for UTF-8 ' this should NEVER be 'required' or 'must', it shouldn't even be 'suggested'; fortunately BOM is just a ZWNBSP, so it's certainly a 'may' start with a such and such.
These days the standard 'everything IS utf-8' works really well, except in firefox where the charset is required to be specified for JS scripts (but that's a bug in that). EBCDIC should be converted on the edge to internal ASCII, since, thankfully, this is a niche application and everything thinks in ASCII or some derivative thereof. A byte order mark is irrelevant to utf-8 since bytes are ordered in the correct order. I have run into several editors that have insisted on emitting a BOM for UTF-8 when initially promoted from ASCII, but subsequently deleting it doesn't bother anything. I am curious though, what was the actual problem you ran into that makes you even consider this modification? J > Tom. > > > AlisdairM > > On Oct 10, 2020, at 14:54, Tom Honermann via SG16 > wrote: > > Attached is a draft proposal for the Unicode standard that intends to > clarify the current recommendation regarding use of a BOM in UTF-8 text. > This is follow up to discussion on the Unicode mailing list > back > in June. > > Feedback is welcome. I plan to submit > this to the UTC in a > week or so pending review feedback. > > Tom. > -- > SG16 mailing list > SG16 at lists.isocpp.org > https://lists.isocpp.org/mailman/listinfo.cgi/sg16 > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskasskrv at gmail.com Mon Oct 12 19:39:33 2020 From: jameskasskrv at gmail.com (James Kass) Date: Tue, 13 Oct 2020 00:39:33 +0000 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <2940C362-AE6D-4959-A3BC-6C76E51DDA89@bahnhof.se> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <6615002f-7f7d-c1c6-fc25-4418f7af06ee@honermann.net> <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> <2940C362-AE6D-4959-A3BC-6C76E51DDA89@bahnhof.se> Message-ID: On double-checking it turns out that these aren't (upper-) ASCII strings after all.
They're just ANSI strings. Please see attached graphic. My bad, sorry for the confusion (and the two typos) in my earlier post. I was trying to point out the similarity between the hex byte strings used in UTF-8 and hex byte strings used in HTML NCRs to point to a character's USV. That similarity exists no matter how poorly my assertion was phrased. Perhaps it would have been better to point out the similarity between surrogate pairs and UTF-8. Think of UTF-8 as being surrogate pairs (or trios or quadruplets or whatever) which point to a Unicode character. A system substitutes a Unicode value for the UTF-8 hex byte string before further processing can occur. -------------- next part -------------- A non-text attachment was scrubbed... Name: 20201012_Capture.jpg Type: image/jpeg Size: 98051 bytes Desc: not available URL: From jameskass at code2001.com Mon Oct 12 20:54:24 2020 From: jameskass at code2001.com (James Kass) Date: Tue, 13 Oct 2020 01:54:24 +0000 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <2940C362-AE6D-4959-A3BC-6C76E51DDA89@bahnhof.se> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <6615002f-7f7d-c1c6-fc25-4418f7af06ee@honermann.net> <40ba9ef0-6618-24f0-f3e6-fb06c0dbf753@code2001.com> <2940C362-AE6D-4959-A3BC-6C76E51DDA89@bahnhof.se> Message-ID: On 2020-10-12 10:38 PM, Kent Karlsson via Unicode wrote: > What is a "special character"? Any "non-ASCII" one? That could be seen as offensive... It seemed kinder to call them "special" instead of "7-bit challenged", but you take the point for this observation.
(smiles) From haberg-1 at telia.com Tue Oct 13 03:57:22 2020 From: haberg-1 at telia.com (Hans Åberg) Date: Tue, 13 Oct 2020 10:57:22 +0200 Subject: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> Message-ID: It would be best if it were stated that its use is a type of metadata, and as such, Unicode has no opinion on its use. > On 10 Oct 2020, at 20:54, Tom Honermann via Unicode wrote: > > Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text. This is follow up to discussion on the Unicode mailing list back in June. > > Feedback is welcome. I plan to submit this to the UTC in a week or so pending review feedback. > > Tom. > > From wjgo_10009 at btinternet.com Tue Oct 13 09:30:56 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 13 Oct 2020 15:30:56 +0100 (BST) Subject: Teletext separated mosaic graphics In-Reply-To: <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> References: , <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> Message-ID: <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> I am now thinking that the best solution for encoding the teletext control characters using just already existing Unicode characters is to use the Escape format listed in the PDF document linked from the post by Harriet Riddle. https://www.itscj.ipsj.or.jp/iso-ir/056.pdf https://corp.unicode.org/pipermail/unicode/2020-October/009048.html This appears to be what is used in the export format named viewdata from the editor that Kent Karlsson mentioned.
https://zxnet.co.uk/teletext/editor https://corp.unicode.org/pipermail/unicode/2020-October/009071.html If one then uses a specially made OpenType font, one can arrange for each such two-character escape sequence to be displayed as one of the glyph designs that I mentioned in the following post, by using the OpenType liga facility. https://corp.unicode.org/pipermail/unicode/2020-October/009047.html > For example, Alphanumerics Green would have a visible glyph of an A > above a G on a pale. This morning I tried making a test font with a visible glyph for the Escape character and a liga glyph substitution for Escape followed by capital A. I made the font using the High-Logic FontCreator program and tested it in the Serif Affinity Publisher program, producing a PDF document. I was hoping to be able to paste a copy of the substituted glyph copied from the PDF to WordPad and recover the underlying two-character sequence. However I could only seem to get the capital A back. Maybe I did not get the technique quite right and so it might perhaps be possible to get the underlying sequence back from a PDF, but that requires further investigation. William Overington Tuesday 13 October 2020 From jameskass at code2001.com Tue Oct 13 11:00:40 2020 From: jameskass at code2001.com (James Kass) Date: Tue, 13 Oct 2020 16:00:40 +0000 Subject: Teletext separated mosaic graphics In-Reply-To: <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> Message-ID: On 2020-10-13 2:30 PM, William_J_G Overington via Unicode wrote: > This morning I tried making a test font with a visible glyph for the > Escape character and a liga glyph substitution for Escape followed by > capital A.
> > I made the font using the High-Logic FontCreator program and tested it > in the Serif Affinity Publisher program, producing a PDF document. > > I was hoping to be able to paste a copy of the substituted glyph > copied from the PDF to WordPad and recover the underlying two > character sequence. However I could only seem to get the capital A > back. Maybe I did not get the technique quite right and so it might > perhaps be possible to get the underlying sequence back from a PDF, > but that requires further investigation. This may have something to do with the PostScript names in the font, or it might depend on which program is being used to create the PDF. https://forum.affinity.serif.com/index.php?/topic/95991-text-not-rendering-properly-with-copypaste-from-pdf/ The PostScript names in the font should be in the standard format. Besides the "A" glyph, both the Escape visible glyph and the output "ligature" glyph should have valid PostScript names. You might try "uni0041" for "A" and I think "Escape" might work as "uni001B". So the output glyph should have the PostScript name "uni0041001B". From jameskass at code2001.com Tue Oct 13 11:10:28 2020 From: jameskass at code2001.com (James Kass) Date: Tue, 13 Oct 2020 16:10:28 +0000 Subject: Teletext separated mosaic graphics In-Reply-To: References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> Message-ID: On 2020-10-13 4:00 PM, James Kass via Unicode wrote: > So the output glyph should have the PostScript name "uni0041001B". That's backwards, should be "uni001B0041".
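[Editorial note: the "uniXXXX..." glyph names discussed above follow Adobe's glyph-naming convention, which is what allows a PDF text extractor to map a ligature glyph back to the character sequence it represents. As a sketch of that mapping (the helper function below is hypothetical, written only for illustration, not part of any of the tools mentioned):]

```python
def chars_from_glyph_name(name):
    """Decode an Adobe-convention 'uni' glyph name ('uni' followed by
    one or more 4-digit hex groups) into the characters it represents.
    Returns None if the name does not follow the convention."""
    if not name.startswith("uni"):
        return None
    hex_part = name[3:]
    if len(hex_part) == 0 or len(hex_part) % 4 != 0:
        return None
    try:
        # Each 4-digit hex group names one code point.
        return "".join(chr(int(hex_part[i:i + 4], 16))
                       for i in range(0, len(hex_part), 4))
    except ValueError:
        return None

# The corrected ligature glyph name maps back to ESC followed by 'A':
print(repr(chars_from_glyph_name("uni001B0041")))  # '\x1bA'
```

[A glyph named this way gives the extractor a chance to recover both characters; a glyph with a non-conforming name typically yields only whatever the PDF's ToUnicode map records, which may explain getting just the "A" back.]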
From kent.b.karlsson at bahnhof.se Tue Oct 13 11:34:29 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Tue, 13 Oct 2020 18:34:29 +0200 Subject: Teletext separated mosaic graphics In-Reply-To: <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> Message-ID: > 13 okt. 2020 kl. 16:30 skrev William_J_G Overington via Unicode : > > I am now thinking that the best solution for encoding the teletext control characters using just already existing Unicode characters is to use the Escape format listed in the PDF document linked from the post by Harriet Riddle. That is out of the question for several reasons. 1) ECMA-48 specifies such escape sequences as aliases (formally for 7-bit encodings, but in practice not limited that way) for the ECMA-48 C1 control codes. This suggestion is thus incompatible with ECMA-48. (And promoting anything else is a bad idea, even though compatibility with ECMA-48 is not required by Unicode/10646.) 2) That ?solution? does not in any way remove the gross ill-designedness of the Teletext ?control? codes (most of them do three things in one go: colour change, code page change, display as SPACE or as ?kept? ?mosaic? character). 3) That ?solution? still cannot handle the ?object? format overrides (more colors, bold, Italics, underline, proportional font [and G3 character substitutions, but that falls under encoding conversion, not under styling]) in Teletext (a horrendous idea, the only excuse for which is compatibility with the original Teletext ?controls? which are left untouched in ?advanced? Teletext). The ?object? overrides are in a control section of the Teletext protocol. 
/Kent Karlsson > https://www.itscj.ipsj.or.jp/iso-ir/056.pdf > > https://corp.unicode.org/pipermail/unicode/2020-October/009048.html > > This appears to be what is used in the export format named viewdata from the editor that Kent Karlsson mentioned. > > https://zxnet.co.uk/teletext/editor > > https://corp.unicode.org/pipermail/unicode/2020-October/009071.html > > If one then uses a specially made OpenType font, one can arrange for each such two character escape sequence to be displayed as one of the glyph designs that I mentioned in the following post, by using the OpenType liga facility.. > > https://corp.unicode.org/pipermail/unicode/2020-October/009047.html > >> For example, Alphanumerics Green would have a visible glyph of an A above a G on a pale. > > This morning I tried making a test font with a visible glyph for the Escape character and a liga glyph substitution for Escape followed by capital A. > > I made the font using the High-Logic FontCreator program and tested it in the Serif Affinity Publisher program, producing a PDF document. > > I was hoping to be able to paste a copy of the substituted glyph copied from the PDF to WordPad and recover the underlying two character sequence. However I could only seem to get the capital A back. Maybe I did not get the technique quite right and so it might perhaps be possible to get the underlying sequence back from a PDF, but that requires further investigation. 
> > William Overington > > Tuesday 13 October 2020 > > > > > > From wjgo_10009 at btinternet.com Tue Oct 13 12:06:09 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 13 Oct 2020 18:06:09 +0100 (BST) Subject: Teletext separated mosaic graphics In-Reply-To: References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> Message-ID: <11b9e827.131c.17522ed2bfe.Webtop.224@btinternet.com> Kent Karlsson wrote: > That is out of the question for several reasons. So would you support the idea of having new characters encoded in plane 14 as I suggested in the following post? https://corp.unicode.org/pipermail/unicode/2020-October/009047.html >> I opine that these twenty-seven codes could be encoded within a block >> of thirty-two code points as characters that display as visual glyphs in most circumstances, yet are control codes in teletext apps. >> For example, Alphanumerics Green would have a visible glyph of an A above a G on a pale. Does anybody support that idea please? If enough of us each send an individual request to The Unicode Technical Committee then maybe progress can be made. William Overington Tuesday 13 October 2020 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kent.b.karlsson at bahnhof.se Tue Oct 13 12:21:35 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Tue, 13 Oct 2020 19:21:35 +0200 Subject: Teletext separated mosaic graphics In-Reply-To: <11b9e827.131c.17522ed2bfe.Webtop.224@btinternet.com> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> <11b9e827.131c.17522ed2bfe.Webtop.224@btinternet.com> Message-ID: <70799911-C8E9-41DC-8E8A-7F39DDB891FA@bahnhof.se> > On 13 Oct 2020, at 19:06, William_J_G Overington via Unicode wrote: > > Kent Karlsson wrote: > > > That is out of the question for several reasons. > > So would you support the idea of having new characters encoded in plane 14 as I suggested in the following post? Certainly not. > https://corp.unicode.org/pipermail/unicode/2020-October/009047.html > > >> I opine that these twenty-seven codes could be encoded within a block of > thirty-two code points as characters that display as visual glyphs in > most circumstances, yet are control codes in teletext apps. > > >> For example, Alphanumerics Green would have a visible glyph of an A > above a G on a pale. > > Does anybody support that idea please? Definitely not. /Kent K > If enough of us each send an individual request to The Unicode Technical Committee then maybe progress can be made. > > William Overington > > Tuesday 13 October 2020 > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From asmusf at ix.netcom.com Tue Oct 13 12:26:34 2020 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Tue, 13 Oct 2020 10:26:34 -0700 Subject: Teletext separated mosaic graphics In-Reply-To: References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> <547afa9d-fc6a-7fd3-6d74-591b4819eb6b@ix.netcom.com> Message-ID: <89532dfb-ecac-1291-a8c5-690a95261653@ix.netcom.com> On 10/13/2020 10:00 AM, Kent Karlsson wrote: > > Well, the short answer is of course: HTML. Details may vary. Right, it's those details, A./ > > /Kent K > > >> On 13 Oct 2020, at 18:45, Asmus Freytag > > wrote: >> >> How do existing websites represent teletext? >> A./ >> >> >> On 10/13/2020 9:34 AM, Kent Karlsson via Unicode wrote: >>>> On 13 Oct 2020, at 16:30, William_J_G Overington via Unicode wrote: >>>> >>>> I am now thinking that the best solution for encoding the teletext control characters using just already existing Unicode characters is to use the Escape format listed in the PDF document linked from the post by Harriet Riddle. >>> That is out of the question for several reasons. >>> >>> 1) ECMA-48 specifies such escape sequences as aliases (formally for 7-bit encodings, but in practice not limited that way) for the ECMA-48 C1 control codes. This suggestion is thus incompatible with ECMA-48. (And promoting anything else is a bad idea, even though compatibility with ECMA-48 is not required by Unicode/10646.) >>> >>> 2) That "solution" does not in any way remove the gross ill-designedness of the Teletext "control" codes (most of them do three things in one go: colour change, code page change, display as SPACE or as "kept" "mosaic" character). >>> >>> 3) That "solution" still cannot handle the "object"
format overrides (more colors, bold, Italics, underline, proportional font [and G3 character substitutions, but that falls under encoding conversion, not under styling]) in Teletext (a horrendous idea, the only excuse for which is compatibility with the original Teletext "controls" which are left untouched in "advanced" Teletext). The "object" overrides are in a control section of the Teletext protocol. >>> >>> /Kent Karlsson >>> >>> >>>> https://www.itscj.ipsj.or.jp/iso-ir/056.pdf >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009048.html >>>> >>>> This appears to be what is used in the export format named viewdata from the editor that Kent Karlsson mentioned. >>>> >>>> https://zxnet.co.uk/teletext/editor >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009071.html >>>> >>>> If one then uses a specially made OpenType font, one can arrange for each such two character escape sequence to be displayed as one of the glyph designs that I mentioned in the following post, by using the OpenType liga facility. >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009047.html >>>> >>>>> For example, Alphanumerics Green would have a visible glyph of an A above a G on a pale. >>>> This morning I tried making a test font with a visible glyph for the Escape character and a liga glyph substitution for Escape followed by capital A. >>>> >>>> I made the font using the High-Logic FontCreator program and tested it in the Serif Affinity Publisher program, producing a PDF document. >>>> >>>> I was hoping to be able to paste a copy of the substituted glyph copied from the PDF to WordPad and recover the underlying two character sequence. However I could only seem to get the capital A back. Maybe I did not get the technique quite right and so it might perhaps be possible to get the underlying sequence back from a PDF, but that requires further investigation.
>>>> >>>> William Overington >>>> >>>> Tuesday 13 October 2020 >>>> >>>> >>>> >>>> >>>> >>>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at honermann.net Tue Oct 13 12:45:48 2020 From: tom at honermann.net (Tom Honermann) Date: Tue, 13 Oct 2020 13:45:48 -0400 Subject: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> Message-ID: <624522d5-67e2-8199-109c-c8c3bfb3d180@honermann.net> On 10/13/20 4:57 AM, Hans ?berg wrote: > It would be best if stated that its use is a type of metadata, and such, Unicode has no opinion on its use. I'm interpreting that as an endorsement for the first suggested resolution in the paper. Tom. > > >> On 10 Oct 2020, at 20:54, Tom Honermann via Unicode wrote: >> >> Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text. This is follow up to discussion on the Unicode mailing list back in June. >> >> Feedback is welcome. I plan to submit this to the UTC in a week or so pending review feedback. >> >> Tom. >> >> From haberg-1 at telia.com Tue Oct 13 13:32:26 2020 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Tue, 13 Oct 2020 20:32:26 +0200 Subject: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <624522d5-67e2-8199-109c-c8c3bfb3d180@honermann.net> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <624522d5-67e2-8199-109c-c8c3bfb3d180@honermann.net> Message-ID: <4377AAF3-16D0-4051-BAFB-D90D3F9C9BCD@telia.com> There is only a U+FEFF ZERO WIDTH NO-BREAK SPACE, and if somebody wants to use it to mean something else, that is something Unicode should not worry about. 
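The signature role under discussion falls out of the encoding forms themselves: U+FEFF encodes to a different byte sequence in each Unicode encoding form, so the leading bytes of a stream can hint at the encoding in use. A minimal sketch (the `detect_signature` helper and its return values are illustrative, not part of any proposal in this thread):

```python
# U+FEFF encodes differently in each Unicode encoding form, which is
# what allows a leading U+FEFF to double as an encoding signature.
SIGNATURES = {
    "utf-8": b"\xef\xbb\xbf",
    "utf-16-be": b"\xfe\xff",
    "utf-16-le": b"\xff\xfe",
}

def detect_signature(data: bytes):
    """Return the encoding suggested by a leading U+FEFF, or None.

    UTF-32-LE (FF FE 00 00) is deliberately omitted here: it shares a
    prefix with UTF-16-LE and needs a longer look-ahead to distinguish.
    """
    # Check longer signatures first so the 3-byte UTF-8 mark is not
    # shadowed by a 2-byte one.
    for enc, sig in sorted(SIGNATURES.items(), key=lambda kv: -len(kv[1])):
        if data.startswith(sig):
            return enc
    return None
```

Absent such a signature, the same leading bytes are simply the first character of the text, which is why the guidance above treats the BOM as metadata rather than content.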
> On 13 Oct 2020, at 19:45, Tom Honermann wrote: > > On 10/13/20 4:57 AM, Hans Åberg wrote: >> It would be best if stated that its use is a type of metadata, and such, Unicode has no opinion on its use. > > I'm interpreting that as an endorsement for the first suggested resolution in the paper. > > Tom. > >> >> >>> On 10 Oct 2020, at 20:54, Tom Honermann via Unicode wrote: >>> >>> Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text. This is follow up to discussion on the Unicode mailing list back in June. >>> >>> Feedback is welcome. I plan to submit this to the UTC in a week or so pending review feedback. >>> >>> Tom. >>> >>> From kent.b.karlsson at bahnhof.se Tue Oct 13 13:44:18 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Tue, 13 Oct 2020 20:44:18 +0200 Subject: Fwd: Teletext separated mosaic graphics References: <5CC9C8F2-7162-4D08-A832-088DAC5FE163@bahnhof.se> Message-ID: <2F5B9D00-ECA5-4A90-BB12-09C976A94214@bahnhof.se> > Vidarebefordrat brev: > > Från: Kent Karlsson > Ämne: Re: Teletext separated mosaic graphics > Datum: 13 oktober 2020 19:31:16 CEST > Till: Asmus Freytag > > Just to give one example, an HTML(5) code snippet from one of those Teletext sites: > > > > > > SVT Text TV > > …. > …. > > Stortingsledamöters e-post hackades >

Norge: Ryssland låg bakom cyberattack

> 136 > > Flydde från polisen - > dog efter balkongfall > 118 > >

Ronaldo har testats positivt för covid

> Fixstjärnan missar matchen mot Sverige > 300 > …. > > > > Note that "triple-digits" are (usually) converted to a link to the referenced Teletext page (Teletext pages have triple digit page numbers, starting at 100). > > > > /Kent K > > >> 13 okt. 2020 kl. 18:45 skrev Asmus Freytag : >> >> How do existing websites represent teletext? >> A./ >> >> >> On 10/13/2020 9:34 AM, Kent Karlsson via Unicode wrote: >>> >>>> 13 okt. 2020 kl. 16:30 skrev William_J_G Overington via Unicode >>>> : >>>> >>>> I am now thinking that the best solution for encoding the teletext control characters using just already existing Unicode characters is to use the Escape format listed in the PDF document linked from the post by Harriet Riddle. >>>> >>> That is out of the question for several reasons. >>> >>> 1) ECMA-48 specifies such escape sequences as aliases (formally for 7-bit encodings, but in practice not limited that way) for the ECMA-48 C1 control codes. This suggestion is thus incompatible with ECMA-48. (And promoting anything else is a bad idea, even though compatibility with ECMA-48 is not required by Unicode/10646.) >>> >>> 2) That "solution" does not in any way remove the gross ill-designedness of the Teletext "control" codes (most of them do three things in one go: colour change, code page change, display as SPACE or as "kept" "mosaic" character). >>> >>> 3) That "solution" still cannot handle the "object" format overrides (more colors, bold, Italics, underline, proportional font [and G3 character substitutions, but that falls under encoding conversion, not under styling]) in Teletext (a horrendous idea, the only excuse for which is compatibility with the original Teletext "controls" which are left untouched in "advanced" Teletext). The "object" overrides are in a control section of the Teletext protocol.
>>> >>> /Kent Karlsson >>> >>> >>> >>>> https://www.itscj.ipsj.or.jp/iso-ir/056.pdf >>>> >>>> >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009048.html >>>> >>>> >>>> This appears to be what is used in the export format named viewdata from the editor that Kent Karlsson mentioned. >>>> >>>> >>>> https://zxnet.co.uk/teletext/editor >>>> >>>> >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009071.html >>>> >>>> >>>> If one then uses a specially made OpenType font, one can arrange for each such two character escape sequence to be displayed as one of the glyph designs that I mentioned in the following post, by using the OpenType liga facility.. >>>> >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009047.html >>>> >>>> >>>> >>>>> For example, Alphanumerics Green would have a visible glyph of an A above a G on a pale. >>>>> >>>> This morning I tried making a test font with a visible glyph for the Escape character and a liga glyph substitution for Escape followed by capital A. >>>> >>>> I made the font using the High-Logic FontCreator program and tested it in the Serif Affinity Publisher program, producing a PDF document. >>>> >>>> I was hoping to be able to paste a copy of the substituted glyph copied from the PDF to WordPad and recover the underlying two character sequence. However I could only seem to get the capital A back. Maybe I did not get the technique quite right and so it might perhaps be possible to get the underlying sequence back from a PDF, but that requires further investigation. >>>> >>>> William Overington >>>> >>>> Tuesday 13 October 2020 >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kent.b.karlsson at bahnhof.se Tue Oct 13 14:04:37 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Tue, 13 Oct 2020 21:04:37 +0200 Subject: Teletext separated mosaic graphics In-Reply-To: <89532dfb-ecac-1291-a8c5-690a95261653@ix.netcom.com> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> <547afa9d-fc6a-7fd3-6d74-591b4819eb6b@ix.netcom.com> <89532dfb-ecac-1291-a8c5-690a95261653@ix.netcom.com> Message-ID: Here is a snippet from another one, just to give a different example (of course not the whole page source, just a snippet): … … …. and then using Javascript code to retrieve a generated image (faithful) and "just the text" (not faithful formatting)… So a completely different approach to the other example, where the text was generated (fairly) faithfully (compared to on TV). So this approach does not allow for links in the faithfully generated page. /Kent K > 13 okt. 2020 kl. 19:26 skrev Asmus Freytag (c) : > > On 10/13/2020 10:00 AM, Kent Karlsson wrote: >> >> Well, the short answer is of course: HTML. Details may vary. > Right, it's those details, > > A./ >> >> /Kent K >> >> >>> 13 okt. 2020 kl. 18:45 skrev Asmus Freytag : >>> >>> How do existing websites represent teletext? >>> A./ >>> >>> >>> On 10/13/2020 9:34 AM, Kent Karlsson via Unicode wrote: >>>>> 13 okt. 2020 kl. 16:30 skrev William_J_G Overington via Unicode >>>>> : >>>>> >>>>> I am now thinking that the best solution for encoding the teletext control characters using just already existing Unicode characters is to use the Escape format listed in the PDF document linked from the post by Harriet Riddle. >>>>> >>>> That is out of the question for several reasons.
>>>> >>>> 1) ECMA-48 specifies such escape sequences as aliases (formally for 7-bit encodings, but in practice not limited that way) for the ECMA-48 C1 control codes. This suggestion is thus incompatible with ECMA-48. (And promoting anything else is a bad idea, even though compatibility with ECMA-48 is not required by Unicode/10646.) >>>> >>>> 2) That ?solution? does not in any way remove the gross ill-designedness of the Teletext ?control? codes (most of them do three things in one go: colour change, code page change, display as SPACE or as ?kept? ?mosaic? character). >>>> >>>> 3) That ?solution? still cannot handle the ?object? format overrides (more colors, bold, Italics, underline, proportional font [and G3 character substitutions, but that falls under encoding conversion, not under styling]) in Teletext (a horrendous idea, the only excuse for which is compatibility with the original Teletext ?controls? which are left untouched in ?advanced? Teletext). The ?object? overrides are in a control section of the Teletext protocol. >>>> >>>> /Kent Karlsson >>>> >>>> >>>> >>>>> https://www.itscj.ipsj.or.jp/iso-ir/056.pdf >>>>> >>>>> >>>>> >>>>> https://corp.unicode.org/pipermail/unicode/2020-October/009048.html >>>>> >>>>> >>>>> This appears to be what is used in the export format named viewdata from the editor that Kent Karlsson mentioned. >>>>> >>>>> >>>>> https://zxnet.co.uk/teletext/editor >>>>> >>>>> >>>>> >>>>> https://corp.unicode.org/pipermail/unicode/2020-October/009071.html >>>>> >>>>> >>>>> If one then uses a specially made OpenType font, one can arrange for each such two character escape sequence to be displayed as one of the glyph designs that I mentioned in the following post, by using the OpenType liga facility.. >>>>> >>>>> >>>>> https://corp.unicode.org/pipermail/unicode/2020-October/009047.html >>>>> >>>>> >>>>> >>>>>> For example, Alphanumerics Green would have a visible glyph of an A above a G on a pale. 
>>>>>> >>>>> This morning I tried making a test font with a visible glyph for the Escape character and a liga glyph substitution for Escape followed by capital A. >>>>> >>>>> I made the font using the High-Logic FontCreator program and tested it in the Serif Affinity Publisher program, producing a PDF document. >>>>> >>>>> I was hoping to be able to paste a copy of the substituted glyph copied from the PDF to WordPad and recover the underlying two character sequence. However I could only seem to get the capital A back. Maybe I did not get the technique quite right and so it might perhaps be possible to get the underlying sequence back from a PDF, but that requires further investigation. >>>>> >>>>> William Overington >>>>> >>>>> Tuesday 13 October 2020 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>> >> > From kent.b.karlsson at bahnhof.se Tue Oct 13 14:05:05 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Tue, 13 Oct 2020 21:05:05 +0200 Subject: Teletext separated mosaic graphics In-Reply-To: <5CC9C8F2-7162-4D08-A832-088DAC5FE163@bahnhof.se> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> <547afa9d-fc6a-7fd3-6d74-591b4819eb6b@ix.netcom.com> <5CC9C8F2-7162-4D08-A832-088DAC5FE163@bahnhof.se> Message-ID: A final example (semi-faithful text, not an image): Teletexto El Tiempo - 301 | RTVE.es Predicción hoy/mañana y mapa 302 a 304
  • El Tiempo en España por CC.AA305
  • ?? > 13 okt. 2020 kl. 19:31 skrev Kent Karlsson : > Just to give one example, an HTML(5) code snippet from one of those Teletext sites: > > > > > > SVT Text TV > > …. > …. > > Stortingsledamöters e-post hackades >

    Norge: Ryssland låg bakom cyberattack

    > 136 > > Flydde från polisen - > dog efter balkongfall > 118 > >

    Ronaldo har testats positivt för covid

    > Fixstj?rnan missar matchen mot Sverige > 300 > ?. > > > > Note that ?triple-digits? are (usually) converted to a link to the referenced Teletext page (Teletext pages have triple digit page numbers, starting at 100). > > > > /Kent K > > >> 13 okt. 2020 kl. 18:45 skrev Asmus Freytag : >> >> How do existing websites represent teletext? >> A./ >> >> >> On 10/13/2020 9:34 AM, Kent Karlsson via Unicode wrote: >>> >>>> 13 okt. 2020 kl. 16:30 skrev William_J_G Overington via Unicode >>>> : >>>> >>>> I am now thinking that the best solution for encoding the teletext control characters using just already existing Unicode characters is to use the Escape format listed in the PDF document linked from the post by Harriet Riddle. >>>> >>> That is out of the question for several reasons. >>> >>> 1) ECMA-48 specifies such escape sequences as aliases (formally for 7-bit encodings, but in practice not limited that way) for the ECMA-48 C1 control codes. This suggestion is thus incompatible with ECMA-48. (And promoting anything else is a bad idea, even though compatibility with ECMA-48 is not required by Unicode/10646.) >>> >>> 2) That ?solution? does not in any way remove the gross ill-designedness of the Teletext ?control? codes (most of them do three things in one go: colour change, code page change, display as SPACE or as ?kept? ?mosaic? character). >>> >>> 3) That ?solution? still cannot handle the ?object? format overrides (more colors, bold, Italics, underline, proportional font [and G3 character substitutions, but that falls under encoding conversion, not under styling]) in Teletext (a horrendous idea, the only excuse for which is compatibility with the original Teletext ?controls? which are left untouched in ?advanced? Teletext). The ?object? overrides are in a control section of the Teletext protocol. 
>>> >>> /Kent Karlsson >>> >>> >>> >>>> https://www.itscj.ipsj.or.jp/iso-ir/056.pdf >>>> >>>> >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009048.html >>>> >>>> >>>> This appears to be what is used in the export format named viewdata from the editor that Kent Karlsson mentioned. >>>> >>>> >>>> https://zxnet.co.uk/teletext/editor >>>> >>>> >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009071.html >>>> >>>> >>>> If one then uses a specially made OpenType font, one can arrange for each such two character escape sequence to be displayed as one of the glyph designs that I mentioned in the following post, by using the OpenType liga facility.. >>>> >>>> >>>> https://corp.unicode.org/pipermail/unicode/2020-October/009047.html >>>> >>>> >>>> >>>>> For example, Alphanumerics Green would have a visible glyph of an A above a G on a pale. >>>>> >>>> This morning I tried making a test font with a visible glyph for the Escape character and a liga glyph substitution for Escape followed by capital A. >>>> >>>> I made the font using the High-Logic FontCreator program and tested it in the Serif Affinity Publisher program, producing a PDF document. >>>> >>>> I was hoping to be able to paste a copy of the substituted glyph copied from the PDF to WordPad and recover the underlying two character sequence. However I could only seem to get the capital A back. Maybe I did not get the technique quite right and so it might perhaps be possible to get the underlying sequence back from a PDF, but that requires further investigation. 
>>>> >>>> William Overington >>>> >>>> Tuesday 13 October 2020 >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>> >> > From tom at honermann.net Tue Oct 13 14:07:35 2020 From: tom at honermann.net (Tom Honermann) Date: Tue, 13 Oct 2020 15:07:35 -0400 Subject: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <4377AAF3-16D0-4051-BAFB-D90D3F9C9BCD@telia.com> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <624522d5-67e2-8199-109c-c8c3bfb3d180@honermann.net> <4377AAF3-16D0-4051-BAFB-D90D3F9C9BCD@telia.com> Message-ID: On 10/13/20 2:32 PM, Hans Åberg wrote: > There is only a U+FEFF ZERO WIDTH NO-BREAK SPACE, and if somebody wants to use it to mean something else, that is something Unicode should not worry about. The Unicode standard already discusses use of U+FEFF as both a BOM and as an encoding signature. If you feel that it should not do so, I think that would be a separate proposal. Tom. > > >> On 13 Oct 2020, at 19:45, Tom Honermann wrote: >> >> On 10/13/20 4:57 AM, Hans Åberg wrote: >>> It would be best if stated that its use is a type of metadata, and such, Unicode has no opinion on its use. >> I'm interpreting that as an endorsement for the first suggested resolution in the paper. >> >> Tom. >> >>> >>>> On 10 Oct 2020, at 20:54, Tom Honermann via Unicode wrote: >>>> >>>> Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text. This is follow up to discussion on the Unicode mailing list back in June. >>>> >>>> Feedback is welcome. I plan to submit this to the UTC in a week or so pending review feedback. >>>> >>>> Tom.
>>>> >>>> >> From haberg-1 at telia.com Tue Oct 13 14:13:45 2020 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Tue, 13 Oct 2020 21:13:45 +0200 Subject: Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <624522d5-67e2-8199-109c-c8c3bfb3d180@honermann.net> <4377AAF3-16D0-4051-BAFB-D90D3F9C9BCD@telia.com> Message-ID: <2832BA11-4857-4253-A2CA-3CC6BD3B2AFA@telia.com> > On 13 Oct 2020, at 21:07, Tom Honermann wrote: > > On 10/13/20 2:32 PM, Hans Åberg wrote: >> There is only a U+FEFF ZERO WIDTH NO-BREAK SPACE, and if somebody wants to use it to mean something else, that is something Unicode should not worry about. > > The Unicode standard already discusses use of U+FEFF as both a BOM and as an encoding signature. If you feel that it should not do so, I think that would be a separate proposal. It is a good question why they write so much about it. It belongs to a protocol on a higher level. From tom at honermann.net Tue Oct 13 15:04:14 2020 From: tom at honermann.net (Tom Honermann) Date: Tue, 13 Oct 2020 16:04:14 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> Message-ID: <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> On 10/12/20 4:54 PM, Shawn Steele wrote: > > I'm having trouble with the attempt to be this prescriptive. > > These make sense: "Use Unicode!" > > * If possible, mandate use of UTF-8 without a BOM; diagnose the > presence of a BOM in consumed text as an error, and produce text > without a BOM. > * Alternatively, swallow the BOM if present. > > After that the situation is clearly hopeless.
Applications should Use > Unicode, eg: UTF-8, and clearly there are cases happening where that > isn't happening. Trying to prescribe that negotiation should > therefore happen, or that BOMs should be interpreted or whatever is > fairly meaningless at that point. Given that the higher-order > guidance of "Use Unicode" has already been ignored, at this point it's > garbage-in, garbage-out. Clearly the app/whatever is ignoring the > "use unicode" guidance for some legacy reason. If they could adapt, > it should be to use UTF-8. It **might** be helpful to say something > about a BOM likely indicating UTF-8 text in otherwise unspecified > data, but prescriptive stuff is pointless, it's legacy stuff that > behaves in a legacy fashion for a reason and saying they should have > done it differently 20 years ago isn't going to help … > There are applications that, for legacy reasons, are unable to change their default encoding to UTF-8, but that also need to handle UTF-8 text. It is not clear to me that such situations are hopeless or that they cannot be improved. The prescription offered follows what you suggest. The first three cases are all of the "use Unicode!" variety. The distinction between the third and the fourth is to relegate use of a BOM as an encoding signature to the last resort option. The intent is to make it clear, with stronger motivation than is currently present in the Unicode standard, that use of a BOM in UTF-8 is not a best practice today. Tom.
> -Shawn > > *From:* Unicode *On Behalf Of *Tom > Honermann via Unicode > *Sent:* Monday, October 12, 2020 7:03 AM > *To:* Alisdair Meredith > *Cc:* sg16 at lists.isocpp.org; Unicode List > *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of a > BOM as a UTF-8 encoding signature > > Great, here is the change I'm making to address this: > > Protocol designers: > > * If possible, mandate use of UTF-8 without a BOM; diagnose the > presence of a BOM in consumed text as an error, and produce > text without a BOM. > * Otherwise, if possible, mandate use of UTF-8 with or without a > BOM; accept and discard a BOM in consumed text, and produce > text without a BOM. > * Otherwise, if possible, use UTF-8 as the default encoding with > use of other encodings negotiated using information other than > a BOM; accept and discard a BOM in consumed text, and produce > text without a BOM. > * Otherwise, require the presence of a BOM to differentiate > UTF-8 encoded text in both consumed and produced text*unless > the absence of a BOM would result in the text being > interpreted as an ASCII-based encoding and the UTF-8 text > contains no non-ASCII characters (the exception is intended to > avoid the addition of a BOM to ASCII text thus rendering such > text as non-ASCII)*. This approach should be reserved for > scenarios in which UTF-8 cannot be adopted as a default due to > backward compatibility concerns. > > Tom. > > On 10/12/20 8:40 AM, Alisdair Meredith wrote: > > That addresses my main concern. ?Essentially, best practice (for > UTF-8) would be no BOM unless the document contains code points > that require multiple code units to express. > > AlisdairM > > > > On Oct 11, 2020, at 23:22, Tom Honermann > wrote: > > On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: > > One concern I have, that might lead into rationale for the > current discouragement, > > is that I would hate to see a best practice that pushes a > BOM into ASCII files. 
> > One of the nice properties of UTF-8 is that a valid ASCII > file (still very common) is > > also a valid UTF-8 file. ?Changing best practice would > encourage updating those > > files to be no longer ASCII. > > Thanks, Alisdair.? I think that concern is implicitly > addressed by the suggested resolutions, but perhaps that can > be made more clear.? One possibility would be to modify the > "protocol designer" guidelines to address the case where a > protocol's default encoding is ASCII based and to specify that > a BOM is only required for UTF-8 text that contains non-ASCII > characters.? Would that be helpful? > > Tom. > > AlisdairM > > > > On Oct 10, 2020, at 14:54, Tom Honermann via SG16 > > > wrote: > > Attached is a draft proposal for the Unicode standard > that intends to clarify the current recommendation > regarding use of a BOM in UTF-8 text.? This is follow > up to discussion on the Unicode mailing list > > back in June. > > Feedback is welcome.? I plan to submit > this > to the UTC in a week or so pending review feedback. > > Tom. > > -- > SG16 mailing list > SG16 at lists.isocpp.org > https://lists.isocpp.org/mailman/listinfo.cgi/sg16 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Tue Oct 13 15:42:37 2020 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Tue, 13 Oct 2020 20:42:37 +0000 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> Message-ID: My assertion is that if the application cannot change to UTF-8 due to legacy considerations, that the subtleties of whether to use a BOM or not also cannot be prescribed. 
If the application could follow best practices, it would use UTF-8. Since it cannot use UTF-8, therefore it can't follow any prescribed behavior. Therefore anything beyond "Use Unicode!" is merely suggestions. Terminology like "require" implies a false sense of rigor that these applications can't follow in practice. Eg: Presume I have a text editor that has been used in some context for some time. If I'm told "use UTF-8", that's cool, I could try to do that, but if I cannot, then I'm in an exceptional path. Unicode could suggest that I consider behavior for BOMs (such as ignoring them if present), however I'm already stuck in my legacy behavior, so there's a limit to what my application can do. However, if Unicode says "if you see a BOM, then you must use UTF-8", then users of my legacy application that is difficult to change, may have expectations of the application that don't match reality. They could even enter bugs like "The app isn't recognizing data being tagged with BOMs." Or "your system isn't compliant, so we can't license it." If the app could properly handle UTF-8, we'd have been captured in the first requirements and wouldn't even be having this part of the conversation. Since they can't handle UTF-8, trying to enforce it through the BOM isn't going to add much. IMO it's better that everyone involved understand that this legacy app that can't handle UTF-8 by default isn't necessarily going to behave per any set expectations and likely has legacy behaviors that users may need to deal with. Granted, the difference between "requiring," and "suggesting" or "recommending", may be subtle, however those subtleties can sometimes cause unnecessary pain. I don't mind mandating UTF-8 without BOM if possible. I don't really mind mandating that BOMs be ignored if "without BOM" isn't reasonable to mandate. After that though, it's trying to create a higher order protocol for codepage detection. BOM isn't a great way to identify UTF-8 data.
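The decode-and-check alternative to BOM sniffing can be put in a few lines: bytes that decode cleanly under UTF-8's strict lead/continuation-byte rules are very likely UTF-8. A minimal sketch (the function name is illustrative):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: data that decodes cleanly as UTF-8 is very likely UTF-8.

    Every non-ASCII byte must fit UTF-8's lead/continuation structure,
    so text in a legacy single-byte encoding almost always fails fast.
    """
    try:
        data.decode("utf-8")  # strict errors by default
        return True
    except UnicodeDecodeError:
        return False
```

Note that pure ASCII also passes, which is harmless, since ASCII bytes are valid UTF-8 as-is; the heuristic's certainty grows with the amount of non-ASCII input it sees.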
(It's probably more effective to decode it as UTF-8. If it decodes properly, then it's likely UTF-8. With a certainty of about as many "nines" as you have bytes of input. Linguistically appropriate strings that fail that test are rare.) -Shawn From: Tom Honermann Sent: Tuesday, October 13, 2020 1:04 PM To: Shawn Steele ; Alisdair Meredith Cc: sg16 at lists.isocpp.org; Unicode Mail List Subject: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature On 10/12/20 4:54 PM, Shawn Steele wrote: I'm having trouble with the attempt to be this prescriptive. These make sense: "Use Unicode!" * If possible, mandate use of UTF-8 without a BOM; diagnose the presence of a BOM in consumed text as an error, and produce text without a BOM. * Alternatively, swallow the BOM if present. After that the situation is clearly hopeless. Applications should Use Unicode, eg: UTF-8, and clearly there are cases happening where that isn't happening. Trying to prescribe that negotiation should therefore happen, or that BOMs should be interpreted or whatever is fairly meaningless at that point. Given that the higher-order guidance of "Use Unicode" has already been ignored, at this point it's garbage-in, garbage-out. Clearly the app/whatever is ignoring the "use unicode" guidance for some legacy reason. If they could adapt, it should be to use UTF-8. It *might* be helpful to say something about a BOM likely indicating UTF-8 text in otherwise unspecified data, but prescriptive stuff is pointless, it's legacy stuff that behaves in a legacy fashion for a reason and saying they should have done it differently 20 years ago isn't going to help … There are applications that, for legacy reasons, are unable to change their default encoding to UTF-8, but that also need to handle UTF-8 text. It is not clear to me that such situations are hopeless or that they cannot be improved. The prescription offered follows what you suggest.
The first three cases are are all of the "use Unicode!" variety. The distinction between the third and the fourth is to relegate use of a BOM as an encoding signature to the last resort option. The intent is to make it clear, with stronger motivation than is currently present in the Unicode standard, that use of a BOM in UTF-8 is not a best practice today. Tom. -Shawn From: Unicode On Behalf Of Tom Honermann via Unicode Sent: Monday, October 12, 2020 7:03 AM To: Alisdair Meredith Cc: sg16 at lists.isocpp.org; Unicode List Subject: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature Great, here is the change I'm making to address this: Protocol designers: * If possible, mandate use of UTF-8 without a BOM; diagnose the presence of a BOM in consumed text as an error, and produce text without a BOM. * Otherwise, if possible, mandate use of UTF-8 with or without a BOM; accept and discard a BOM in consumed text, and produce text without a BOM. * Otherwise, if possible, use UTF-8 as the default encoding with use of other encodings negotiated using information other than a BOM; accept and discard a BOM in consumed text, and produce text without a BOM. * Otherwise, require the presence of a BOM to differentiate UTF-8 encoded text in both consumed and produced text unless the absence of a BOM would result in the text being interpreted as an ASCII-based encoding and the UTF-8 text contains no non-ASCII characters (the exception is intended to avoid the addition of a BOM to ASCII text thus rendering such text as non-ASCII). This approach should be reserved for scenarios in which UTF-8 cannot be adopted as a default due to backward compatibility concerns. Tom. On 10/12/20 8:40 AM, Alisdair Meredith wrote: That addresses my main concern. Essentially, best practice (for UTF-8) would be no BOM unless the document contains code points that require multiple code units to express. 
AlisdairM On Oct 11, 2020, at 23:22, Tom Honermann wrote: On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: One concern I have, that might lead into rationale for the current discouragement, is that I would hate to see a best practice that pushes a BOM into ASCII files. One of the nice properties of UTF-8 is that a valid ASCII file (still very common) is also a valid UTF-8 file. Changing best practice would encourage updating those files to be no longer ASCII. Thanks, Alisdair. I think that concern is implicitly addressed by the suggested resolutions, but perhaps that can be made more clear. One possibility would be to modify the "protocol designer" guidelines to address the case where a protocol's default encoding is ASCII based and to specify that a BOM is only required for UTF-8 text that contains non-ASCII characters. Would that be helpful? Tom. AlisdairM On Oct 10, 2020, at 14:54, Tom Honermann via SG16 wrote: Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text. This is follow-up to discussion on the Unicode mailing list back in June. Feedback is welcome. I plan to submit this to the UTC in a week or so pending review feedback. Tom. -- SG16 mailing list SG16 at lists.isocpp.org https://lists.isocpp.org/mailman/listinfo.cgi/sg16 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tom at honermann.net Tue Oct 13 15:46:43 2020 From: tom at honermann.net (Tom Honermann) Date: Tue, 13 Oct 2020 16:46:43 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> Message-ID: On 10/12/20 8:09 PM, J Decker via Unicode wrote: > > > On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode > > wrote: > > On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: >> One concern I have, that might lead into rationale for the >> current discouragement, >> is that I would hate to see a best practice that pushes a BOM >> into ASCII files. >> One of the nice properties of UTF-8 is that a valid ASCII file >> (still very common) is >> also a valid UTF-8 file.? Changing best practice would encourage >> updating those >> files to be no longer ASCII. > > Thanks, Alisdair.? I think that concern is implicitly addressed by > the suggested resolutions, but perhaps that can be made more > clear.? One possibility would be to modify the "protocol designer" > guidelines to address the case where a protocol's default encoding > is ASCII based and to specify that a BOM is only required for > UTF-8 text that contains non-ASCII characters.? Would that be helpful? > > > 'and to specify that a BOM is only required for UTF-8 ' this should > NEVER be 'required' or 'must', it shouldn't even be 'suggested'; > fortunately BOM is just a ZWNBSP, so it's certainly a 'may' start with > a such and such. > These days the standard 'everything IS utf-8' works really well, > except in firefox where the charset is required to be specified for JS > scripts (but that's a bug in that) > EBCDIC should be converted on the edge to internal ascii, since, > thankfully, this is a niche application and everything thinks in ASCII > or some derivative thereof. 
> Byte Order Mark is irrelevant to UTF-8, since its bytes are already in the > correct order. > I have run into several editors that insisted on emitting a BOM for > UTF-8 when a file was initially promoted from ASCII, but subsequently deleting it > doesn't bother anything. I mostly agree. Please note that the paper suggests use of a BOM only as a last resort. The goal is to further discourage its use with rationale. > > I am curious though, what was the actual problem you ran into that > makes you even consider this modification? I'm working on improving support for portable C++ source code. Today, there is no character encoding that is supported by all C++ implementations (not even ASCII). I'd like to make UTF-8 that commonly supported character encoding. For backward compatibility reasons, compilers cannot change their default source code character encoding to UTF-8. Most C++ applications are created from components that have different release schedules and that are maintained by different organizations. Synchronizing a conversion to UTF-8 across dependent projects isn't feasible, nor is converting all of the source files used by an application to UTF-8 as simple as just running them through 'iconv'. Migration to UTF-8 will therefore require an incremental approach for at least some applications, though many are likely to find success by simply invoking their compiler with the appropriate -everything-is-utf8 option, since most source files are ASCII. Microsoft Visual C++ recognizes a UTF-8 BOM as an encoding signature and allows differently encoded source files to be used in the same translation unit. Support for differently encoded source files in the same translation unit is the feature that will be needed to enable incremental migration. 
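[Editorial note: the two detection signals discussed in this thread -- a leading BOM treated as an encoding signature, and Shawn's "just try a strict UTF-8 decode" heuristic -- can be combined in a short sketch. This is an illustration of the idea only, not MSVC's actual implementation, and the function name is hypothetical:]

```python
UTF8_BOM = b"\xef\xbb\xbf"

def sniff_source_encoding(data: bytes):
    """Classify raw source bytes using two signals from the thread.

    Returns (encoding_guess, text); text is None when the bytes are not
    valid UTF-8 and some legacy default encoding must be assumed instead.
    """
    if data.startswith(UTF8_BOM):
        # A leading BOM acts as an encoding signature; strip it.
        return "utf-8-sig", data[len(UTF8_BOM):].decode("utf-8")
    try:
        # Strict decode: invalid sequences raise UnicodeDecodeError, and
        # legacy-encoded non-ASCII text rarely decodes cleanly as UTF-8.
        return "utf-8", data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-legacy", None
```

[As the thread notes, the confidence of the strict-decode test grows with input length; very short inputs can be ambiguous, which is why real tools often combine it with other signals.]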
Normative discouragement (with rationale) for use of a BOM by the Unicode standard would be helpful to explain why a solution other than a BOM (perhaps something like Python's encoding declaration) should be standardized in favor of the existing practice demonstrated by Microsoft's solution. Tom. > > J > > Tom. > >> >> AlisdairM >> >>> On Oct 10, 2020, at 14:54, Tom Honermann via SG16 >>> wrote: >>> >>> Attached is a draft proposal for the Unicode standard that >>> intends to clarify the current recommendation regarding use of a >>> BOM in UTF-8 text. This is follow-up to discussion on the >>> Unicode mailing list >>> back in June. >>> >>> Feedback is welcome. I plan to submit >>> this to the UTC >>> in a week or so pending review feedback. >>> >>> Tom. >>> >>> -- >>> SG16 mailing list >>> SG16 at lists.isocpp.org >>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16 >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at honermann.net Tue Oct 13 16:06:28 2020 From: tom at honermann.net (Tom Honermann) Date: Tue, 13 Oct 2020 17:06:28 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> Message-ID: On 10/13/20 4:42 PM, Shawn Steele wrote: > > My assertion is that if the application cannot change to UTF-8 due to > legacy considerations, then the subtleties of whether to use a BOM or > not also cannot be prescribed. If the application could follow best > practices, it would use UTF-8. Since it cannot use UTF-8, it > can't follow any prescribed behavior. Therefore anything beyond > "Use Unicode!" is merely suggestions. Terminology like "require" > implies a false sense of rigor that these applications can't follow in > practice. 
This is why the prescription remains abstract: * If possible, use something other than a BOM. * As a last resort, use a BOM. I am effectively proposing that as a best practice. > Eg: Presume I have a text editor that has been used in some context > for some time. If I'm told "use UTF-8", that's cool, I could try to > do that, but if I cannot, then I'm in an exceptional path. Unicode > could suggest that I consider behavior for BOMs (such as ignoring them > if present); however, I'm already stuck in my legacy behavior, so > there's a limit to what my application can do. This scenario fits the advice above. The "use something other than a BOM" could mean adding a command line option, adding a menu option, remembering what encoding was used for that file last time, performing a heuristic analysis (that may or may not include the presence of a BOM in its calculation), prompting the user, etc... > > However, if Unicode says "if you see a BOM, then you must use UTF-8", > then users of my legacy application that is difficult to change may > have expectations of the application that don't match reality. They > could even enter bugs like "The app isn't recognizing data being > tagged with BOMs." Or "Your system isn't compliant, so we can't > license it." If the app could properly handle UTF-8, we'd have been > captured in the first requirements and wouldn't even be having this > part of the conversation. Since they can't handle UTF-8, trying to > enforce it through the BOM isn't going to add much. No part of this proposal states "if you see a BOM, then you must use UTF-8". It only suggests guidelines; requirements are imposed by protocols as deemed appropriate by the protocol designers. > > IMO it's better that everyone involved understand that this legacy app > that can't handle UTF-8 by default isn't necessarily going to behave > per any set expectations, and likely has legacy behaviors that users > may need to deal with. 
> > Granted, the difference between "requiring," and "suggesting" or > "recommending", may be subtle; however, those subtleties can sometimes > cause unnecessary pain. > > I don't mind mandating UTF-8 without BOM if possible. I don't really > mind mandating that BOMs be ignored if "without BOM" isn't reasonable > to mandate. > > After that though, it's trying to create a higher-order protocol for > codepage detection. BOM isn't a great way to identify UTF-8 data. > (It's probably more effective to decode it as UTF-8. If it decodes > properly, then it's likely UTF-8. With a certainty of about as many > "nines" as you have bytes of input. Linguistically appropriate > strings that fail that test are rare.) We are agreed on these points. Tom. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Tue Oct 13 16:29:45 2020 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Tue, 13 Oct 2020 21:29:45 +0000 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> Message-ID: > The "use something other than a BOM" could mean adding a command line option, adding a menu option, remembering what encoding was used for that file last time, performing a heuristic analysis (that may or may not include the presence of a BOM in its calculation), prompting the user, etc... That's the catch. "Adding"... "adding"... "remembering"... "performing". If the code was doing the best-practices/right thing, it'd be using UTF-8. It isn't, and it's sort of a given that it's legacy behavior. Therefore "adding", etc. means that changes have to happen to the applications and/or processes. Which aren't necessarily going to be deployed promptly, if at all. This isn't a problem that a standard or best practices can solve. 
I sympathize with the problem, since I encounter variations of it every day; I just don't think any tweaking of this text will have any practical impact on moving the needle. -Shawn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mnh48mail at gmail.com Tue Oct 13 19:00:15 2020 From: mnh48mail at gmail.com (Yaya MNH48) Date: Wed, 14 Oct 2020 08:00:15 +0800 Subject: Question for Malay Jawi letter in Unicode - three quarter hamza Message-ID: Hello everyone, I'm new on this list. I've got a question about Jawi (Malay in Arabic script) in Unicode. ## The codepoint for Jawi Letter Hamza Three Quarter High? 
I have been seeing "Jawi Letter Hamzah Three Quarter High" (sic) mentioned in many local documents; it was not in Unicode but was said to have been proposed back then for inclusion after Unicode 5.0, yet I could not find it in Unicode even in the current version, 13.0. Has "Jawi Letter Hamzah Three Quarter High" even been formally proposed and encoded yet? If so, what is the actual codepoint for it in Unicode? Or did no one bring it to Unicode's attention in the first place? Note: The spelling "Hamzah" in all those documents is influenced by Malay (and is also pronounced as such in Malay, with an H sound at the end); it seems none of the authors realized that the final H does not exist in the English spelling "Hamza", and they just carried the Malay spelling "Hamzah" over into all of their English documents. For consistency, I'm using the English spelling "Hamza" instead of the local spelling "Hamzah". While most of the documents are local, some of them do exist online, such as this document (linked after the quote) from 2009, from the Malaysia Network Information Center (MYNIC) to the Internet Assigned Numbers Authority (IANA), for inclusion in their repository of Internationalized Domain Name (IDN) tables under the .my Malay (macrolanguage) (Malaysia) entry, in which the document ends with (sic) > This character is not in the Unicode Table 5.0. The linguist came up with the decision to propose the inclusion of Jawi Letter Hamzah Three Quarter into the Unicode table. Link: https://www.iana.org/domains/idn-tables/tables/my_ms-my_1.0.pdf (via https://www.iana.org/domains/idn-tables ) The Jawi letter Hamza Three Quarter High (marked as HTQ in the examples from now onwards) is part of everyday words, for example in the Jawi spelling of the word "air", meaning "water" or "drink" (noun), which is [alef-HTQ-yeh-reh]. 
Another example is the Jawi spelling of the word "perduaan", meaning "binary" (a term in computing, science and mathematics), which is [PA(Veh)-reh-dal-waw-alef-HTQ-noon]. These are the most common uses of Hamza Three Quarter High in Malay Jawi: [A] consecutive vowels for au, ai, and ui, mostly in native Malay words. Example: - "laut" [laʔut] (meaning: sea) is spelt [lam-alef-HTQ-waw-teh] - "baik" [baʔiʔ] (meaning: good / nice) is spelt [beh-alef-HTQ-yeh-qaf] - "buih" [buʔih] (meaning: bubble) is spelt [beh-waw-HTQ-yeh-heh] [B] diphthongs for au and ai, mostly in Malay words loaned from English. Example: - "audio" [auˈdio] (meaning: audio) is spelt [alef-HTQ-waw-dal-yeh-waw] - "aising" [aiˈsiŋ] (meaning: icing) is spelt [alef-HTQ-yeh-seen-yeh-NGA(AinWithThreeDotsAbove)] [C] suffix -an after vowel a, and suffix -i after vowels a and u, as part of the Malay grammar rule which attaches affixes to modify word form. Example: - "kenyataan" [kəˌɲaˈtaʔan] (meaning: statement) is spelt [keheh-NYA(NoonWithThreeDotsAbove)-alef-teh-alef-HTQ-noon]; it is affixed from the root word "nyata" [ɲaˈta] (meaning: to state) - "cintai" [tʃinˈtaʔi] (meaning: love (conjugated verb)) is spelt [tcheh-yeh-noon-teh-alef-HTQ-yeh]; it is affixed from the root word "cinta" [tʃinˈta] (meaning: to love) - "melalui" [məˌlaˈluʔi] (meaning: via / through) is spelt [meem-lam-alef-lam-waw-HTQ-yeh]; it is affixed from the root word "lalu" [laˈlu] (meaning: to pass through) [D] spelling of Malaysian Chinese names in Malay. Example: - "Ng" [ŋ̍] (multiple family names) is spelt [HTQ-NGA(AinWithThreeDotsAbove)] - "Ong" [oŋ] (multiple family names) 
is spelt [HTQ-waw-NGA(AinWithThreeDotsAbove)] Image of the word spellings: https://jawi.mnh48.moe/assets/img/email/word-with-hamza-three-quarter-high.png Image of sample sentences: https://jawi.mnh48.moe/assets/img/email/sentence-with-hamza-three-quarter-high.png I see people using either a superscripted version of Arabic Letter Hamza (U+0621) or abusing Arabic Letter High Hamza (U+0674), but those don't help in plain text, where none of the abused formatting works in the first place. Some people even gave up and just create an image with the correctly positioned letters and use that image in place of text. Some fonts, notably Amiri, actually display Arabic Letter High Hamza at three-quarter height, probably (though not necessarily) to accommodate Jawi users who abuse that letter for their Hamza Three Quarter High, but that locks users into that one font, as other fonts still display the original height. Link to the Amiri font: https://www.amirifont.org/ The letter is the same size as regular Hamza, not any smaller like High Hamza or the Hamza above/below Alef. It is positioned at three-quarter height of the writing line, unlike Arabic Letter High Hamza, which displays at the highest position on the line, and regular Hamza, which sits on the baseline. The letter is also distinct from regular Hamza, which is mostly used for Arabic or Sanskrit loanwords in Malay. Regular Hamza and Three Quarter High Hamza co-exist in Malay Jawi; they can occur in the same sentence or even the same word, but they should not be shown at the same height in any context, including plain text, as that could cause confusion when reading, since they signal different sounds, and it is grammatically wrong as well. To recap the questions from the first paragraph, in case they are still unclear: Has "Jawi Letter Hamzah Three Quarter High" even been formally proposed and encoded yet? If so, what is the actual codepoint for it in Unicode? 
Or did no one bring it to Unicode's attention in the first place? I'm looking forward to more information regarding this. Best regards, [Yaya] Yaya MNH48 [A PDF version of this email is also attached; it has slightly different formatting, as there I could mark the text directly with formatting so that it displays correctly, so the additional images are not needed, unlike this plain-text email, where none of the formatting would work.] -------------- next part -------------- A non-text attachment was scrubbed... Name: email-for-unicode-about-hamza-three-quarter-high.pdf Type: application/pdf Size: 68596 bytes Desc: not available URL: From tom at honermann.net Wed Oct 14 00:16:14 2020 From: tom at honermann.net (Tom Honermann) Date: Wed, 14 Oct 2020 01:16:14 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> Message-ID: <6dbc8066-b6bd-e0f7-47a1-708f40b69aa2@honermann.net> On 10/13/20 5:29 PM, Shawn Steele wrote: > > > The "use something other than a BOM" could mean adding a command > line option, adding a menu option, remembering what encoding was used > for that file last time, performing a heuristic analysis (that may or > may not include the presence of a BOM in its calculation), prompting > the user, etc... > > That's the catch. "Adding"... "adding"... "remembering"... "performing". If the > code was doing the best-practices/right thing, it'd be using UTF-8. > It isn't, and it's sort of a given that it's legacy behavior. > Therefore "adding", etc. means that changes have to happen to the > applications and/or processes. Which aren't necessarily going to be > deployed promptly, if at all. This isn't a problem that a standard or > best practices can solve. 
> > Everyone already knows the best practice: "Use UTF-8". Any
> resources/effort is going to go toward that best practice, not
> edge cases of legacy behaviors that are offshoots of something that
> isn't the desired end state of "use UTF-8".
>
My goal is exactly to ease migration to that end state. We can't reasonably synchronize a migration of all C++ projects to UTF-8. To get to that end state, we'll have to enable C++ projects to independently transition to UTF-8. Such independent transition will be eased by having a portable means to indicate that a source file is UTF-8 encoded in such a way that a C++ compiler can process it correctly when it is #included from a differently encoded source file. This would suffice for a project to migrate to UTF-8 while being usable (e.g., having its header files #included) by another UTF-8 encoded project, a Windows-1252 encoded project, or an EBCDIC encoded project. Those other projects can then migrate to UTF-8 on their own schedule. Use of a BOM would be one way to get to that desired end state but, as you mentioned, a BOM isn't a great way to identify UTF-8 data. The Unicode standard already admits this with the quoted "not recommended" text, but it lacks the rationale to defend that recommendation or to explain when it may be appropriate to disregard it. My goal with this paper is to fill that hole. If you don't care for how I've proposed to fill it, that is certainly OK, and alternative suggestions are welcome.
> I sympathize with the problem, since I encounter variations of it
> every day; I just don't think any tweaking of this text will have any
> practical impact on moving the needle.
>
That is entirely possible. Tom.
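[Editorial aside, not part of the thread: the "accept and discard a BOM in consumed text" behavior Tom describes can be illustrated in Python, whose standard "utf-8-sig" codec strips a leading U+FEFF if present and leaves BOM-less input untouched. This is only a sketch of the consumption pattern, not the proposal's mechanism.]

```python
# Sketch: consume UTF-8 text whether or not the producer emitted a BOM.
# Python's "utf-8-sig" codec discards a leading BOM and otherwise behaves
# like plain UTF-8.
BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF

def read_utf8(data: bytes) -> str:
    # Accept and discard a BOM in consumed text.
    return data.decode("utf-8-sig")

assert read_utf8(BOM + "caf\u00e9".encode("utf-8")) == "caf\u00e9"
assert read_utf8("caf\u00e9".encode("utf-8")) == "caf\u00e9"
```

A producer following the same guidelines would emit `text.encode("utf-8")`, i.e. without a BOM.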
> -Shawn
>
> *From:* Tom Honermann
> *Sent:* Tuesday, October 13, 2020 2:06 PM
> *To:* Shawn Steele ; Alisdair Meredith
> *Cc:* sg16 at lists.isocpp.org; Unicode Mail List
> *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of a
> BOM as a UTF-8 encoding signature
>
> On 10/13/20 4:42 PM, Shawn Steele wrote:
>
> My assertion is that if the application cannot change to UTF-8 due
> to legacy considerations, then the subtleties of whether or not to use a
> BOM also cannot be prescribed. If the application could
> follow best practices, it would use UTF-8. Since it cannot use
> UTF-8, it therefore can't follow any prescribed behavior.
> Therefore anything beyond "Use Unicode!" is merely a suggestion.
> Terminology like "require" implies a false sense of rigor that
> these applications can't follow in practice.
>
> This is why the prescription remains abstract:
>
> * If possible, use something other than a BOM.
> * As a last resort, use a BOM.
>
> I am effectively proposing that as a best practice.
>
> E.g.: Presume I have a text editor that has been used in some
> context for some time. If I'm told "use UTF-8", that's cool, I
> could try to do that, but if I cannot, then I'm on an exceptional
> path. Unicode could suggest that I consider behavior for BOMs
> (such as ignoring them if present); however, I'm already stuck in
> my legacy behavior, so there's a limit to what my application can do.
>
> This scenario fits the advice above. The "use something other than a
> BOM" could mean adding a command line option, adding a menu option,
> remembering what encoding was used for that file last time, performing
> a heuristic analysis (that may or may not include the presence of a
> BOM in its calculation), prompting the user, etc...
>
> However, if Unicode says "if you see a BOM, then you must use
> UTF-8", then users of my legacy application that is difficult to
> change may have expectations of the application that don't match
> reality.
They could even enter bugs like "The app isn't
> recognizing data being tagged with BOMs." Or "your system isn't
> compliant, so we can't license it." If the app could properly
> handle UTF-8, we'd have been captured in the first requirements
> and wouldn't even be having this part of the conversation. Since
> they can't handle UTF-8, trying to enforce it through the BOM
> isn't going to add much.
>
> No part of this proposal states "if you see a BOM, then you must use
> UTF-8". It only suggests guidelines; requirements are imposed by
> protocols as deemed appropriate by the protocol designers.
>
> IMO it's better that everyone involved understand that this legacy
> app that can't handle UTF-8 by default isn't necessarily going to
> behave per any set expectations and likely has legacy behaviors
> that users may need to deal with.
>
> Granted, the difference between "requiring" and "suggesting" or
> "recommending" may be subtle; however, those subtleties can
> sometimes cause unnecessary pain.
>
> I don't mind mandating UTF-8 without BOM if possible. I don't
> really mind mandating that BOMs be ignored if "without BOM" isn't
> reasonable to mandate.
>
> After that, though, it's trying to create a higher-order protocol
> for codepage detection. BOM isn't a great way to identify UTF-8
> data. (It's probably more effective to decode it as UTF-8. If it
> decodes properly, then it's likely UTF-8, with a certainty of
> about as many "nines" as you have bytes of input. Linguistically
> appropriate strings that fail that test are rare.)
>
> We are agreed on these points.
>
> Tom.
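[Editorial aside, not part of the thread: Shawn's heuristic above — try to decode the bytes as UTF-8 and treat success as strong evidence — is easy to demonstrate. The sketch below is illustrative only; real detectors may also weigh a BOM, declared charsets, or statistical models.]

```python
# Sketch of the "just try to decode it" heuristic: well-formed UTF-8 is
# strong evidence of UTF-8, because bytes from legacy 8-bit encodings
# rarely happen to form valid multi-byte UTF-8 sequences.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8("na\u00efve \u65e5\u672c\u8a9e".encode("utf-8"))
# In cp1252, "ï" is the single byte 0xEF, which UTF-8 reads as a 3-byte
# lead byte; the following "v" (0x76) is not a valid continuation byte.
assert not looks_like_utf8("na\u00efve".encode("cp1252"))
```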
> -Shawn
>
> *From:* Tom Honermann
> *Sent:* Tuesday, October 13, 2020 1:04 PM
> *To:* Shawn Steele ; Alisdair Meredith
> *Cc:* sg16 at lists.isocpp.org ; Unicode Mail List
> *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of
> a BOM as a UTF-8 encoding signature
>
> On 10/12/20 4:54 PM, Shawn Steele wrote:
>
> I'm having trouble with the attempt to be this prescriptive.
>
> These make sense: "Use Unicode!"
>
> * If possible, mandate use of UTF-8 without a BOM; diagnose
> the presence of a BOM in consumed text as an error, and
> produce text without a BOM.
> * Alternatively, swallow the BOM if present.
>
> After that the situation is clearly hopeless. Applications
> should use Unicode, e.g. UTF-8, and clearly there are cases
> where that isn't happening. Trying to prescribe
> that negotiation should therefore happen, or that BOMs should
> be interpreted, or whatever, is fairly meaningless at that
> point. Given that the higher-order guidance of "Use Unicode"
> has already been ignored, at this point it's garbage-in,
> garbage-out. Clearly the app/whatever is ignoring the "use
> Unicode" guidance for some legacy reason. If they could
> adapt, it should be to use UTF-8. It **might** be helpful to
> say something about a BOM likely indicating UTF-8 text in
> otherwise unspecified data, but prescriptive stuff is
> pointless; it's legacy stuff that behaves in a legacy fashion
> for a reason, and saying they should have done it differently
> 20 years ago isn't going to help.
>
> There are applications that, for legacy reasons, are unable to
> change their default encoding to UTF-8, but that also need to
> handle UTF-8 text. It is not clear to me that such situations are
> hopeless or that they cannot be improved.
>
> The prescription offered follows what you suggest. The first
> three cases are all of the "use Unicode!" variety.
The > distinction between the third and the fourth is to relegate use of > a BOM as an encoding signature to the last resort option.? The > intent is to make it clear, with stronger motivation than is > currently present in the Unicode standard, that use of a BOM in > UTF-8 is not a best practice today. > > Tom. > > -Shawn > > *From:* Unicode > *On Behalf Of *Tom > Honermann via Unicode > *Sent:* Monday, October 12, 2020 7:03 AM > *To:* Alisdair Meredith > > *Cc:* sg16 at lists.isocpp.org ; > Unicode List > *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use > of a BOM as a UTF-8 encoding signature > > Great, here is the change I'm making to address this: > > Protocol designers: > > * If possible, mandate use of UTF-8 without a BOM; > diagnose the presence of a BOM in consumed text as an > error, and produce text without a BOM. > * Otherwise, if possible, mandate use of UTF-8 with or > without a BOM; accept and discard a BOM in consumed > text, and produce text without a BOM. > * Otherwise, if possible, use UTF-8 as the default > encoding with use of other encodings negotiated using > information other than a BOM; accept and discard a BOM > in consumed text, and produce text without a BOM. > * Otherwise, require the presence of a BOM to > differentiate UTF-8 encoded text in both consumed and > produced text*unless the absence of a BOM would result > in the text being interpreted as an ASCII-based > encoding and the UTF-8 text contains no non-ASCII > characters (the exception is intended to avoid the > addition of a BOM to ASCII text thus rendering such > text as non-ASCII)*. This approach should be reserved > for scenarios in which UTF-8 cannot be adopted as a > default due to backward compatibility concerns. > > Tom. > > On 10/12/20 8:40 AM, Alisdair Meredith wrote: > > That addresses my main concern. ?Essentially, best > practice (for UTF-8) would be no BOM unless the document > contains code points that require multiple code units to > express. 
> > AlisdairM > > > > > > On Oct 11, 2020, at 23:22, Tom Honermann > > wrote: > > On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: > > One concern I have, that might lead into rationale > for the current discouragement, > > is that I would hate to see a best practice that > pushes a BOM into ASCII files. > > One of the nice properties of UTF-8 is that a > valid ASCII file (still very common) is > > also a valid UTF-8 file. ?Changing best practice > would encourage updating those > > files to be no longer ASCII. > > Thanks, Alisdair.? I think that concern is implicitly > addressed by the suggested resolutions, but perhaps > that can be made more clear.? One possibility would be > to modify the "protocol designer" guidelines to > address the case where a protocol's default encoding > is ASCII based and to specify that a BOM is only > required for UTF-8 text that contains non-ASCII > characters.? Would that be helpful? > > Tom. > > AlisdairM > > > > > > On Oct 10, 2020, at 14:54, Tom Honermann via > SG16 > wrote: > > Attached is a draft proposal for the Unicode > standard that intends to clarify the current > recommendation regarding use of a BOM in UTF-8 > text.? This is follow up to discussion on the > Unicode mailing list > > back in June. > > Feedback is welcome.? I plan to submit > > this to the UTC in a week or so pending review > feedback. > > Tom. > > -- > SG16 mailing list > SG16 at lists.isocpp.org > > https://lists.isocpp.org/mailman/listinfo.cgi/sg16 > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andrewcwest at gmail.com Wed Oct 14 03:46:06 2020 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 14 Oct 2020 09:46:06 +0100 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <6dbc8066-b6bd-e0f7-47a1-708f40b69aa2@honermann.net> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> <6dbc8066-b6bd-e0f7-47a1-708f40b69aa2@honermann.net> Message-ID: On Wed, 14 Oct 2020 at 06:22, Tom Honermann via Unicode wrote: > > Use of a BOM would be one way to get to that desired end state but, as you mentioned, a BOM isn't a great way to identify UTF-8 data. It is just as good a way to identify UTF-8 data as a BOM in UTF-18 data is for identifying UTF-16BE and UTF-16LE data. > The Unicode standard already admits this with the quoted "not recommended" text, I'm sorry, where is "the quoted "not recommended" text" in the Unicode Standard? The Unicode Standard section 2.6 (https://www.unicode.org/versions/Unicode13.0.0/ch02.pdf#G9354) states: "Use of a BOM is neither required nor recommended for UTF-8" My understanding of this poorly-phrased statement is that the Unicode Standard does not have a recommendation to use a BOM in UTF-8 text, but neither does it recommend not to use a BOM in UTF-8 text, i.e. the standard is essentially neutral on the position of BOM in UTF-8 (I think the interpretation of this statement has been discussed at least once previously on the Unicode list). > but it lacks the rationale to defend that recommendation or to explain when it may be appropriate to disregard that recommendation. The Unicode Standard text is explicitly not a recommendation! 
Andrew From tom at honermann.net Wed Oct 14 08:35:22 2020 From: tom at honermann.net (Tom Honermann) Date: Wed, 14 Oct 2020 09:35:22 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> <6dbc8066-b6bd-e0f7-47a1-708f40b69aa2@honermann.net> Message-ID: <84ff62e5-7157-1c92-6995-db6f68e95e23@honermann.net> On 10/14/20 4:46 AM, Andrew West wrote: > On Wed, 14 Oct 2020 at 06:22, Tom Honermann via Unicode > wrote: >> Use of a BOM would be one way to get to that desired end state but, as you mentioned, a BOM isn't a great way to identify UTF-8 data. > It is just as good a way to identify UTF-8 data as a BOM in UTF-18 > data is for identifying UTF-16BE and UTF-16LE data. > >> The Unicode standard already admits this with the quoted "not recommended" text, > I'm sorry, where is "the quoted "not recommended" text" in the Unicode > Standard? The Unicode Standard section 2.6 > (https://www.unicode.org/versions/Unicode13.0.0/ch02.pdf#G9354) > states: > > "Use of a BOM is neither required nor recommended for UTF-8" > > My understanding of this poorly-phrased statement is that the Unicode > Standard does not have a recommendation to use a BOM in UTF-8 text, > but neither does it recommend not to use a BOM in UTF-8 text, i.e. the > standard is essentially neutral on the position of BOM in UTF-8 (I > think the interpretation of this statement has been discussed at least > once previously on the Unicode list). This has been discussed before; one such discussion is linked from the paper (https://corp.unicode.org/pipermail/unicode/2020-June/008713.html). Your interpretation of that phrase does not match my interpretation nor that of anyone else that I've discussed this with.? 
If the intent had been to be neutral, then "Use of a BOM is not required for UTF-8" would have sufficed.? If the intent had been to be explicitly neutral, then something like "Use of a BOM in UTF-8 is not required and this standard makes no recommendations regarding its use or non-use". > >> but it lacks the rationale to defend that recommendation or to explain when it may be appropriate to disregard that recommendation. > The Unicode Standard text is explicitly not a recommendation! If so, then the first suggested resolution in the paper would clarify that. Tom. > > Andrew From asmusf at ix.netcom.com Wed Oct 14 09:30:48 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 14 Oct 2020 07:30:48 -0700 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: <84ff62e5-7157-1c92-6995-db6f68e95e23@honermann.net> References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> <6dbc8066-b6bd-e0f7-47a1-708f40b69aa2@honermann.net> <84ff62e5-7157-1c92-6995-db6f68e95e23@honermann.net> Message-ID: <16c8ace8-e2a7-d197-2e9d-28e20794986a@ix.netcom.com> An HTML attachment was scrubbed... URL: From boldewyn at gmail.com Wed Oct 14 16:04:46 2020 From: boldewyn at gmail.com (Manuel Strehl) Date: Wed, 14 Oct 2020 23:04:46 +0200 Subject: Updating TR 42 to Unicode 14.0.0 Message-ID: <375c2f8f-39bc-131e-fe66-46c4e9af580d@gmail.com> Hi, I?ve just noticed, that TR #42 (Unicode in XML) [1] was not updated to Unicode 14.0.0. Is that simply an oversight or is the XML representation abandoned? (Which would be a pity. I used it with some success as source of codepoints.net.) 
Cheers, Manuel [1] https://www.unicode.org/reports/tr42/ From markus.icu at gmail.com Wed Oct 14 16:28:12 2020 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 14 Oct 2020 14:28:12 -0700 Subject: Updating TR 42 to Unicode 14.0.0 In-Reply-To: <375c2f8f-39bc-131e-fe66-46c4e9af580d@gmail.com> References: <375c2f8f-39bc-131e-fe66-46c4e9af580d@gmail.com> Message-ID: I would be surprised if anything had been updated to a Unicode version that is scheduled to be released roughly a year from now. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From boldewyn at gmail.com Wed Oct 14 16:31:21 2020 From: boldewyn at gmail.com (Manuel Strehl) Date: Wed, 14 Oct 2020 23:31:21 +0200 Subject: Updating TR 42 to Unicode 14.0.0 In-Reply-To: References: <375c2f8f-39bc-131e-fe66-46c4e9af580d@gmail.com> Message-ID: Right, v14.0.0 is not released yet. Thank you! That was a silly mistake on my end. Cheers, Manuel Am Mi., 14. Okt. 2020 um 23:28 Uhr schrieb Markus Scherer : > > I would be surprised if anything had been updated to a Unicode version that is scheduled to be released roughly a year from now. > markus From prosfilaes at gmail.com Wed Oct 14 17:11:11 2020 From: prosfilaes at gmail.com (David Starner) Date: Wed, 14 Oct 2020 15:11:11 -0700 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <04334236-5f57-4d24-feaf-5a21169b347e@honermann.net> <74f39a75-72d8-aacf-02f8-d21a18450a56@honermann.net> <6dbc8066-b6bd-e0f7-47a1-708f40b69aa2@honermann.net> Message-ID: On Wed, Oct 14, 2020, 1:52 AM Andrew West via Unicode wrote: > It is just as good a way to identify UTF-8 data as a BOM in UTF-18 > data is for identifying UTF-16BE and UTF-16LE data. > No, it's not. UTF-16/32 is basically the only encodings to use more than 8 bits to encode all characters. 
It's expected to use a general purpose signature reader to identify UTF-16. UTF-8, on the other hand, was designed and is used in a world of ASCII extensions where it's often expected that the encoding can be named near the start of the file with no need for nonASCII characters before the encoding declaration. A UTF-8 BOM breaks that assumption. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lorna_evans at sil.org Thu Oct 15 17:13:48 2020 From: lorna_evans at sil.org (Lorna Evans) Date: Thu, 15 Oct 2020 17:13:48 -0500 Subject: Question for Malay Jawi letter in Unicode - three quarter hamza In-Reply-To: References: Message-ID: <7ba2a19c-7635-9fd5-79e1-b8453385247c@sil.org> I believe you should use 0674;ARABIC LETTER HIGH HAMZA;Lo;0;AL;;;;;N;ARABIC LETTER HIGH HAMZAH As you say, some fonts like Amiri position it lower. I have a document being discussed to bring the position of the glyph down a bit from where it is in the codecharts so it's about even with the top of the alef rather than much higher. Your examples look lower than that, but that would be a font issue, not an encoding issue. Maybe we should consider adding an annotation for U+0674 to say that this character should be used for Jawi. Lorna On 10/13/2020 7:00 PM, Yaya MNH48 via Unicode wrote: > Hello everyone, I'm new on this list. > > I've got question for Jawi (Malay in Arabic script) in Unicode. > > > ## The codepoint for Jawi Letter Hamza Three Quarter High? > > I have been seeing "Jawi Letter Hamzah Three Quarter High" (sic) > mentioned on many local documents, that it was not in Unicode but said > to be proposed back then to be included after Unicode 5.0, but I could > not find it on Unicode even on the current version 13.0. Is "Jawi > Letter Hamzah Three Quarter High" even formally proposed and encoded > yet? If so, what is the actual codepoint for it in Unicode? or did no > one brought it to Unicode's attention in the first place? 
> > Note: The spelling "Hamzah" in all those documents are influenced from > Malay (and also pronounced as such in Malay, with an H sound at the > end), seemed like none of them realized that the final H does not > exist in the English spelling "Hamza", and they just carried over the > Malay spelling "Hamzah" in all of their English documents. For > consistency, I'm using the proper English spelling "Hamza" instead of > the actual spelling being used over here "Hamzah". > > > While most of the documents are local, some of them do exist online, > such as this document (linked after quote) in 2009 from Malaysia > Network Information Center (MYNIC) to Internet Assigned Numbers > Authority (IANA) for inclusion in their repository of > Internationalized Domain Names (IDN) tables for .my Malay > (macrolanguage) (Malaysia) entry, in which the document ends with > (sic) >> This character is not in the Unicode Table 5.0. The linguist came up with the decision to propose the inclusion of Jawi Letter Hamzah Three Quarter into the Unicode table. > Link: https://www.iana.org/domains/idn-tables/tables/my_ms-my_1.0.pdf > (via https://www.iana.org/domains/idn-tables ) > > > The Jawi letter Hamza Three Quarter High (marked as HTQ in the > examples from now onwards) is part of everyday use words, for example > in the Jawi spelling of the word "air" which mean "water" or "drink" > (noun) which is [alef-HTQ-yeh-reh]. Another example would be in the > Jawi spelling of the word "perduaan" which mean "binary" (term in > computing, science and mathematics) which is > [PA(Veh)-reh-dal-waw-alef-HTQ-noon]. > > These are the most common usage of Hamza Three Quarter High in Malay Jawi: > > [A] consecutive vowels for au, ai, and ui, mostly in native Malay words. > Example: > ? "laut" [la?ut] (meaning: sea) is spelt [lam-alef-HTQ-waw-teh] > ? "baik" [ba?i?] (meaning: good / nice) is spelt [beh-alef-HTQ-yeh-qaf] > ? 
"buih" [bu?ih] (meaning: bubble) is spelt [beh-waw-HTQ-yeh-heh] > > [B] diphtongs for au and ai, mostly in Malay words loaned from English. > Example: > ? "audio" [au?dio] (meaning: audio) is spelt [alef-HTQ-waw-dal-yeh-waw] > ? "aising" [ai?si?] (meaning: icing) is spelt > [alef-HTQ-yeh-seen-yeh-NGA(AinWithThreeDotsAbove)] > > [C] suffix -an after vowel a, and suffix -i after vowels a and u, as > part of Malay grammar rule which attaches affixes to modify word form. > Example: > ? "kenyataan" [k???a?ta?an] (meaning: statement) is spelt > [keheh-NYA(NoonWithThreeDotsAbove)-alef-teh-alef-HTQ-noon], it got > affixed from root word "nyata" [?a?ta] (meaning: to state) > ? "cintai" [t?in?ta?i] (meaning: love (conjugated verb)) is spelt > [tcheh-yeh-noon-teh-alef-HTQ-yeh], it got affixed from root word > "cinta" [t?in?ta] (meaning: to love) > ? "melalui" [m??la?lu?i] (meaning: via / through) is spelt > [meem-lam-alef-lam-waw-HTQ-yeh], it got affixed from root word "lalu" > [la?lu] (meaning: to pass through) > > [D] spelling of Malaysian Chinese names in Malay. > Example: > ? "Ng" [??] (multiple family names including ?) is spelt > [HTQ-NGA(AinWithThreeDotsAbove)] > ? "Ong" [o?] (multiple family names including ?) is spelt > [HTQ-waw-NGA(AinWithThreeDotsAbove)] > > Image of the word spellings: > https://jawi.mnh48.moe/assets/img/email/word-with-hamza-three-quarter-high.png > > Image of sample sentences: > https://jawi.mnh48.moe/assets/img/email/sentence-with-hamza-three-quarter-high.png > > > I'm seeing people using either superscripted version of Arabic Letter > Hamza (U+0621) or abuses Arabic Letter High Hamza (U+0674) but those > doesn't help in plain text where none of the abused formatting would > work in the first place. Some people even gave up and just create > image with the correctly positioned letters and uses that image in > place of text. 
Some font, notably Amiri, actually displayed Arabic > Letter High Hamza on three-quarter high, probably (but not necessarily > the case) to accommodate Jawi users who abuses that letter for their > Hamza Three Quarter High, but that would lock users to use only that > font as other fonts still display the original height. > > Link to Amiri font: https://www.amirifont.org/ > > > The letter is the same size as regular Hamza, not any smaller like the > size of High Hamza or the Hamza above/below Alef. It is positioned at > three-quarter height of the writing line, unlike Arabic Letter High > Hamza that displays on the highest position on the line nor regular > Hamza that displays on the baseline. > > The letter is also separate from regular Hamza, which is mostly used > for Arabic or Sanskrit loanwords in Malay. Regular Hamza and Three > Quarter High Hamza co-exist in Malay Jawi, could exist in the same > sentence or even the same word, but should not be shown at the same > height in all cases including plain text, otherwise it could cause > confusion when reading since it could signal different sound, and it > is grammatically wrong as well. > > To recap the questions from first paragraph in case you are still > unclear on the actual questions: Is "Jawi Letter Hamzah Three Quarter > High" even formally proposed and encoded yet? If so, what is the > actual codepoint for it in Unicode? or did no one brought it to > Unicode's attention in the first place? > > I'm looking forward for more information regarding this. 
> > Best regards, > [Yaya] > Yaya MNH48 > > [A PDF version of this email is also attached as attachment, however > it has slightly different formatting as I could directly mark the text > in formatting so that it will be displayed correctly and so the > additional images are not needed, unlike the plaintext email where > none of the formatting would work] From tom at honermann.net Thu Oct 15 17:32:39 2020 From: tom at honermann.net (Tom Honermann) Date: Thu, 15 Oct 2020 18:32:39 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> Message-ID: On 10/14/20 8:25 AM, Alisdair Meredith wrote: > A minor note for clarity. > > I would better understand the goal of this paper if there were an > early sentence indicating > whether the target audience of the advice is: > ? ?1) document authors > ? ?2) document processing tools > ? ?3) both equally > > The advice seems geared strongly towards group (2), but we probably > want to send a > message to group (1) as well. Thanks Alisdair.? I'll add more of an introduction.? The suggested resolutions do address (1) as well, but one has to read until the end to see that.? Good suggestion. Tom. > > AlisdairM > >> On Oct 10, 2020, at 14:54, Tom Honermann via SG16 >> > wrote: >> >> Attached is a draft proposal for the Unicode standard that intends to >> clarify the current recommendation regarding use of a BOM in UTF-8 >> text. This is follow up to discussion on the Unicode mailing list >> >> back in June. >> >> Feedback is welcome.? I plan to submit >> this to the UTC in a >> week or so pending review feedback. >> >> Tom. >> >> -- >> SG16 mailing list >> SG16 at lists.isocpp.org >> https://lists.isocpp.org/mailman/listinfo.cgi/sg16 > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Fri Oct 16 12:19:22 2020 From: doug at ewellic.org (Doug Ewell) Date: Fri, 16 Oct 2020 11:19:22 -0600 Subject: Teletext separated mosaic graphics In-Reply-To: References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> <547afa9d-fc6a-7fd3-6d74-591b4819eb6b@ix.netcom.com> <5CC9C8F2-7162-4D08-A832-088DAC5FE163@bahnhof.se> Message-ID: <000201d6a3e0$7d8b95f0$78a2c1d0$@ewellic.org> This discussion has focused on two fundamentally incompatible encoding protocols: 1. ECMA-48 2. Teletext to which a third has been introduced: 3. HTML Of course it is possible to convert almost any amount of plain-text styling into HTML, especially with enough CSS. Kent has provided an extensive set of examples to illustrate this. I did not find a corresponding example of the original source text for any of them, however, so it's hard to evaluate from all this what format is most suitable for representing original teletext pages in a Unicode plain-text environment. I reiterate that it was UTC and Script Ad Hoc who provided the guidance to the group writing the Symbols for Legacy Computing proposal (and there is a second on the way) that 0x00 through 0x1F in the original teletext set should map to U+0000 through U+001F when converting to Unicode. We will eventually create a Unicode Technical Note to help guide implementers in the use of Legacy Symbols characters to develop, among other things, teletext emulation. It would be great if we could converge on a solution for this that would align with the guidance of UTC and Script Ad Hoc. 
-- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From Shawn.Steele at microsoft.com Fri Oct 16 13:33:15 2020 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Fri, 16 Oct 2020 18:33:15 +0000 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <93234e19-927e-f823-8748-ec65fc6d5602@honermann.net> Message-ID:

Nobody's going to consider #1 regardless of what wordsmithing is done in Unicode; people have had too much difficulty with BOMs for it to be considered as a serious standards-based solution. #4 isn't portable.

The "right" approach would be to ensure that the languages have ways of declaring a codepage (like a pragma or other magic semantic, options 2 & 3). The time invested in this problem should be spent on getting agreement with WG21 about what the declaration should be and seeing if there are any "gotchas" to something like #pragma UTF8. IMO, it's not worth the effort to tweak Unicode's guidance in order to support the common view that BOMs are bad, which WG21 won't be considering anyway.

The biggest thing I can think of is that very few codepages would lend themselves to being declared in a portable manner. Different OSs/software/vendors have different implementations of various codepages. Even ones that are nominally similar are often mistagged or have subtle differences. In other words, "UTF8" is about the only "safe" encoding that won't have edge cases. Something like "shift-jis" has multiple legacy variations, which means everything won't always be the same.
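[Editorial aside, not part of the thread: Shawn's point about nominally similar codepages is easy to demonstrate with the classic Shift-JIS "wave dash" divergence. Python ships both a strict "shift_jis" codec and Microsoft's "cp932" variant, and they decode the same two bytes to different Unicode characters.]

```python
# Two codecs that are both called "Shift-JIS" in casual usage disagree on
# the JIS X 0208 wave dash at bytes 0x81 0x60.
raw = b"\x81\x60"
assert raw.decode("shift_jis") == "\u301c"  # U+301C WAVE DASH (strict JIS mapping)
assert raw.decode("cp932") == "\uff5e"      # U+FF5E FULLWIDTH TILDE (Windows mapping)
```

This is exactly why a bare label like "shift-jis" is not a portable encoding declaration, whereas UTF-8 has a single interpretation.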
-Shawn From: Tom Honermann Sent: Friday, October 16, 2020 6:23 AM To: Shawn Steele ; J Decker Cc: sg16 at lists.isocpp.org Subject: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

On 10/14/20 3:21 PM, Shawn Steele wrote: How are you going to #include differently encoded source files? I don't see anything in this document that would make it possible to #include a file in a different encoding. It's unclear to me how your proposed document could be utilized to enable the scenario you're interested in.

My intention is to present various options for WG21 to consider along with a recommendation. The options that have been identified so far are listed below. Combinations of some of these options are a possibility. 1. Use of a BOM to indicate UTF-8 encoded source files. This matches existing practice for the Microsoft compiler. 2. Use of a #pragma. This matches existing practice for the IBM compiler. 3. Use of a "magic" or "semantic" comment. This matches existing practice in Python. 4. Use of filesystem metadata. This is an option for some compilers and is being considered for Clang on z/OS.

The goal of this paper is to clarify guidance in the Unicode standard in order to better inform and justify a recommendation. If the UTC were to provide a strong recommendation either for or against use of a BOM in UTF-8 files, that would be a point either in favor of or in opposition to option 1 above. As is, based on my reading and a number of the responses I've seen, the guidance is murky.

For mixed-encoding behavior, the only thing I could imagine is adding some sort of preprocessor #codepage or something to the standard. (Which would again take a while to reach critical mass.)

Yes, deployment will take time in any case. A goal would be to choose an option that can be used as an extension for previous C++ standards. This may rule out option 2 above, since some compilers diagnose use of an unrecognized pragma. Tom.
-Shawn From: Tom Honermann Sent: Tuesday, October 13, 2020 9:47 PM To: Shawn Steele ; J Decker Cc: sg16 at lists.isocpp.org Subject: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature On 10/13/20 5:19 PM, Shawn Steele wrote: IMO this document doesn't solve your problem. The goal of encouraging use of UTF-8 in C++ source code is one that most compilers/source code authors/etc. are totally on board with. The source is already in an indeterminate state. The desired end state is to have UTF-8 source code (without BOM), which is typically supported. The difficulty is therefore getting from point A to point B. As far as "Use Unicode" goes, there's no issue, but trying to specify BOM as a protocol doesn't really solve the problem, particularly in complex environments. I think there is a misunderstanding. The intent of the paper is to provide rationale for the existing discouragement of use of a BOM in UTF-8 while acknowledging that, in some cases, it may remain useful. My intent is to discourage use of a BOM for UTF-8 encoded source files - thereby arguing against standardizing the behavior exhibited by Microsoft Visual C++ today. If the compiler doesn't handle the BOM as expected, then you'll get errors. This can be further complicated by preprocessors, #include, resources, etc. If "specifying BOM behavior in Unicode" could help solve the problem, then all of the tooling used by everyone would have to be updated to handle that (new) requirement. If you could get everyone on the same page, they'd all use UTF-8, so you wouldn't need to update the tooling. If you don't need to update the tooling, you wouldn't need to update the best practices for BOMs. This paper does not propose "specifying BOM behavior in Unicode". If you feel that it does, please read it again and let me know what leads you to believe that it does. The tooling isn't the problem. 
The problem is the existing source code that is not UTF-8 encoded or that is UTF-8 encoded with a BOM. The deployment challenge is with those existing source files. Microsoft Visual C++ is going to continue consuming source files using the Active Code Page (ACP) and IBM compilers on EBCDIC platforms are going to continue consuming source files using EBCDIC code pages. The goal is to provide a mechanism where a UTF-8 encoded source file can #include a source file in another encoding or vice versa. Any solution for that will require tooling updates (and that is ok). Personally, I'd prefer if cases like this ignore BOMs (or use them to switch to UTF-8); e.g. treat BOMs like whitespace. But this isn't a problem solvable by any recommendation by Unicode. When consuming text as UTF-8, I agree that ignoring a BOM is usually the right thing to do and would be the right thing to do when consuming source code. As you noted, many systems provide mechanisms for indicating that code is UTF-8 or compiling with UTF-8, regardless of BOM. Yes, but there is no standard solution, not even a de facto one, for consuming differently encoded source files in the same translation unit. A rather large codebase I've been working with has been working to remove encoding confusion, and it's a big task ... Yes, yes it is. Tom. -Shawn From: Unicode On Behalf Of Tom Honermann via Unicode Sent: Tuesday, October 13, 2020 1:47 PM To: J Decker ; Unicode List Cc: sg16 at lists.isocpp.org Subject: Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature On 10/12/20 8:09 PM, J Decker via Unicode wrote: On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode > wrote: On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote: One concern I have, that might lead into rationale for the current discouragement, is that I would hate to see a best practice that pushes a BOM into ASCII files. 
One of the nice properties of UTF-8 is that a valid ASCII file (still very common) is also a valid UTF-8 file. Changing best practice would encourage updating those files to be no longer ASCII. Thanks, Alisdair. I think that concern is implicitly addressed by the suggested resolutions, but perhaps that can be made more clear. One possibility would be to modify the "protocol designer" guidelines to address the case where a protocol's default encoding is ASCII based and to specify that a BOM is only required for UTF-8 text that contains non-ASCII characters. Would that be helpful? 'and to specify that a BOM is only required for UTF-8' - this should NEVER be 'required' or 'must'; it shouldn't even be 'suggested'. Fortunately a BOM is just a ZWNBSP, so it's certainly a 'may start with' such and such. These days the standard 'everything IS utf-8' works really well, except in Firefox, where the charset is required to be specified for JS scripts (but that's a bug in it). EBCDIC should be converted at the edge to internal ASCII, since, thankfully, this is a niche application and everything thinks in ASCII or some derivative thereof. A byte order mark is irrelevant to UTF-8 since its bytes are already in the correct order. I have run into several editors that have insisted on emitting a BOM for UTF-8 when a file is initially promoted from ASCII, but subsequently deleting it doesn't bother anything. I mostly agree. Please note that the paper suggests use of a BOM only as a last resort. The goal is to further discourage its use with rationale. I am curious though, what was the actual problem you ran into that makes you even consider this modification? I'm working on improving support for portable C++ source code. Today, there is no character encoding that is supported by all C++ implementations (not even ASCII). I'd like to make UTF-8 that commonly supported character encoding. For backward compatibility reasons, compilers cannot change their default source code character encoding to UTF-8. 
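Alisdair's ASCII-compatibility point, and what a BOM does to it, can be checked mechanically (a minimal demonstration, assuming nothing beyond the standard codecs module):

```python
import codecs

ascii_source = b"int main() { return 0; }\n"

# Every valid ASCII file is, byte for byte, a valid UTF-8 file.
assert ascii_source.decode("ascii") == ascii_source.decode("utf-8")

# Prepending the UTF-8 BOM (EF BB BF, the encoded form of U+FEFF)
# makes the file non-ASCII, which is exactly the concern raised above.
with_bom = codecs.BOM_UTF8 + ascii_source
try:
    with_bom.decode("ascii")
except UnicodeDecodeError:
    print("BOM-prefixed file is no longer valid ASCII")

# Tools that tolerate a signature typically just strip it on input:
assert with_bom.decode("utf-8-sig") == ascii_source.decode("ascii")
```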
Most C++ applications are created from components that have different release schedules and that are maintained by different organizations. Synchronizing a conversion to UTF-8 across dependent projects isn't feasible, nor is converting all of the source files used by an application to UTF-8 as simple as just running them through 'iconv'. Migration to UTF-8 will therefore require an incremental approach for at least some applications, though many are likely to find success by simply invoking their compiler with the appropriate -everything-is-utf8 option since most source files are ASCII. Microsoft Visual C++ recognizes a UTF-8 BOM as an encoding signature and allows differently encoded source files to be used in the same translation unit. Support for differently encoded source files in the same translation unit is the feature that will be needed to enable incremental migration. Normative discouragement (with rationale) for use of a BOM by the Unicode standard would be helpful to explain why a solution other than a BOM (perhaps something like Python's encoding declaration) should be standardized in favor of the existing practice demonstrated by Microsoft's solution. Tom. J Tom. AlisdairM On Oct 10, 2020, at 14:54, Tom Honermann via SG16 > wrote: Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text. This is follow up to discussion on the Unicode mailing list back in June. Feedback is welcome. I plan to submit this to the UTC in a week or so pending review feedback. Tom. -- SG16 mailing list SG16 at lists.isocpp.org https://lists.isocpp.org/mailman/listinfo.cgi/sg16 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mnh48mail at gmail.com Fri Oct 16 14:09:14 2020 From: mnh48mail at gmail.com (Yaya MNH48) Date: Sat, 17 Oct 2020 03:09:14 +0800 Subject: Question for Malay Jawi letter in Unicode - three quarter hamza In-Reply-To: <7ba2a19c-7635-9fd5-79e1-b8453385247c@sil.org> References: <7ba2a19c-7635-9fd5-79e1-b8453385247c@sil.org> Message-ID: Looking at the code chart, it seems to note that ARABIC LETTER HIGH HAMZA is actually used in Kazakh and also forms digraphs. Is the position of it really that high in Kazakh in the first place? Would the change to lower it make the character invalid in Kazakh? Lorna wrote: > Your examples look lower > than that, but that would be a font issue, not an encoding issue. When writing Jawi in Malay, the position that I showed in the earlier example is the correct position; it should be that low in Jawi. Lorna wrote: > Maybe we should consider adding an annotation for U+0674 to say that > this character should be used for Jawi. If it is indeed the correct character, then do add the annotation for it; maybe also mention that this is the "three-quarter hamza" or something similar so that people actually know about its existence. In addition, a note should perhaps be put somewhere about the size and position so that font makers would be aware of how it should look in Jawi - something similar to the early note on the letter NYA (U+06BD - Arabic Letter Noon with Three Dots Above), whose shape is noted in the 13.0.0 core specification, Chapter 9, page 386. The size of the three-quarter hamza here is the same as the size of the regular hamza, not slightly smaller. Its position is only slightly higher, not reaching the topmost point of the alef: it sits at the third quarter of the line, not the fourth (topmost) quarter. The regular hamza, on the other hand, sits at the baseline (first quarter), and that is already correct. 
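The code point under discussion can be inspected directly from the Unicode Character Database bundled with Python (a lookup only; it says nothing about the glyph-position question):

```python
import unicodedata

high_hamza = "\u0674"
print(unicodedata.name(high_hamza))      # ARABIC LETTER HIGH HAMZA
print(unicodedata.category(high_hamza))  # Lo (Letter, other)

# The Kazakh digraphs mentioned in the code chart notes are separate
# precomposed characters immediately following it, e.g.:
print(unicodedata.name("\u0675"))        # ARABIC LETTER HIGH HAMZA ALEF
```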
Of course, when font makers design their fonts, the default glyph for the character will need to compromise between Kazakh (where it is annotated to be used) and Jawi in terms of position and sizing, but font makers should make their fonts display the correct position and sizing when the correct language is chosen in the software people are using. On 10/16/20, Lorna Evans wrote: > I believe you should use 0674;ARABIC LETTER HIGH > HAMZA;Lo;0;AL;;;;;N;ARABIC LETTER HIGH HAMZAH > > As you say, some fonts like Amiri position it lower. > > I have a document being discussed to bring the position of the glyph > down a bit from where it is in the codecharts so it's about even with > the top of the alef rather than much higher. Your examples look lower > than that, but that would be a font issue, not an encoding issue. > > Maybe we should consider adding an annotation for U+0674 to say that > this character should be used for Jawi. > > Lorna > > On 10/13/2020 7:00 PM, Yaya MNH48 via Unicode wrote: >> Hello everyone, I'm new on this list. >> >> I've got a question for Jawi (Malay in Arabic script) in Unicode. >> >> >> ## The codepoint for Jawi Letter Hamza Three Quarter High? >> >> I have been seeing "Jawi Letter Hamzah Three Quarter High" (sic) >> mentioned on many local documents, that it was not in Unicode but said >> to be proposed back then to be included after Unicode 5.0, but I could >> not find it on Unicode even on the current version 13.0. Is "Jawi >> Letter Hamzah Three Quarter High" even formally proposed and encoded >> yet? If so, what is the actual codepoint for it in Unicode? or did no >> one brought it to Unicode's attention in the first place? 
>> >> Note: The spelling "Hamzah" in all those documents are influenced from >> Malay (and also pronounced as such in Malay, with an H sound at the >> end), seemed like none of them realized that the final H does not >> exist in the English spelling "Hamza", and they just carried over the >> Malay spelling "Hamzah" in all of their English documents. For >> consistency, I'm using the proper English spelling "Hamza" instead of >> the actual spelling being used over here "Hamzah". >> >> >> While most of the documents are local, some of them do exist online, >> such as this document (linked after quote) in 2009 from Malaysia >> Network Information Center (MYNIC) to Internet Assigned Numbers >> Authority (IANA) for inclusion in their repository of >> Internationalized Domain Names (IDN) tables for .my Malay >> (macrolanguage) (Malaysia) entry, in which the document ends with >> (sic) >>> This character is not in the Unicode Table 5.0. The linguist came up with >>> the decision to propose the inclusion of Jawi Letter Hamzah Three Quarter >>> into the Unicode table. >> Link: https://www.iana.org/domains/idn-tables/tables/my_ms-my_1.0.pdf >> (via https://www.iana.org/domains/idn-tables ) >> >> >> The Jawi letter Hamza Three Quarter High (marked as HTQ in the >> examples from now onwards) is part of everyday use words, for example >> in the Jawi spelling of the word "air" which mean "water" or "drink" >> (noun) which is [alef-HTQ-yeh-reh]. Another example would be in the >> Jawi spelling of the word "perduaan" which mean "binary" (term in >> computing, science and mathematics) which is >> [PA(Veh)-reh-dal-waw-alef-HTQ-noon]. >> >> These are the most common usage of Hamza Three Quarter High in Malay >> Jawi: >> >> [A] consecutive vowels for au, ai, and ui, mostly in native Malay words. >> Example: >> ? "laut" [la?ut] (meaning: sea) is spelt [lam-alef-HTQ-waw-teh] >> ? "baik" [ba?i?] (meaning: good / nice) is spelt [beh-alef-HTQ-yeh-qaf] >> ? 
"buih" [bu?ih] (meaning: bubble) is spelt [beh-waw-HTQ-yeh-heh] >> >> [B] diphtongs for au and ai, mostly in Malay words loaned from English. >> Example: >> ? "audio" [au?dio] (meaning: audio) is spelt [alef-HTQ-waw-dal-yeh-waw] >> ? "aising" [ai?si?] (meaning: icing) is spelt >> [alef-HTQ-yeh-seen-yeh-NGA(AinWithThreeDotsAbove)] >> >> [C] suffix -an after vowel a, and suffix -i after vowels a and u, as >> part of Malay grammar rule which attaches affixes to modify word form. >> Example: >> ? "kenyataan" [k???a?ta?an] (meaning: statement) is spelt >> [keheh-NYA(NoonWithThreeDotsAbove)-alef-teh-alef-HTQ-noon], it got >> affixed from root word "nyata" [?a?ta] (meaning: to state) >> ? "cintai" [t?in?ta?i] (meaning: love (conjugated verb)) is spelt >> [tcheh-yeh-noon-teh-alef-HTQ-yeh], it got affixed from root word >> "cinta" [t?in?ta] (meaning: to love) >> ? "melalui" [m??la?lu?i] (meaning: via / through) is spelt >> [meem-lam-alef-lam-waw-HTQ-yeh], it got affixed from root word "lalu" >> [la?lu] (meaning: to pass through) >> >> [D] spelling of Malaysian Chinese names in Malay. >> Example: >> ? "Ng" [??] (multiple family names including ?) is spelt >> [HTQ-NGA(AinWithThreeDotsAbove)] >> ? "Ong" [o?] (multiple family names including ?) is spelt >> [HTQ-waw-NGA(AinWithThreeDotsAbove)] >> >> Image of the word spellings: >> https://jawi.mnh48.moe/assets/img/email/word-with-hamza-three-quarter-high.png >> >> Image of sample sentences: >> https://jawi.mnh48.moe/assets/img/email/sentence-with-hamza-three-quarter-high.png >> >> >> I'm seeing people using either superscripted version of Arabic Letter >> Hamza (U+0621) or abuses Arabic Letter High Hamza (U+0674) but those >> doesn't help in plain text where none of the abused formatting would >> work in the first place. Some people even gave up and just create >> image with the correctly positioned letters and uses that image in >> place of text. 
Some font, notably Amiri, actually displayed Arabic >> Letter High Hamza on three-quarter high, probably (but not necessarily >> the case) to accommodate Jawi users who abuses that letter for their >> Hamza Three Quarter High, but that would lock users to use only that >> font as other fonts still display the original height. >> >> Link to Amiri font: https://www.amirifont.org/ >> >> >> The letter is the same size as regular Hamza, not any smaller like the >> size of High Hamza or the Hamza above/below Alef. It is positioned at >> three-quarter height of the writing line, unlike Arabic Letter High >> Hamza that displays on the highest position on the line nor regular >> Hamza that displays on the baseline. >> >> The letter is also separate from regular Hamza, which is mostly used >> for Arabic or Sanskrit loanwords in Malay. Regular Hamza and Three >> Quarter High Hamza co-exist in Malay Jawi, could exist in the same >> sentence or even the same word, but should not be shown at the same >> height in all cases including plain text, otherwise it could cause >> confusion when reading since it could signal different sound, and it >> is grammatically wrong as well. >> >> To recap the questions from first paragraph in case you are still >> unclear on the actual questions: Is "Jawi Letter Hamzah Three Quarter >> High" even formally proposed and encoded yet? If so, what is the >> actual codepoint for it in Unicode? or did no one brought it to >> Unicode's attention in the first place? >> >> I'm looking forward for more information regarding this. 
>> >> Best regards, >> [Yaya] >> Yaya MNH48 >> >> [A PDF version of this email is also attached as attachment, however >> it has slightly different formatting as I could directly mark the text >> in formatting so that it will be displayed correctly and so the >> additional images are not needed, unlike the plaintext email where >> none of the formatting would work] > From doug at ewellic.org Fri Oct 16 14:24:21 2020 From: doug at ewellic.org (Doug Ewell) Date: Fri, 16 Oct 2020 13:24:21 -0600 Subject: Question for Malay Jawi letter in Unicode - three quarter hamza In-Reply-To: References: <7ba2a19c-7635-9fd5-79e1-b8453385247c@sil.org> Message-ID: <000b01d6a3f1$f3586270$da092750$@ewellic.org> Yaya MNH48 wrote: > Of course, when font maker design their font, the default glyph for > the character will need to compromise between Kazakh (where it is > annotated to be used in) and Jawi in terms of position and sizing, This is a typical approach used by Latin-script font designers to make the acute accent look acceptable in both French and Polish. > but > font makers should make their font display the correct position and > sizing when the correct language is chosen in the software people are > using. I'm pretty sure that's not how fonts work. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From wjgo_10009 at btinternet.com Fri Oct 16 14:27:19 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 16 Oct 2020 20:27:19 +0100 (BST) Subject: Teletext separated mosaic graphics Message-ID: <1de17d.936.17532e17d27.Webtop.73@btinternet.com> I have recently found the following links of interest. https://teletextarchaeologist.org/ https://zxnet.co.uk/teletext/viewer/?channel=1&page=100 There is a simulated electronic control gadget at the right of the page. 
https://zxnet.co.uk/teletext/viewer/?channel=3&page=100 https://www.techradar.com/news/internet/how-teletext-and-ceefax-are-coming-back-from-the-dead-1326145 William Overington Friday 16 October 2020 -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri Oct 16 15:08:23 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 16 Oct 2020 21:08:23 +0100 (BST) Subject: Teletext separated mosaic graphics In-Reply-To: <000201d6a3e0$7d8b95f0$78a2c1d0$@ewellic.org> References: <42717374.14f0.174e543c410.Webtop.49@btinternet.com> <000001d699b6$b02f5d40$108e17c0$@ewellic.org> <134606a0.4bf.174f8b06e2c.Webtop.218@btinternet.com> <21414589.4d0.174f8b5a317.Webtop.218@btinternet.com> <508ec02.11be.175225f0ec6.Webtop.231@btinternet.com> <547afa9d-fc6a-7fd3-6d74-591b4819eb6b@ix.netcom.com> <5CC9C8F2-7162-4D08-A832-088DAC5FE163@bahnhof.se> <000201d6a3e0$7d8b95f0$78a2c1d0$@ewellic.org> Message-ID: Doug Ewell wrote as follows. > It would be great if we could converge on a solution for this that > would align with the guidance of UTC and Script Ad Hoc. Well, it would. However, in my opinion that is not the best solution available, and I hope that my proposed plane 14 solution will be considered, please. I have been thinking of how best to proceed and, as an encoding decision is unlikely to be made until at least the next meeting of the Unicode Technical Committee, there is the opportunity to give serious consideration to the matter; any opinions formed can then be put forward to the Unicode Technical Committee at that time. I suggest that a way to do this is to have a plane 15 Private Use Area encoding available for experimentation. If the people interested all use the same Private Use Area encoding, that would possibly give as good an experience as possible without a formal regular Unicode encoding. 
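Any agreed experimental encoding would sit in Supplementary Private Use Area-A (plane 15, U+F0000..U+FFFFD), where every code point is private use by definition; the specific range suggested below, U+F7000 through U+F701F, can be sanity-checked in a couple of lines (an illustrative check, not part of the original mail):

```python
import unicodedata

FIRST, LAST = 0xF7000, 0xF701F  # thirty-two experimental code points

# General_Category "Co" marks private-use code points; all of plane 15
# up to U+FFFFD carries it, so the range is free for such experiments.
assert all(unicodedata.category(chr(cp)) == "Co"
           for cp in range(FIRST, LAST + 1))

print(f"U+{FIRST:05X}..U+{LAST:05X}: {LAST - FIRST + 1} code points")
```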
I suggest that, to start off, all thirty-two teletext control codes of the 1976 broadcast teletext specification be encoded, in the order in which they appear in that specification, from U+F7000 through to U+F701F, with character names along the following pattern. TELETEXT INFORMAL ARCHIVING ALPHANUMERICS GREEN All thirty-two would have a name starting with TELETEXT INFORMAL ARCHIVING. I appreciate that only twenty-seven were used in teletext broadcasts, yet for completeness I suggest encoding all thirty-two. For the displayable glyphs, each would be two capital letters, one above the other, upon a pale. For the avoidance of doubt, they are not superimposed one over the other. For example, for TELETEXT INFORMAL ARCHIVING ALPHANUMERICS GREEN the displayable glyph would be an A above a G upon a pale. Where the original name has two words, the first letter of each word is used; where the original name has only one word, the first two letters of the word are used. Here is a link to the display of the code names; there is a facility to zoom in on the display. https://archive.org/details/broadcast_teletext_specification_1976/page/n25/mode/2up William Overington Friday 16 October 2020 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From junicode at jcbradfield.org Fri Oct 16 16:17:18 2020 From: junicode at jcbradfield.org (Julian Bradfield) Date: Fri, 16 Oct 2020 22:17:18 +0100 (BST) Subject: Question for Malay Jawi letter in Unicode - three quarter hamza References: <7ba2a19c-7635-9fd5-79e1-b8453385247c@sil.org> <000b01d6a3f1$f3586270$da092750$@ewellic.org> Message-ID: On 2020-10-16, Doug Ewell via Unicode wrote: > Yaya MNH48 wrote: >> Of course, when font maker design their font, the default glyph for >> the character will need to compromise between Kazakh (where it is >> annotated to be used in) and Jawi in terms of position and sizing, > This is a typical approach used by Latin-script font designers to make the acute accent look acceptable in both French and Polish. >> but >> font makers should make their font display the correct position and >> sizing when the correct language is chosen in the software people are >> using. > > I'm pretty sure that's not how fonts work. Isn't that what language system tags and the locl feature in OpenType are for? Indeed, a moment's googling shows people making fonts with language-dependent acutes. From asmusf at ix.netcom.com Fri Oct 16 16:47:45 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 16 Oct 2020 14:47:45 -0700 Subject: Question for Malay Jawi letter in Unicode - three quarter hamza In-Reply-To: References: <7ba2a19c-7635-9fd5-79e1-b8453385247c@sil.org> <000b01d6a3f1$f3586270$da092750$@ewellic.org> Message-ID: An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Fri Oct 16 16:51:40 2020 From: doug at ewellic.org (Doug Ewell) Date: Fri, 16 Oct 2020 15:51:40 -0600 Subject: Question for Malay Jawi letter in Unicode - three quarter hamza In-Reply-To: References: <7ba2a19c-7635-9fd5-79e1-b8453385247c@sil.org> <000b01d6a3f1$f3586270$da092750$@ewellic.org> Message-ID: <001401d6a406$87c59400$9750bc00$@ewellic.org> Julian Bradfield wrote: >>> but >>> font makers should make their font display the correct position and >>> sizing when the correct language is chosen in the software people are >>> using. >> >> I'm pretty sure that's not how fonts work. > > Isn't that what language system tags and the locl feature in OpenType > are for? > > Indeed, a moment's googling shows people making fonts with language- > dependent acutes. Sorry, that was hasty; I put the burden on the font instead of the software. I meant to suggest that much of the time, software is not configured to allow a choice of language, or it does not communicate that choice to the rendering engine so the font can do the right thing. For example, when typing this email in Microsoft Outlook, I have no idea whether it is configured to know that I am writing in English (or French or Polish, where language-dependent acutes would matter) or whether it would tell Windows about that if it knew. Since language tagging is considered to belong to the domain of fancy text, I should try experimenting with French-tagged and Polish-tagged HTML, with a variety of fonts and browsers. The plainer the text, the less likely I suspect any of this process is to exist. I certainly can't tell Notepad++ or even BabelPad what language my text is in. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From mark at kli.org Fri Oct 16 16:55:38 2020 From: mark at kli.org (Mark E. 
Shoulson) Date: Fri, 16 Oct 2020 17:55:38 -0400 Subject: Question for Malay Jawi letter in Unicode - three quarter hamza In-Reply-To: References: <7ba2a19c-7635-9fd5-79e1-b8453385247c@sil.org> Message-ID: An HTML attachment was scrubbed... URL: From tom at honermann.net Sat Oct 17 16:12:40 2020 From: tom at honermann.net (Tom Honermann) Date: Sat, 17 Oct 2020 17:12:40 -0400 Subject: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature In-Reply-To: References: <78a4d6ff-d09b-8230-def4-917554b1d568@honermann.net> <263C91E2-8EB6-4102-981D-80A1CC44F45D@me.com> <93234e19-927e-f823-8748-ec65fc6d5602@honermann.net> Message-ID: <013aaa64-e1b1-3d20-195b-f9b6d6de5026@honermann.net> On 10/16/20 2:33 PM, Shawn Steele wrote: > > Nobody's going to consider #1 regardless of what wordsmithing is done > in Unicode; people have had too much difficulty with BOMs for it to be > considered as a serious standards-based solution. > It isn't clear to me that everyone will agree with that perspective. I've heard from people that continue to use BOMs in UTF-8 text in this thread. We have strong consensus within SG16 that we don't want #1; additional support from outside WG21 will help to make the case for a different approach. > > #4 isn't portable. > Correct, but WG21 may find it sufficient for all implementations to provide some implementation-defined means to identify UTF-8 source code without that means being a portable solution. > > The "right" approach would be to ensure that the languages have ways > of declaring a codepage (like a pragma or other magic semantic, > options 2 & 3). > That matches my preference. > > > The time invested on this problem should be spent on getting agreement > with WG21 about what the declaration should be and seeing if there are > any "gotchas" to something like #pragma UTF8. 
IMO, it's not worth the > effort to try to tweak Unicode's guidance in order to > support the common view that BOMs are bad, which WG21 won't be > considering anyway. > My motivation is not solely to support the eventual WG21 proposal. The responses I've seen to the paper (some of which have been private) have made it clear to me that there is not a common understanding of what the Unicode standard states about use of a BOM as an encoding signature in UTF-8. I think it is worth clarifying. > > > The biggest thing I can think of is that very few codepages would lend > themselves to being declared in a portable manner. Different > OSes/software/vendors have different implementations of various > codepages. Even ones that are nominally similar are often mistagged > or have subtle differences. > > In other words, "UTF8" is about the only "safe" encoding that won't > have edge cases. Something like "shift-jis" has multiple legacy > variations that mean everything won't always be the same. > I agree. If WG21 opts to standardize an encoding declaration, I suspect we would only mandate support for UTF-8, and maybe ASCII, with any other supported encodings being implementation-defined. Tom. > -Shawn > > *From:* Tom Honermann > *Sent:* Friday, October 16, 2020 6:23 AM > *To:* Shawn Steele ; J Decker > > *Cc:* sg16 at lists.isocpp.org > *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of a > BOM as a UTF-8 encoding signature > > On 10/14/20 3:21 PM, Shawn Steele wrote: > > How are you going to #include differently encoded source files? I > don't see anything in this document that would make it possible to > #include a file in a different encoding. It's unclear to me how > your proposed document could be utilized to enable the scenario > you're interested in. > > My intention is to present various options for WG21 to consider along > with a recommendation. The options that have been identified so far > are listed below. 
Combinations of some of these options are a possibility. > > 1. Use of a BOM to indicate UTF-8 encoded source files. This matches > existing practice for the Microsoft compiler. > 2. Use of a #pragma. This matches existing practice > > for the IBM compiler. > 3. Use of a "magic" or "semantic" comment. This matches existing > practice > > in Python. > 4. Use of filesystem metadata. This is an option for some compilers > and is being considered for Clang on z/OS. > > The goal of this paper is to clarify guidance in the Unicode standard > in order to better inform and justify a recommendation. If the UTC > were to provide a strong recommendation either for or against use of a > BOM in UTF-8 files, that would be a point either in favor of or in > opposition to option 1 above. As is, based on my reading and a number > of the responses I've seen, the guidance is murky. > > For mixed-encoding behavior the only thing I could imagine is > adding some sort of preprocessor #codepage or something to the > standard. (Which would again take a while to reach critical mass.) > > Yes, deployment will take time in any case. A goal would be to choose > an option that can be used as an extension for previous C++ > standards. This may rule out option 2 above since some compilers > diagnose use of an unrecognized pragma. > > Tom. > > -Shawn > > *From:* Tom Honermann > *Sent:* Tuesday, October 13, 2020 9:47 PM > *To:* Shawn Steele > ; J Decker > > *Cc:* sg16 at lists.isocpp.org > *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of > a BOM as a UTF-8 encoding signature > > On 10/13/20 5:19 PM, Shawn Steele wrote: > > IMO this document doesn't solve your problem. The goal of > encouraging use of UTF-8 in C++ source code is one that most > compilers/source code authors/etc. are totally on board with. > > The source is already in an indeterminate state. The desired > end state is to have UTF-8 source code (without BOM), which is > typically supported. 
The difficulty is therefore getting from > point A to point B. As far as "Use Unicode" goes, there's no > issue, but trying to specify BOM as a protocol doesn't really > solve the problem, particularly in complex environments. > > I think there is a misunderstanding. The intent of the paper is > to provide rationale for the existing discouragement of use of a > BOM in UTF-8 while acknowledging that, in some cases, it may > remain useful. My intent is to discourage use of a BOM for UTF-8 > encoded source files - thereby arguing against standardizing the > behavior exhibited by Microsoft Visual C++ today. > > > If the compiler doesn't handle the BOM as expected, then you'll > get errors. This can be further complicated by preprocessors, > #include, resources, etc. If "specifying BOM behavior in > Unicode" could help solve the problem, then all of the tooling > used by everyone would have to be updated to handle that (new) > requirement. If you could get everyone on the same page, > they'd all use UTF-8, so you wouldn't need to update the > tooling. If you don't need to update the tooling, you > wouldn't need to update the best practices for BOMs. > > This paper does not propose "specifying BOM behavior in Unicode". > If you feel that it does, please read it again and let me know > what leads you to believe that it does. > > The tooling isn't the problem. The problem is the existing source > code that is not UTF-8 encoded or that is UTF-8 encoded with a > BOM. The deployment challenge is with those existing source > files. Microsoft Visual C++ is going to continue consuming source > files using the Active Code Page (ACP) and IBM compilers on EBCDIC > platforms are going to continue consuming source files using > EBCDIC code pages. The goal is to provide a mechanism where a > UTF-8 encoded source file can #include a source file in another > encoding or vice versa. Any solution for that will require > tooling updates (and that is ok). 
>
> Personally, I'd prefer it if cases like this ignored BOMs (or used
> them to switch to UTF-8); e.g., treat BOMs like whitespace. But
> this isn't a problem solvable by any recommendation from Unicode.
>
> When consuming text as UTF-8, I agree that ignoring a BOM is
> usually the right thing to do and would be the right thing to do
> when consuming source code.
>
> As you noted, many systems provide mechanisms for indicating
> that code is UTF-8 or compiling with UTF-8, regardless of BOM.
>
> Yes, but there is no standard solution, not even a de facto one,
> for consuming differently encoded source files in the same
> translation unit.
>
> A rather large codebase I've been working with has been
> working to remove encoding confusion, and it's a big task ...
>
> Yes, yes it is.
>
> Tom.
>
> -Shawn
>
> *From:* Unicode *On Behalf Of* Tom Honermann via Unicode
> *Sent:* Tuesday, October 13, 2020 1:47 PM
> *To:* J Decker ; Unicode List
> *Cc:* sg16 at lists.isocpp.org
> *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use
> of a BOM as a UTF-8 encoding signature
>
> On 10/12/20 8:09 PM, J Decker via Unicode wrote:
>
> On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode
> wrote:
>
> On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
>
> One concern I have, that might lead into rationale
> for the current discouragement, is that I would hate to see a
> best practice that pushes a BOM into ASCII files.
>
> One of the nice properties of UTF-8 is that a valid ASCII file
> (still very common) is also a valid UTF-8 file. Changing best
> practice would encourage updating those files to be no longer
> ASCII.
>
> Thanks, Alisdair. I think that concern is implicitly
> addressed by the suggested resolutions, but perhaps
> that can be made more clear.
> One possibility would be
> to modify the "protocol designer" guidelines to
> address the case where a protocol's default encoding
> is ASCII based and to specify that a BOM is only
> required for UTF-8 text that contains non-ASCII
> characters. Would that be helpful?
>
> 'and to specify that a BOM is only required for UTF-8' -
> this should NEVER be 'required' or 'must'; it shouldn't
> even be 'suggested'. Fortunately a BOM is just a ZWNBSP, so
> it's certainly a 'may' start with such and such.
>
> These days the standard 'everything is UTF-8' works really
> well, except in Firefox, where the charset is required to
> be specified for JS scripts (but that's a bug in that).
>
> EBCDIC should be converted on the edge to internal ASCII,
> since, thankfully, this is a niche application and
> everything thinks in ASCII or some derivative thereof.
>
> A byte order mark is irrelevant to UTF-8, since its bytes are
> already in the correct order.
>
> I have run into several editors that have insisted on
> emitting a BOM for UTF-8 when a file is initially promoted from
> ASCII, but subsequently deleting it doesn't bother anything.
>
> I mostly agree. Please note that the paper suggests use of a
> BOM only as a last resort. The goal is to further discourage
> its use with rationale.
>
> I am curious though, what was the actual problem you ran
> into that makes you even consider this modification?
>
> I'm working on improving support for portable C++ source
> code. Today, there is no character encoding that is supported
> by all C++ implementations (not even ASCII). I'd like to make
> UTF-8 that commonly supported character encoding. For
> backward compatibility reasons, compilers cannot change their
> default source code character encoding to UTF-8.
>
> Most C++ applications are created from components that have
> different release schedules and that are maintained by
> different organizations.
> Synchronizing a conversion to UTF-8
> across dependent projects isn't feasible, nor is converting
> all of the source files used by an application to UTF-8 as
> simple as just running them through 'iconv'. Migration to
> UTF-8 will therefore require an incremental approach for at
> least some applications, though many are likely to find
> success by simply invoking their compiler with the appropriate
> -everything-is-utf8 option, since most source files are ASCII.
>
> Microsoft Visual C++ recognizes a UTF-8 BOM as an encoding
> signature and allows differently encoded source files to be
> used in the same translation unit. Support for differently
> encoded source files in the same translation unit is the
> feature that will be needed to enable incremental migration.
> Normative discouragement (with rationale) of use of a BOM by
> the Unicode standard would be helpful to explain why a
> solution other than a BOM (perhaps something like Python's
> encoding declaration) should be standardized in favor of the
> existing practice demonstrated by Microsoft's solution.
>
> Tom.
>
> J
>
> Tom.
>
> AlisdairM
>
> On Oct 10, 2020, at 14:54, Tom Honermann via SG16 wrote:
>
> Attached is a draft proposal for the Unicode standard that
> intends to clarify the current recommendation regarding use
> of a BOM in UTF-8 text. This is follow-up to discussion on
> the Unicode mailing list back in June.
>
> Feedback is welcome. I plan to submit this to the UTC in a
> week or so pending review feedback.
>
> Tom.
>
> --
> SG16 mailing list
> SG16 at lists.isocpp.org
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kittens at wobble.ninja Tue Oct 20 03:30:11 2020
From: kittens at wobble.ninja (Ellie)
Date: Tue, 20 Oct 2020 10:30:11 +0200
Subject: Please fix the trademark policy in regards to code
In-Reply-To: 
References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja>
 <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net>
 <20201001192635.2cccbb7c@JRWUBU2>
Message-ID: <3f62a744-417e-0bd1-753a-b6e407be2a82@wobble.ninja>

> I believe rather than attaching a symbol to every instance of a
> trademark, a single comment at the beginning of the file would
> suffice, e.g. "Within this file, the word 'Unicode', and its variants,
> refers to the Unicode(R) Standard. Unicode(R) is a registered
> trademark of the Unicode Consortium". Or some similar legal
> boilerplate.

But that might become a very long text in some files if all of those
were always listed (Windows, Linux, macOS, Apple - from ifdefs -,
Unicode, ...). And for what it's worth, I don't think I've ever seen a
source code file mentioning Linux that has a trademark note at the top.
(Including those produced by people working at the Linux Foundation for
the Linux kernel.) So even that suggestion doesn't seem to be what is
commonly done in practice. Which in my opinion would make a remark, at
the very least about fair use, helpful, like the Linux guidelines put
it. Even if clarification is apparently not legally required, many open
source hobby devs probably can't pay lawyers to verify that. So it
still makes a difference IMHO to ease the mind by noting this in the
text, as obvious as it might be.

Regards,

Ellie

On 10/7/20 5:31 PM, Sławomir Osipiuk via Unicode wrote:
> On Wed, Oct 7, 2020 at 9:04 AM Ellie via Unicode wrote:
>>
>> I would find such a remark helpful, although the last sentence kind of
>> makes it again sound like they expect me to put (R) into the source
>> code, which I find a bit unfortunate. Some qualifier like "you should,
>> +where that is practical to do, acknowledge ..."
>> might help alleviate this, however.
>
> I believe rather than attaching a symbol to every instance of a
> trademark, a single comment at the beginning of the file would
> suffice, e.g. "Within this file, the word 'Unicode', and its variants,
> refers to the Unicode(R) Standard. Unicode(R) is a registered
> trademark of the Unicode Consortium". Or some similar legal
> boilerplate.
>
> That said, I think a note regarding source code and filenames can be
> added to the "Special Situations" section of the page you originally
> linked and would be helpful.
>
> Sławomir Osipiuk
>

From wjgo_10009 at btinternet.com Tue Oct 20 17:19:13 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 20 Oct 2020 23:19:13 +0100 (BST)
Subject: Please fix the trademark policy in regards to code
In-Reply-To: <3f62a744-417e-0bd1-753a-b6e407be2a82@wobble.ninja>
References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja>
 <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net>
 <20201001192635.2cccbb7c@JRWUBU2>
 <3f62a744-417e-0bd1-753a-b6e407be2a82@wobble.ninja>
Message-ID: <1485d26d.15b2.17548184cce.Webtop.50@btinternet.com>

I am not a lawyer, but I seem to remember something in United Kingdom
patent law that, whilst not the same issue as here, may have a sort of
resonant parallel. There is something about there being no legal force
in declaring upon an object, or in literature describing the object,
that the object is patented unless the registration number of the
patent is also stated.

So, is it reasonable for Unicode, Inc. to request or require on a web
page that people repeat a statement that Unicode is a registered
trademark when Unicode, Inc. does not provide any checkable evidence
upon that web page that that is the case?

If a person does state that message and is then challenged to prove
that claim, how does the person do that?
I remember a magazine editor saying that, when reporting the content of
a press release from a business about a new product and what it could
do, the magazine would always list statements in the press release as
"claims". People may well have no reason to doubt the truth of the
claims that Unicode, Inc. makes about trademark registration, but that
is not the same as people being requested or required to repeat those
claims as if fact without checkable evidence before them.

William Overington

Tuesday 20 October 2020

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From junicode at jcbradfield.org Wed Oct 21 07:59:27 2020
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Wed, 21 Oct 2020 13:59:27 +0100 (BST)
Subject: Please fix the trademark policy in regards to code
References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja>
 <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net>
 <20201001192635.2cccbb7c@JRWUBU2>
 <3f62a744-417e-0bd1-753a-b6e407be2a82@wobble.ninja>
 <1485d26d.15b2.17548184cce.Webtop.50@btinternet.com>
Message-ID: 

On 2020-10-20, William_J_G Overington via Unicode wrote:
> There is something about there being no legal force in declaring upon an
> object or in literature describing the object that the object is
> patented unless the registration number of the patent is also stated.

I don't think so. While plastering your patent number over everything
makes infringement claims easier, it's the responsibility of the other
inventor or producer to look for relevant patents. The so-called
"innocent infringement" defence requires the defendant to prove that
they had no knowledge AND had no reasonable grounds for supposing the
existence of a patent. If you have seen a claim that something is
patented, you have reasonable grounds for supposing the existence of a
patent, regardless of whether its number is given.

> So, is it reasonable for Unicode, Inc.
> to request or require on a web
> page for people to repeat a statement that Unicode is a registered
> trademark when Unicode, Inc. does not provide any checkable evidence
> upon that web page that that is the case?
>
> If a person does state that message and is then challenged to prove
> that claim how does the person do that?

By searching the online US trademarks database, surprisingly enough.
Which you could have done before posting.

From copypaste at kittens.ph Fri Oct 23 00:19:11 2020
From: copypaste at kittens.ph (Fredrick Brennan)
Date: Thu, 22 Oct 2020 22:19:11 -0700
Subject: The Pitman English phonotypic alphabet and L2/11-153, L2/11-225
Message-ID: <17553e58428.cc7e5a624670.9187194122749323426@kittens.ph>

Hello friends,

I am working with Mr Ramachandran Rajaram on a proposal for the
encoding of Pitman shorthand in Unicode.

The history of this shorthand is such that the consonants and vowels
can be related to the English Phonotypic Alphabet, also created by Sir
Isaac Pitman. This alphabet was used to write several publications,
including the below "The Phonetic News" (1849):

https://twitter.com/gaskell_beth/status/1318267991667167234

There was a proposal to encode this by Mr Karl Pentzlin, docketed as
L2/11-153. After a comment it was followed up by another revised
proposal, sent by the German national body, L2/11-225.

This proposal has good attestation for all requested characters, so I
am trying to figure out why it failed. Because Mr Pentzlin seems to no
longer participate, could the proposal be adopted by someone else, that
is to say, me?

Among the characters that were not encoded is Latin capital letter
round top A, which the German NB proposed for U+A7AE, now occupied by
an unrelated character.
Without this character I am having trouble typesetting my document, and
I need to use the Private Use Area.

I am hoping that someone who is an expert in Unicode can help me find
out why this proposal failed and what needs to be changed in it so that
it can finally be accepted. While it is not in theory a requirement
that the proposal be accepted in order to encode Pitman shorthand,
because the two are so historically linked it would be preferable if I
could show the phonotypic version of each shorthand glyph.

Best,
Fred Brennan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Sun Oct 25 07:41:59 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 25 Oct 2020 12:41:59 +0000
Subject: Please fix the trademark policy in regards to code
In-Reply-To: 
References: <7842c80a-0b8f-5c77-f37d-a475ace078b9@wobble.ninja>
 <1f59a248-c8b7-a5ea-f658-28ed12a496f4@sonic.net>
 <20201001192635.2cccbb7c@JRWUBU2>
 <3f62a744-417e-0bd1-753a-b6e407be2a82@wobble.ninja>
 <1485d26d.15b2.17548184cce.Webtop.50@btinternet.com>
Message-ID: <20201025124159.6affd65d@JRWUBU2>

On Wed, 21 Oct 2020 13:59:27 +0100 (BST)
Julian Bradfield via Unicode wrote:

> By searching the online US trademarks database, surprisingly enough.
> Which you could have done before posting.

Or one could search UK records (online and free) and find that
'UNICODE' at least is protected in the EU until 15 February 2026 under
registration number WE00000892283. I think this is actually covered by
the Madrid Protocol, so it will still be valid in the UK in 2021.

Richard.