From wjgo_10009 at btinternet.com Thu Aug 4 10:51:27 2022
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 4 Aug 2022 16:51:27 +0100 (BST)
Subject: Emotes
Message-ID: <66a3caa6.c95f.182698df475.Webtop.119@btinternet.com>

An interesting document has recently become available in the Unicode
Technical Committee Current Documents Register on the topic of emotes.

https://www.unicode.org/L2/L2022/22180-inline-emotes.pdf

I had not known the word 'emote' as a noun, only as a verb.

I have found a Wikipedia article.

https://en.wikipedia.org/wiki/Emote

A placeholder entry is in place in the Current Document Register for
another document about emotes that has not yet been posted at the time
of the writing of this note.

Can we discuss emotes please?

William Overington

Thursday 4 August 2022

From sosipiuk at gmail.com Thu Aug 4 16:07:19 2022
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 4 Aug 2022 17:07:19 -0400
Subject: Emotes
In-Reply-To: <66a3caa6.c95f.182698df475.Webtop.119@btinternet.com>
References: <66a3caa6.c95f.182698df475.Webtop.119@btinternet.com>
Message-ID:

On Thu, Aug 4, 2022 at 12:10 PM William_J_G Overington via Unicode wrote:
>
> Can we discuss emotes please?
>

Inline image content is very firmly outside the scope of Unicode and
practically screams "higher-level protocol".

We should not be inventing yet another markup and/or syntax for things
that are solved problems. If you want to embed images in your text,
use an <img> tag with a base64 data-URI or an external URL, as
appropriate.

The one legitimate issue is that maybe you want the syntax to be
default-ignorable "for free" on platforms that don't support it and
the existing Unicode tag characters look mighty tempting. In that
area, the Unicode Standard can be a little helpful by clarifying and
formalizing the permitted use of tag characters for other protocols,
but it should in no way be defining those protocols itself.
Maybe declare U+E0010 through U+E001F private-use.

That's about it. Unicode is for text. Even emoji as single characters
are a pretty big stretch. Emoji ZWJ sequences even more so. I don't
think we should be stretching any further.

Sławomir Osipiuk

From mark at kli.org Thu Aug 4 16:54:57 2022
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 4 Aug 2022 17:54:57 -0400
Subject: Emotes
In-Reply-To:
References: <66a3caa6.c95f.182698df475.Webtop.119@btinternet.com>
Message-ID:

The question wasn't about emoji or inline image content; it was about
"emotes", a technique of phrasing chat messages in third person
instead of first person. I have no idea what this has to do with
Unicode, any more than chat-forum conventions like spelling "you" as
"u" or using CAPITALS to shout with. That is, they're transmitted by
means of Unicode, but they're content and not form or protocol, and
Unicode doesn't dictate content.

~mark

On 8/4/22 17:07, Sławomir Osipiuk via Unicode wrote:
> On Thu, Aug 4, 2022 at 12:10 PM William_J_G Overington via Unicode
> wrote:
>> Can we discuss emotes please?
>>
> Inline image content is very firmly outside the scope of Unicode and
> practically screams "higher-level protocol".
>
> We should not be inventing yet another markup and/or syntax for things
> that are solved problems. If you want to embed images in your text,
> use an <img> tag with a base64 data-URI or an external URL, as
> appropriate.
>
> The one legitimate issue is that maybe you want the syntax to be
> default-ignorable "for free" on platforms that don't support it and
> the existing unicode tag characters look mighty tempting. In that
> area, the Unicode Standard can be a little helpful by clarifying and
> formalizing the permitted use of tag characters for other protocols,
> but it should in no way be defining those protocols itself. Maybe
> declare U+E0010 through U+E001F private-use.
>
> That's about it. Unicode is for text. Even emoji as single characters
> are a pretty big stretch.
> Emoji ZWJ sequences even more so. I don't
> think we should be stretching any further.
>
> Sławomir Osipiuk

From sosipiuk at gmail.com Thu Aug 4 17:44:08 2022
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 04 Aug 2022 22:44:08 +0000
Subject: Emotes
In-Reply-To:
References:
Message-ID: <1659652909700.3628754344.3552603110@gmail.com>

On Thursday, 04 August 2022, 17:54:57 (-04:00), Mark E. Shoulson via
Unicode wrote:

> The question wasn't about emoji or inline image content; it was about
> "emotes", a technique of phrasing chat messages in third person
> instead of first person.

The question was about "emotes" as described in L2/22-180, which are
certainly "emoji or inline image content".

From mark at kli.org Thu Aug 4 19:08:36 2022
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 4 Aug 2022 20:08:36 -0400
Subject: Emotes
In-Reply-To: <1659652909700.3628754344.3552603110@gmail.com>
References: <1659652909700.3628754344.3552603110@gmail.com>
Message-ID: <2b495428-e077-7825-1641-37c0e2f1a71c@shoulson.com>

Ah, I was looking at the Wikipedia article that was linked, not
L2/22-180. I was *at* the UTC (virtually) when these were discussed,
and even chimed in, so I probably should have realized that better.

~mark

On 8/4/22 18:44, Sławomir Osipiuk via Unicode wrote:
> On Thursday, 04 August 2022, 17:54:57 (-04:00), Mark E. Shoulson via
> Unicode wrote:
>
> > The question wasn't about emoji or inline image content; it was
> > about "emotes", a technique of phrasing chat messages in third
> > person instead of first person.
>
> The question was about "emotes" as described in L2/22-180, which are
> certainly "emoji or inline image content".

From wjgo_10009 at btinternet.com Fri Aug 5 08:36:00 2022
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 5 Aug 2022 14:36:00 +0100 (BST)
Subject: Emotes
In-Reply-To:
References: <66a3caa6.c95f.182698df475.Webtop.119@btinternet.com>
Message-ID: <7774f87f.d52d.1826e384fa8.Webtop.101@btinternet.com>

Sławomir Osipiuk wrote as follows.

> The one legitimate issue is that maybe you want the syntax to be
> default-ignorable "for free" on platforms that don't support it and
> the existing unicode tag characters look mighty tempting.

It seems to me that one could have a graphics format that defaults
gracefully to give an indication of the intended graphic by using a
few symbols each accompanied by a Variation Selector, for example
herein, U+FE0C VARIATION SELECTOR-13.

Whereas, there are

U+25A0 BLACK SQUARE
U+25A1 WHITE SQUARE
U+2605 BLACK STAR

One could have the following.

Ordinary Unicode display is called text mode.
There is also graphics mode.

The sequence U+25A0 U+FE0C means

begin
  if in text mode then
    begin
      enter graphics mode;
      start a graphic with a black pixel in the upper left corner;
      set next_place be one pixel to the right of that black pixel;
    end
  elsif in graphics mode then
    begin
      place a black pixel at next_place;
      set next_place be one pixel to the right of that black pixel;
    end;
end.

The sequence U+25A1 U+FE0C means

begin
  if in text mode then
    begin
      enter graphics mode;
      start a graphic with a white pixel in the upper left corner;
      set next_place be one pixel to the right of that white pixel;
    end
  elsif in graphics mode then
    begin
      place a white pixel at next_place;
      set next_place be one pixel to the right of that white pixel;
    end;
end.

The sequence U+2605 U+FE0C means

begin
  if in graphics mode then
    begin
      set next_place to be at the start of the next row of the graphic;
    end;
end.

In graphics mode, carriage return, linefeed and space are ignored.

In graphics mode, any character received that is not a graphics mode
sequence causes graphics mode to be left gracefully and the received
character to be displayed in text mode after the graphic, the graphic
continuing to be displayed.

----

The above describes a first attempt to produce something to discuss.

The capability of the system could be extended to include colours.

The capability of the system could be extended to produce 3d images
too by using
U+25CB WHITE CIRCLE and
U+25B2 BLACK UP-POINTING TRIANGLE

The sequence U+25CB U+FE0C means

begin
  if in text mode then
    begin
      enter graphics mode;
      start a graphic with a transparent pixel in the upper left corner;
      set next_place be one pixel to the right of that transparent pixel;
    end
  elsif in graphics mode then
    begin
      place a transparent pixel at next_place;
      set next_place be one pixel to the right of that transparent pixel;
    end;
end.

The sequence U+25B2 U+FE0C means

begin
  if in graphics mode then
    begin
      set next_place to be at the start of the first pixel of the first
      row of the graphic, one layer forward of the present layer;
    end;
end.

In 3d images, each pixel would be regarded as a voxel.

Someone typesetting such a graphic could thus include carriage return
and line feed characters as they would not affect the graphic display
but would help to provide a graceful fallback display.

To help in providing a graceful fallback display, U+25B6 BLACK
RIGHT-POINTING TRIANGLE could be used in the system.

The sequence U+25B6 U+FE0C means

begin
  enter graphics mode;
  set next_place at the upper left corner;
end.

could be used followed by carriage return and line feed, if so desired.

That would mean that the fallback display would start on a new line of
the display rather than the first line of the fallback display not
being aligned with subsequent lines of the fallback display.

William Overington

Friday 5 August 2022

From mark at kli.org Fri Aug 5 19:04:14 2022
From: mark at kli.org (Mark E.
Shoulson)
Date: Fri, 5 Aug 2022 20:04:14 -0400
Subject: Emotes
In-Reply-To: <7774f87f.d52d.1826e384fa8.Webtop.101@btinternet.com>
References: <66a3caa6.c95f.182698df475.Webtop.119@btinternet.com> <7774f87f.d52d.1826e384fa8.Webtop.101@btinternet.com>
Message-ID: <8c44d6e3-ce19-b85f-fa6d-cadae0528b77@shoulson.com>

Will have to look more closely at this, but it sounds to me like we
already have that, at least twice:

https://en.wikipedia.org/wiki/Sixel
https://en.wikipedia.org/wiki/ReGIS

There are terminal emulators that support these.

~mark

On 8/5/22 09:36, William_J_G Overington via Unicode wrote:
> Sławomir Osipiuk wrote as follows.
>
>> The one legitimate issue is that maybe you want the syntax to be
>> default-ignorable "for free" on platforms that don't support it and
>> the existing unicode tag characters look mighty tempting.
>
> It seems to me that one could have a graphics format that defaults
> gracefully to give an indication of the intended graphic by using a
> few symbols each accompanied by a Variation Selector, for example
> herein, U+FE0C VARIATION SELECTOR-13.
>
> Whereas, there are
>
> U+25A0 BLACK SQUARE
> U+25A1 WHITE SQUARE
> U+2605 BLACK STAR
>
> One could have the following.
>
> Ordinary Unicode display is called text mode.
> There is also graphics mode.
>
> The sequence U+25A0 U+FE0C means
>
> begin
>   if in text mode then
>     begin
>       enter graphics mode;
>       start a graphic with a black pixel in the upper left corner;
>       set next_place be one pixel to the right of that black pixel;
>     end
>   elsif in graphics mode then
>     begin
>       place a black pixel at next_place;
>       set next_place be one pixel to the right of that black pixel;
>     end;
> end.
>
> The sequence U+25A1 U+FE0C means
>
> begin
>   if in text mode then
>     begin
>       enter graphics mode;
>       start a graphic with a white pixel in the upper left corner;
>       set next_place be one pixel to the right of that white pixel;
>     end
>   elsif in graphics mode then
>     begin
>       place a white pixel at next_place;
>       set next_place be one pixel to the right of that white pixel;
>     end;
> end.
>
> The sequence U+2605 U+FE0C means
>
> begin
>   if in graphics mode then
>     begin
>       set next_place to be at the start of the next row of the graphic;
>     end;
> end.
>
> In graphics mode, carriage return, linefeed and space are ignored.
>
> In graphics mode, any character received that is not a graphics mode
> sequence causes graphics mode to be left gracefully and the received
> character to be displayed in text mode after the graphic, the graphic
> continuing to be displayed.
>
> ----
>
> The above describes a first attempt to produce something to discuss.
>
> The capability of the system could be extended to include colours.
>
> The capability of the system could be extended to produce 3d images
> too by using
> U+25CB WHITE CIRCLE and
> U+25B2 BLACK UP-POINTING TRIANGLE
>
> The sequence U+25CB U+FE0C means
>
> begin
>   if in text mode then
>     begin
>       enter graphics mode;
>       start a graphic with a transparent pixel in the upper left corner;
>       set next_place be one pixel to the right of that transparent pixel;
>     end
>   elsif in graphics mode then
>     begin
>       place a transparent pixel at next_place;
>       set next_place be one pixel to the right of that transparent pixel;
>     end;
> end.
>
> The sequence U+25B2 U+FE0C means
>
> begin
>   if in graphics mode then
>     begin
>       set next_place to be at the start of the first pixel of the first row
>       of the graphic, one layer forward of the present layer;
>     end;
> end.
>
> In 3d images, each pixel would be regarded as a voxel.
>
> Someone typesetting such a graphic could thus include carriage return
> and line feed characters as they would not affect the graphic display
> but would help to provide a graceful fallback display.
>
> To help in providing a graceful fallback display, U+25B6 BLACK
> RIGHT-POINTING TRIANGLE could be used in the system.
>
> The sequence U+25B6 U+FE0C means
>
> begin
>   enter graphics mode;
>   set next_place at the upper left corner;
> end.
>
> could be used followed by carriage return and line feed, if so desired.
>
> That would mean that the fallback display would start on a new line of
> the display rather than the first line of the fallback display not
> being aligned with subsequent lines of the fallback display.
>
> William Overington
>
> Friday 5 August 2022

From abrahamgross at disroot.org Fri Aug 5 19:16:22 2022
From: abrahamgross at disroot.org (ag disroot)
Date: Sat, 6 Aug 2022 00:16:22 +0000 (UTC)
Subject: Emotes
In-Reply-To: <8c44d6e3-ce19-b85f-fa6d-cadae0528b77@shoulson.com>
References: <66a3caa6.c95f.182698df475.Webtop.119@btinternet.com> <7774f87f.d52d.1826e384fa8.Webtop.101@btinternet.com> <8c44d6e3-ce19-b85f-fa6d-cadae0528b77@shoulson.com>
Message-ID: <1de796d2-2360-4e15-b9d6-047a736bb8c1@disroot.org>

It's interesting how these technologies exist and are supported by
modern terminal emulators, but outside of Überzug, I haven't seen a
single TUI program that uses them.

From wjgo_10009 at btinternet.com Sat Aug 6 13:18:51 2022
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 6 Aug 2022 19:18:51 +0100 (BST)
Subject: Emotes
In-Reply-To: <8c44d6e3-ce19-b85f-fa6d-cadae0528b77@shoulson.com>
References: <66a3caa6.c95f.182698df475.Webtop.119@btinternet.com> <7774f87f.d52d.1826e384fa8.Webtop.101@btinternet.com> <8c44d6e3-ce19-b85f-fa6d-cadae0528b77@shoulson.com>
Message-ID: <204f22e3.ece0.1827461a48f.Webtop.101@btinternet.com>

Thank you for being willing to assess my suggestion.

There can be added in the seven Unicode coloured squares U+1F7E5
through to U+1F7EB, each in a sequence with a U+FE0C VARIATION
SELECTOR-13.

It may well be that it is desirable for the black square and the white
square used to be changed to U+2B1B and U+2B1C respectively, each in a
sequence with a U+FE0C VARIATION SELECTOR-13.

I am also adding in four characters, each in a sequence with a U+FE0C
VARIATION SELECTOR-13, so as to include metallic effects in the graphic
display while also having in a default display an indication of
metallic effects even though their default display is not metallic.

U+25F0 U+FE0C gold
U+25F3 U+FE0C silver
U+25F1 U+FE0C bronze
U+25F2 U+FE0C copper

Please note that the listing order is not the same as in the Unicode
code chart. This is deliberate so that the quadrant order in English
book reading order is in the value order of the metals.

I feel that it is important to find a good balance of what can be done
balanced with keeping the format as lightweight as possible, easy to
typeset and with a graceful fallback for systems that do not support
the format.

Within that constraint I think that I can include some animation
possibility as well.

Also a way to produce barcodes and QR codes effectively, by having a
way to signal, before any pixels are specified in a graphic, to use,
for the whole graphic, pixel chunks 1 pixel wide by many pixels tall,
or by using pixel chunks that are 2 pixels by 2 pixels, or by using
pixel chunks that are 4 pixels by 4 pixels.

William Overington

Saturday 6 August 2022

------ Original Message ------
From: "Mark E. Shoulson via Unicode"
To: unicode at corp.unicode.org
Sent: Saturday, 2022 Aug 6 At 01:04
Subject: Re: Emotes

Will have to look more closely at this, but it sounds to me like we
already have that, at least twice:

https://en.wikipedia.org/wiki/Sixel
https://en.wikipedia.org/wiki/ReGIS

There are terminal emulators that support these.

~mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Fri Aug 12 00:17:51 2022
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 12 Aug 2022 06:17:51 +0100
Subject: Unicode Properties and Canonical Equivalence
Message-ID: <20220812061751.29441699@JRWUBU2>

May a process conforming to Unicode requirement C6 (TUS Section 3.2),
"A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct", consider the
Unicode set

[\p{sc = Greek}&&\p{sc ≠ Greek}]

to be non-empty?

The problem is that the canonically equivalent characters U+00B4 ACUTE
ACCENT and U+1FFD GREEK OXIA have conflicting script properties, but a
Unicode-conformant process may freely interchange the two characters
whenever they appear as part of a string (Conformance Requirement C7).
This conflict was allowed to stand in Consensus 113-C16 back in 2007,
pending further study.

For me, the question arose in the context of regular regular
expressions for Unicode strings under canonical equivalence. A
practical solution of instead using scx=Greek does not work, for U+00B4
does not include Greek in its script extensions.

The only sane resolution I can see is to treat \p{sc = Greek} as the
set of characters canonically equivalent to a character with the script
property value of Greek, and similarly \p{sc ≠ Greek} as the set of
characters canonically equivalent to a character with a script property
value other than Greek. Disallowing the script property seems insane.

Richard.

From dchmelik at gmail.com Sun Aug 14 05:20:37 2022
From: dchmelik at gmail.com (dchmelik at gmail.com)
Date: Sun, 14 Aug 2022 03:20:37 -0700
Subject: Western symbols? Large symbol site? Superscripts?
Message-ID:

Unicode has a large number of Eastern
philosophical/metaphysical/spiritual/religious symbols, including scant
far Eastern ones but a huge number of crosses (stemming from near East,
not strictly Western).

There are fewer well-recognizable true Western/European (pagan/heathen)
ones: it's nice there's sun/monad (though I don't count astrological
symbols, some divinities but also pseudoscience), and nice there's owl,
snake, bee, spear, caduceus, thunderbolts, eagle, ankh, pentagram
except circled (I recall one can place circle over like on reverse C
before copyleft, but usually a mess... never works well for me)... of
course most of those are obscure and some may only be emojis...

Much from SymbolDictionary.net (and similar) should be in Unicode (some
won't fit except if emojis are large. Maybe down/shut):
http://web.archive.org/web/20220629004008/http://symboldictionary.net/ .
There should arguably be flaming torch, perhaps Aegis/Medusa, likely
lyre, and (I forgot stuff from this set but) hammer--not just obvious
Thor's but Hephaestus/Vulcan's, and Slavic Hands of God
( http://en.wikipedia.org/wiki/Funerary_urn_from_Bia%C5%82a#/media/File:Hands_of_God.svg ),
Awen ( http://en.wikipedia.org/wiki/Awen , don't know which version,
but USA-approved for veterans' headstones), perhaps cauldron, probably
all Valknuts ( http://en.wikipedia.org/wiki/Valknut ). Sun &
Pythagorean monad may have variations but there should arguably be
Pythagorean duad, triad, tetrad, pentad, hexad, heptad, octad, nonad,
tetractys, though some variants get big, but at least tetractys:
http://en.wikipedia.org/wiki/Tetractys . One pentad variant is simply
circled pentagram (used by some Hellenists and a large number of Celtic
& other pagans, and others including in West Asia, and USA-approved for
veterans' headstones, as are some others).

I have a few old books like SymbolDictionary.net but there is/was also
an obscure better website (since 1990s or '0s) which had (tens of?)
thousands of symbols including virtually all
philosophical/spiritual/religious and even more others... does anyone
know it? I viewed it before I ever heard of Unicode but thought, since
it has a similarly large number of symbols, Unicode people might know.
There may be a couple of similar sites, one of which is easier to find
but newer (and has far fewer symbols and is more difficult to
navigate).

Of course, I don't expect (m)any of these may appear anytime soon or
for years: just suggestions. What I consider a bit more important is
full Greek superscripts or /at least/ pi (π), used in the most
important mathematical equation, eⁱˣ=cos x+i sin x: eⁱ^π+1=0. I
mentioned that a couple of times, explaining that the oldest but still
widely-used Internet areas (NNTP/Usenet (apart from perhaps Google
Groups posting HTML) and Internet Relay Chat (IRC)) are plaintext...
IRC isn't changing because it is also command-line, and within the last
year there have still been IRC science/mathematics chat rooms with
maybe 1,000+ people... no one wants to put 'p' for 'π'. It seems most
Unicode discussers never think in terms of command-line &
pre-World_Wide_Web (WWW) protocols--only GUI desktop personal computer
(PC) and WWW--some missed/ignored my argument and stated 'any maths
discussion area has "rich" text' (incorrect): please try to think
outside one's main PC context/paradigm and consider
plaintext/command-line scientists/technicians (surely some work(ed) on
Unicode). It took many years of suggestions to get copyleft, so it is
unlikely to be any better a case with 'π' (but for advanced maths,
apparently every Greek letter is used both superscript & subscript, and
in the 1800s Hebrew was added, but uncommonly superscript & subscript
so I am not asking for Hebrew).

Glad to see more far Eastern symbol proposals, which should come first
but in relation: a Eurasian Tengri symbol or a few would eventually
be nice (obscure so won't yet post Tengri crescent, yurt top...
another--Tengri shield--is similar to a native American symbol, whose
symbols should be considered (not meaning just USA but The Americas,
North & South)). All I do is suggest; others also suggested copyleft
and someone finally proposed & added... unlikely we'll get '^π' soon
but maybe someone/anyone likes spiritual symbols?

--D

From markus.icu at gmail.com Mon Aug 15 13:38:24 2022
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 15 Aug 2022 11:38:24 -0700
Subject: Unicode Properties and Canonical Equivalence
In-Reply-To: <20220812061751.29441699@JRWUBU2>
References: <20220812061751.29441699@JRWUBU2>
Message-ID:

On Thu, Aug 11, 2022 at 10:21 PM Richard Wordingham via Unicode <
unicode at corp.unicode.org> wrote:

> May a process conforming to Unicode requirement C6 (TUS Section 3.2),
> "A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct", consider the
> Unicode set
>
> [\p{sc = Greek}&&\p{sc ≠ Greek}]
>
> to be non-empty?
>

Regardless of other considerations, a set and its inverse are disjoint.

> The problem is that the canonically equivalent characters U+00B4 ACUTE
> ACCENT and U+1FFD GREEK OXIA have conflicting script properties, but a
> Unicode-conformant process may freely interchange the two characters
> whenever they appear as part of a string (Conformance Requirement C7).
> This conflict was allowed to stand in Consensus 113-C16 back in 2007,
> pending further study.
>

Would you mind providing the information that you have already
collected? Such as the script property values for these characters, and
what that 2007 consensus says and what it was based on; and which value
you think we should change to what other value.

Thanks,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Mon Aug 15 21:08:50 2022
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 16 Aug 2022 03:08:50 +0100
Subject: Unicode Properties and Canonical Equivalence
In-Reply-To:
References: <20220812061751.29441699@JRWUBU2>
Message-ID: <20220816030850.53545e16@JRWUBU2>

On Mon, 15 Aug 2022 11:38:24 -0700
Markus Scherer via Unicode wrote:

> On Thu, Aug 11, 2022 at 10:21 PM Richard Wordingham via Unicode <
> unicode at corp.unicode.org> wrote:
>
> > May a process conforming to Unicode requirement C6 (TUS Section
> > 3.2), "A process shall not assume that the interpretations of two
> > canonical-equivalent character sequences are distinct", consider the
> > Unicode set
> >
> > [\p{sc = Greek}&&\p{sc ≠ Greek}]
> >
> > to be non-empty?
> >
>
> Regardless of other considerations, a set and its inverse are
> disjoint.

You're now asserting that \P{prop = val} and \p{prop ≠ val} are
synonymous. To give a clear concrete example, I couldn't find a
definition of \p{scx ≠ Beng}. Does this contain U+0964 DEVANAGARI
DANDA, which is in \p{scx = Beng}?

Perhaps then it would be less confusing to ask whether
[\p{sc = Greek}&&\p{sc = Common}] may be considered to be non-empty.

> > The problem is that the canonically equivalent characters U+00B4
> > ACUTE ACCENT and U+1FFD GREEK OXIA have conflicting script
> > properties, but a Unicode-conformant process may freely interchange
> > the two characters whenever they appear as part of a string
> > (Conformance Requirement C7). This conflict was allowed to stand in
> > Consensus 113-C16 back in 2007, pending further study.

> Would you mind providing the information that you have already
> collected? Such as the script property values for these characters,
> and what that 2007 consensus says and what it was based on; and which
> value you think we should change to what other value.

Consensus 113-C16 is recorded in L2/07-346 and reads:

"[113-C16] Consensus: Due to the need for further study, the Script
property value for 5 Greek compatibility accents will stay "Greek" in
Unicode 5.1.0: [L2/07-202]
"U+1FC1 GREEK DIALYTIKA AND PERISPOMENI
"U+1FED GREEK DIALYTIKA AND VARIA
"U+1FEE GREEK DIALYTIKA AND OXIA
"U+1FEF GREEK VARIA
"U+1FFD GREEK OXIA"

To this day, U+1FEE and U+1FFD have sc=Greek, while their singleton
decompositions U+0385 GREEK DIALYTIKA TONOS and U+00B4 ACUTE ACCENT
have sc=Common. The other three lack singleton decompositions and
therefore present me with no formal issues, though the script
assignments can lead to the first two being rendered differently in
their NFC forms (Greek font) and NFD forms (Latin font).

I am still working on generating a modern day (Unicode 14.0, ideally
also candidate Unicode 15.0) equivalent of the anomaly report
L2/07-071.

UTC minutes do not record technical reasoning. I presume the problem is
that some stand-alone Greek accents would be treated as Greek and
others as Common (raised by Ken Whistler in L2/07-202). I note that
this could lead to them being rendered using different fonts,
especially when in different paragraphs in plain text. There may be
other issues.

For the properties with the issue that characters and their singleton
decompositions have different property values, it has occurred to me
that one solution would be to instead support two derived properties:

1) The value of the iterated singleton decomposition of the character
if any, otherwise the original property.

2) The set of the values for the character and everything of which it
is a singleton decomposition.

In the case of property sc, I am wondering whether to notate them as
say sc_i and sc_s or whether to reuse the name sc for one of them. This
takes me back to the original question. (Brevity is useful as I often
use property-based regular expressions at the command line.)

Richard.
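
The two derived properties described above can be sketched in Python as
follows. This is only an illustrative sketch, not part of the original
message. The names sc_i and sc_s follow the message; the SC table is a
hypothetical four-entry excerpt holding only the Script values quoted in
this thread (Python's unicodedata module does not expose the Script
property), and singleton_target and singleton_chain are helper names
invented for the illustration.

```python
import unicodedata

# Hypothetical excerpt of Script (sc) values, limited to the four
# characters whose values are quoted in this thread.
SC = {0x1FEE: "Greek", 0x1FFD: "Greek", 0x0385: "Common", 0x00B4: "Common"}

def singleton_target(cp):
    """Return the code point of cp's canonical singleton decomposition, or None."""
    fields = unicodedata.decomposition(chr(cp)).split()
    # A canonical singleton has exactly one field and no <compat>-style tag.
    if len(fields) == 1 and not fields[0].startswith("<"):
        return int(fields[0], 16)
    return None

def singleton_chain(cp):
    """Return cp followed by its iterated singleton decompositions."""
    chain = [cp]
    while (nxt := singleton_target(chain[-1])) is not None:
        chain.append(nxt)
    return chain

def sc_i(cp):
    """Derived property 1: sc of the iterated singleton decomposition, else sc of cp."""
    return SC[singleton_chain(cp)[-1]]

def sc_s(cp):
    """Derived property 2: sc values of cp and of everything that decomposes to it.

    Only characters present in the excerpt table SC are considered.
    """
    return {SC[other] for other in SC if cp in singleton_chain(other)}

if __name__ == "__main__":
    for cp in sorted(SC):
        print(f"U+{cp:04X} sc={SC[cp]} sc_i={sc_i(cp)} sc_s={sorted(sc_s(cp))}")
```

Under these definitions U+1FFD and U+00B4 agree on sc_i, while sc_s of
U+00B4 contains both Greek and Common, which is the anomaly at issue. A
real implementation would read Scripts.txt rather than a hand-written
table.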

From richard.wordingham at ntlworld.com Tue Aug 16 03:10:52 2022
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 16 Aug 2022 09:10:52 +0100
Subject: Unicode Properties and Canonical Equivalence
In-Reply-To:
References: <20220812061751.29441699@JRWUBU2>
Message-ID: <20220816091052.61f08b03@JRWUBU2>

On Mon, 15 Aug 2022 11:38:24 -0700
Markus Scherer via Unicode wrote:

> ... and which
> value you think we should change to what other value.

I wasn't suggesting that values may be changed, though my question may
constitute evidence that some values should be changed. My question
was as to how we should handle the anomalies while complying with
conformance requirement C6 in TUS Section 3.2. Perhaps some
Unicode properties are simply inconsistent with that requirement. If
anything should be changed, perhaps it is the guidance on regular
expressions.

Richard.

From asmusf at ix.netcom.com Wed Aug 17 06:49:45 2022
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 17 Aug 2022 04:49:45 -0700
Subject: Unicode Properties and Canonical Equivalence
In-Reply-To: <20220816091052.61f08b03@JRWUBU2>
References: <20220812061751.29441699@JRWUBU2> <20220816091052.61f08b03@JRWUBU2>
Message-ID: <2a39650c-37d8-96f0-1ba2-990e815dcebe@ix.netcom.com>

A process *may* treat two canonically equivalent sequences differently.
For example when determining how to allocate buffers, any length
difference matters and may, at some point, surface to the user, if not
intentionally.

This case seems somewhat equivalent.

What the conformance clause intends is that processes (and protocols
for that matter) don't intentionally rely on the differences in
encoding. (However, for example, a protocol may require a particular
normalization form, while rejecting unnormalized data).

[If people feel that this is forbidden by the current conformance
clause, we would have serious troubles with protocols like IDNA2008
which enforce Normalization Form NFC for representation of data at
certain interfaces.]

A minor infidelity in script run parsing doesn't appear to rise to the
level of concern that was the focus of the conformance clause about
treating different normalizations differently.

That said, it's strongly preferable to design properties with closure
under normalization, but edge cases like this need to be handled with
some understanding of what the costs and benefits are of trying to
implement such a guarantee.

A./

On 8/16/2022 1:10 AM, Richard Wordingham via Unicode wrote:
> On Mon, 15 Aug 2022 11:38:24 -0700
> Markus Scherer via Unicode wrote:
>
>> ... and which
>> value you think we should change to what other value.
> I wasn't suggesting that values may be changed, though my question may
> constitute evidence that some values should be changed. My question
> was as to how we should handle the anomalies while complying with
> conformance requirement C6 in TUS Section 3.2. Perhaps some
> Unicode properties are simply inconsistent with that requirement. If
> anything should be changed, perhaps it is the guidance on regular
> expressions.
>
> Richard.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Thu Aug 18 07:44:39 2022
From: richard.wordingham at ntlworld.com (Wordingham Richard)
Date: Thu, 18 Aug 2022 13:44:39 +0100 (BST)
Subject: Unicode Properties and Canonical Equivalence
In-Reply-To: <2a39650c-37d8-96f0-1ba2-990e815dcebe@ix.netcom.com>
References: <20220812061751.29441699@JRWUBU2> <20220816091052.61f08b03@JRWUBU2> <2a39650c-37d8-96f0-1ba2-990e815dcebe@ix.netcom.com>
Message-ID: <1713215968.1625050.1660826679148@mail.virginmedia.com>

An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Sun Aug 21 09:20:58 2022
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 21 Aug 2022 15:20:58 +0100
Subject: Definition of Values of Property Vertical_Orientation
Message-ID: <20220821152058.33bdd564@JRWUBU2>

I've just spent a painful time verifying the loading of the values of
Vertical_Orientation. After the list of codepoints and ranges in the
comments of VerticalOrientation.txt for which the value defaults to
Upright, is there any reason for having the ominous wording

"All other code points, assigned and unassigned, that are not listed
explicitly in the data section of this file are given the value R."

Given the current (Version 14.0) and candidate (Version 15.0) data
sections, is there any reason for not having the more reassuring

"All code points, assigned and unassigned, that are not listed
explicitly in the data section of this file are given the value R."

One could then set up the default value of the property as Rotated and
then just read in the data section as overrides, as with other files
just defining the value of one enumeration property. As things stand,
loading the property values into an application involves three steps:

1) Set up the default value.
2) Set up the default values for the Upright regions listed in the
comments.
3) Set up the explicit values from the data file.

Given the current explicit data, Step 2 is redundant.

Richard.

From asmusf at ix.netcom.com Sun Aug 21 11:32:25 2022
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 21 Aug 2022 09:32:25 -0700
Subject: Definition of Values of Property Vertical_Orientation
In-Reply-To: <20220821152058.33bdd564@JRWUBU2>
References: <20220821152058.33bdd564@JRWUBU2>
Message-ID: <179a3800-2953-af5c-f1e1-309e5f777745@ix.netcom.com>

On 8/21/2022 7:20 AM, Richard Wordingham via Unicode wrote:
> I've just spent a painful time verifying the loading of the values of
> Vertical_Orientation.
After the list of codepoints and ranges in the > comments of VerticalOrientation.txt for which the value defaults to > Upright, is there any reason for having the ominous wording > > "All other code points, assigned and unassigned, that are not listed > explicitly in the data section of this file are given the value R." > > Given the current (Version 14.0) and candidate (Version 15.0) data > sections, is there any reason for not having the more reassuring > > "All code points, assigned and unassigned, that are not listed > explicitly in the data section of this file are given the value R." > > One could then set up the default value of the property as Rotated and > then just read in the data section as overrides, as with other files > just defining the value of one enumeration property. As things stand, > loading the property values into an application involves three steps: > > 1) Set up the default value. > 2) Set up the default values for the Upright regions listed in the > comments. > 3) Set up the explicit values from the data file. > > Given the current explicit data, Step 2 is redundant. > > Richard. The long-term goal is to have step 1 and 2 done via parsable @missing directives and to remove listings of property values for non-assigned code points. We have started this process for 15.0, but to fully get there may take another iteration or two. At this point, any fixes that don't go towards that longer-term goals would be non-starters. A./ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From markus.icu at gmail.com Sun Aug 21 17:27:16 2022 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 21 Aug 2022 15:27:16 -0700 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <20220821152058.33bdd564@JRWUBU2> References: <20220821152058.33bdd564@JRWUBU2> Message-ID: On Sun, Aug 21, 2022 at 7:24 AM Richard Wordingham via Unicode < unicode at corp.unicode.org> wrote: > I've just spent a painful time verifying the loading of the values of > Vertical_Orientation. After the list of codepoints and ranges in the > comments of VerticalOrientation.txt for which the value defaults to > Upright, is there any reason for having the ominous wording > > "All other code points, assigned and unassigned, that are not listed > explicitly in the data section of this file are given the value R." > > Given the current (Version 14.0) and candidate (Version 15.0) data > sections, is there any reason for not having the more reassuring > > "All code points, assigned and unassigned, that are not listed > explicitly in the data section of this file are given the value R." > sgtm One could then set up the default value of the property as Rotated and > then just read in the data section as overrides, as with other files > just defining the value of one enumeration property. You can do that today. As things stand, > loading the property values into an application involves three steps: > > 1) Set up the default value. > Which you can also read from the @missing line. # @missing: 0000..10FFFF; R https://www.unicode.org/reports/tr44/#Missing_Conventions 2) Set up the default values for the Upright regions listed in the > comments. > 3) Set up the explicit values from the data file. > > Given the current explicit data, Step 2 is redundant. > Right. The comments document which ranges default to Upright, but the unassigned and private use code points that have that value are also explicitly listed. 
We intend, for some version after 15, to add additional @missing lines in this file so that we no longer need to set those not-assigned code points to U, but either way you can just parse the file without hardcoding assumptions. (Unicode 15 has three files with multiple @missing lines.) markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon Aug 22 17:45:38 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 22 Aug 2022 23:45:38 +0100 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: References: <20220821152058.33bdd564@JRWUBU2> Message-ID: <20220822234538.05c26af6@JRWUBU2> On Sun, 21 Aug 2022 15:27:16 -0700 Markus Scherer via Unicode wrote: > On Sun, Aug 21, 2022 at 7:24 AM Richard Wordingham via Unicode < > unicode at corp.unicode.org> wrote: > > > I've just spent a painful time verifying the loading of the values > > of Vertical_Orientation. After the list of codepoints and ranges > > in the comments of VerticalOrientation.txt for which the value > > defaults to Upright, is there any reason for having the ominous > > wording > > > > "All other code points, assigned and unassigned, that are not listed > > explicitly in the data section of this file are given the value R." > > > > Given the current (Version 14.0) and candidate (Version 15.0) data > > sections, is there any reason for not having the more reassuring > > > > "All code points, assigned and unassigned, that are not listed > > explicitly in the data section of this file are given the value R." > > > > sgtm > > One could then set up the default value of the property as Rotated and > > then just read in the data section as overrides, as with other files > > just defining the value of one enumeration property. > > > You can do that today. The description in the file gives no assurance of that. I did, however, find the necessary assurance in UAX #44 Revision 28 Section 4.2.9.1. 
> As things stand, > > loading the property values into an application involves three > > steps: > > > > 1) Set up the default value. > > > > Which you can also read from the @missing line. > > # @missing: 0000..10FFFF; R > > https://www.unicode.org/reports/tr44/#Missing_Conventions But "U+0023 NUMBER SIGN ("#") is used to indicate comments: all characters from the number sign to the end of the line are considered part of the comment, and are disregarded when parsing data." and "The comments are purely informational, and may change format or be omitted in the future. They should not be parsed for content."!(Revision 28 Section 4.2.4). I think something needs to be added at the start of Section 4.2.4 to say that a line starting U+0023, U+0020, U+0040 is exceptionally *not* a comment line. > 2) Set up the default values for the Upright regions listed in the > > comments. > > 3) Set up the explicit values from the data file. > > > > Given the current explicit data, Step 2 is redundant. > > > > Right. The comments document which ranges default to Upright, but the > unassigned and private use code points that have that value are also > explicitly listed. > > We intend, for some version after 15, to add additional @missing > lines in this file so that we no longer need to set those > not-assigned code points to U, but either way you can just parse the > file without hardcoding assumptions. > (Unicode 15 has three files with multiple @missing lines.) So long as the @missing lines are not commented out! Richard. 
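The point about "@missing" lines living inside comments can be illustrated with a minimal sketch in Python. It assumes the line shape quoted in this thread ("# @missing: 0000..10FFFF; R") and checks for it before the usual "#" comment stripping. The names MISSING_RE, load_property and lookup, and the SAMPLE text, are made up for illustration; SAMPLE is not copied from the real VerticalOrientation.txt.

```python
import re

# Assumed shape of a machine-readable default line: "# @missing: 0000..10FFFF; R"
MISSING_RE = re.compile(
    r"^# @missing: ([0-9A-F]{4,6})\.\.([0-9A-F]{4,6})\s*;\s*(\S+)"
)


def parse_range(field):
    """Turn '0041' or '0041..005A' into an inclusive (start, end) pair."""
    start, _, end = field.strip().partition("..")
    return int(start, 16), int(end or start, 16)


def load_property(text):
    """Return (default_value, [(start, end, value), ...]) from UCD-style text."""
    default = None
    ranges = []
    for line in text.splitlines():
        # Check for @missing first: it looks like a comment but carries data.
        m = MISSING_RE.match(line)
        if m:
            default = m.group(3)
            continue
        # Everything from '#' to end of line is an ordinary comment.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = data.split(";")
        start, end = parse_range(fields[0])
        ranges.append((start, end, fields[1].strip()))
    return default, ranges


def lookup(cp, default, ranges):
    """Explicit data lines win; otherwise fall back to the single default."""
    for start, end, value in ranges:
        if start <= cp <= end:
            return value
    return default


# Made-up excerpt in the style of a single-property data file.
SAMPLE = """\
# @missing: 0000..10FFFF; R
# Ordinary comment line, ignored by the parser
3000          ; U # illustrative data line
1F000..1F02B  ; U # illustrative data line
"""

default, ranges = load_property(SAMPLE)
print(default, lookup(0x3000, default, ranges), lookup(0x0041, default, ranges))
```

With a single "@missing" line this reduces to "one default plus overrides", which is the two-step load discussed above. Files with several "@missing" lines need more than this sketch provides.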
From sosipiuk at gmail.com Mon Aug 22 18:31:55 2022 From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=) Date: Mon, 22 Aug 2022 23:31:55 +0000 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <20220822234538.05c26af6@JRWUBU2> References: <20220822234538.05c26af6@JRWUBU2> Message-ID: <1661209991589.2502010960.1221423313@gmail.com> On Monday, 22 August 2022, 18:45:38 (-04:00), Richard Wordingham via Unicode wrote: > > The description in the file gives no assurance of that. I did, > however, find the necessary assurance in UAX #44 Revision 28 Section > 4.2.9.1. This section, if anything, gives less assurance, by my reading. "Complex default values other than those specified in the "@missing" line are explicitly listed in the relevant property file, except for instances noted in this section." [...] "Vertical_Orientation: This property defaults to Rotated (R) for most code points, but defaults to Upright (U) for unassigned code points in blocks associated with scripts that are themselves predominantly Upright, in blocks for some notational systems, and in blocks predominantly associated with pictographic symbols and emoji" This implies to me that the default U values may *not* be (machine-readably) explicitly listed in this particular property file - why else would it be noted in this section? Inspecting the file, "R" is indeed listed in the @missing line, which leaves us with the implication that the default "U" in specific sections is *not* listed and must be handled specially based on outside information. This seems to be a case of descriptive conceptual information appearing to be *imperative*. Why is Vertical_Orientation even listed in 4.2.9.1 if it doesn't need special handling? How is it even a "complex" case in any meaningful way? The default is "R". The "U" ranges are all explicitly listed, making them *non-default* from a parsing standpoint, all handled by normally reading the data file. Is this not correct?
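The claim that every Upright range is explicitly listed could be checked mechanically rather than by eye. The following Python sketch reports any part of a hand-entered "defaults to Upright" block that no data line with value U covers. The function name uncovered and all ranges in the example are hypothetical; real ranges would have to be taken from the comments and data section of the file itself.

```python
def uncovered(block_ranges, data, expected="U"):
    """Return the sub-ranges of block_ranges that no data line with the
    expected value covers. Ranges are inclusive (start, end) pairs and
    data is a list of (start, end, value) triples."""
    covering = sorted((s, e) for s, e, v in data if v == expected)
    gaps = []
    for lo, hi in block_ranges:
        cursor = lo  # first code point not yet known to be covered
        for start, end in covering:
            if end < cursor or start > hi:
                continue  # this data range does not touch what remains
            if start > cursor:
                gaps.append((cursor, start - 1))
            cursor = max(cursor, end + 1)
        if cursor <= hi:
            gaps.append((cursor, hi))
    return gaps


# Hypothetical block described in comments as defaulting to Upright.
blocks = [(0x1F000, 0x1F02F)]

# Hypothetical data lines: the second covers the unassigned tail explicitly.
data_full = [(0x1F000, 0x1F02B, "U"), (0x1F02C, 0x1F02F, "U")]
data_partial = [(0x1F000, 0x1F02B, "U")]

print(uncovered(blocks, data_full))     # no gaps reported
print(uncovered(blocks, data_partial))  # the tail of the block is reported
```

An empty result for every block would mean that, for that version of the file, the Upright ranges in the comments add nothing for a parser beyond what the data lines already say.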
From richard.wordingham at ntlworld.com Tue Aug 23 05:13:02 2022 From: richard.wordingham at ntlworld.com (Wordingham Richard) Date: Tue, 23 Aug 2022 11:13:02 +0100 (BST) Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <1661209991589.2502010960.1221423313@gmail.com> References: <20220822234538.05c26af6@JRWUBU2> <1661209991589.2502010960.1221423313@gmail.com> Message-ID: <796693839.1949146.1661249582514@mail.virginmedia.com> An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Aug 23 07:36:22 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 23 Aug 2022 05:36:22 -0700 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <20220822234538.05c26af6@JRWUBU2> References: <20220821152058.33bdd564@JRWUBU2> <20220822234538.05c26af6@JRWUBU2> Message-ID: <928ea0a5-e01a-c4c6-ecb4-de44a54a5461@ix.netcom.com> On 8/22/2022 3:45 PM, Richard Wordingham via Unicode wrote: > But "U+0023 NUMBER SIGN ("#") is used to indicate comments: all > characters from the number sign to the end of the line are considered > part of the comment, and are disregarded when parsing data." > and "The comments are purely informational, and may change format or be > omitted in the future. They should not be parsed for > content."!(Revision 28 Section 4.2.4). > > I think something needs to be added at the start of Section 4.2.4 to say > that a line starting U+0023, U+0020, U+0040 is exceptionally*not* a > comment line. That the @missing directives are contained in comment lines is a long-standing issue. Indeed, section 4.2.10 starts (emphasis added): 4.2.10 @missing Conventions Specially-formatted /*comment lines*/ with the keyword "@missing" are used to define default property values for ranges of code points not explicitly listed in a data file. These lines follow regular conventions that make them machine-readable. 
An @missing line /*starts with the comment character "#", followed by a space, then the "@missing" keyword*/, followed by a colon, another space, a code point range, and a semicolon. Then the line typically continues with a semicolon-delimited list of one or more default property values. For example: # @missing: 0000..10FFFF; Unknown .... I see no reason to add anything to section 4.2.4 other than, perhaps, a note that points to section 4.2.10 from the bullet item you cite. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Aug 23 07:51:26 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 23 Aug 2022 05:51:26 -0700 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <796693839.1949146.1661249582514@mail.virginmedia.com> References: <20220822234538.05c26af6@JRWUBU2> <1661209991589.2502010960.1221423313@gmail.com> <796693839.1949146.1661249582514@mail.virginmedia.com> Message-ID: <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> On 8/23/2022 3:13 AM, Wordingham Richard via Unicode wrote: > >> On 23/08/2022 00:31 Sławomir Osipiuk via Unicode >> wrote: >> >> >> Why is Vertical_Orientation even listed in 4.2.9.1 if it doesn't need >> special handling? How is it even a "complex" case in any meaningful way? >> The default is "R". The "U" ranges are all explicitly listed, making >> them >> *non-default* from a parsing standpoint, all handled by normally reading >> the data file. Is this not correct? > The Unicode term "default property value" has only a limited > connection with the natural English meaning of the phrase. A "default > property value" of an encoded character property is one taken by > unassigned code points or encoded characters for which the property is > irrelevant (TUS Section 3.5 D26). Its connection with parsing is > currently weak and confusing when there are multiple "default property > values".
> > Worse, only an encoded character can have an "explicit property value" > (D24)! > > Richard. There's a dual use of "default". For a code point that has an assigned character, a "default" value is one that is omitted in the data file listing, which comes in handy for binary properties: you only need to list those with a value of "True". For unassigned code points, a "default" means the most likely future value. In a few cases, that's not a single value across the entire code space, but there may be regions set aside for encoding characters that require different values than the default and where it makes sense to "future proof" some algorithms by picking a different value as the most likely one. Whether the actual value will later correspond to the default value is left open and there will be some exceptions, but generally these values are chosen to minimize disruptions. This range-based concept of defaults is what's called "complex" defaults. Now, the issue arises how to document them. The current approach on record is to use multiple @missing directives, with each later one resetting the value for the range given. The first one would cover the range 0000..10FFFF to set the general default for the entire code space and any following @missing directives would override selected subranges. Finally, the explicit values would override any default values set in @missing directives. For compatibility with older parsers, all @missing directives are wrapped in comments. For some properties, such as derived bidi class, the full scheme will be present in 15.0, but vertical orientation missed the cutoff, so that will be taken care of in the next version(s). Where multiple @missing lines are used, you will no longer see explicit listing of default values for reserved code points. A./ -------------- next part -------------- An HTML attachment was scrubbed...
URL: From sosipiuk at gmail.com Tue Aug 23 12:12:36 2022 From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=) Date: Tue, 23 Aug 2022 17:12:36 +0000 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> References: <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> Message-ID: <1661273955452.2394519965.3855422546@gmail.com> On Tuesday, 23 August 2022, 08:51:26 (-04:00), Asmus Freytag via Unicode wrote: For compatibility with older parsers, all @missing directives are wrapped in comments. If @missing directives are meaningful but ignored by older parsers, doesn't that result in incorrect values? Is that preferable to the parser simply failing on a new syntax? Would it only give incorrect values to unassigned code points? For some properties, such as derived bidi class, the full scheme will be present in 15.0, but vertical orientation missed the cutoff, so that will be taken care of in the next version(s). Where multiple @missing lines are used, you will no longer see explicit listing of default values for reserved code points. Then, from a parsing perspective, Vertical_Orientation is not currently complex (it has one default and all other values are explicit) but it will be complex (multiple defaults) in the next version. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmusf at ix.netcom.com Tue Aug 23 12:29:42 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 23 Aug 2022 10:29:42 -0700 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <1661273955452.2394519965.3855422546@gmail.com> References: <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> <1661273955452.2394519965.3855422546@gmail.com> Message-ID: <905f6772-d4b0-1de9-e4b4-3383cc07cd0c@ix.netcom.com> On 8/23/2022 10:12 AM, Sławomir Osipiuk wrote: > On Tuesday, 23 August 2022, 08:51:26 (-04:00), Asmus Freytag via > Unicode wrote: > > For compatibility with older parsers, all @missing directives are > wrapped in comments. > > > If @missing directives are meaningful but ignored by older parsers, > doesn't that result in incorrect values? Is that preferable to the > parser simply failing on a new syntax? In principle, yes, and we've backed out of some suggested solutions at one point because there was the danger that a parser might not read a field. However, this train has left the station with Unicode 15.0; we're committed to moving to the new scheme for non-binary properties and will finish implementing it. That VO hasn't been moved over was a resource issue, not a policy one. @missing directives simply provide machine readable information where originally we had human-readable comments. Old parsers were supposed to implement the default values via the human readable description of the properties. > > Would it only give incorrect values to unassigned code points? 90% of properties have a single "all other code points" value, which is the same for code points not listed as well as not assigned. And in most cases, there's a value like "No" or "Other" that's the obvious value to choose. Those are relatively straightforward to build into a parser (or an API returning property data). But it's better to be explicit as we now are with the @missing directives. > > For some properties, such as derived bidi class, the
full scheme > will be present in 15.0, but vertical orientation missed the > cutoff, so that will be taken care of in the next version(s). > > Where multiple @missing lines are used, you will no longer see > explicit listing of default values for reserved code points. > > > Then, from a parsing perspective, Vertical_Orientation is not > currently complex (it has one default and all other values are > explicit) but it will be complex (multiple defaults) in the next version. No, it does have complex "defaults" in the second sense of default (value for unassigned code point) but not for the first sense of default ("omitted value in the listing"). Past time we got this all moved to a single scheme. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From sosipiuk at gmail.com Tue Aug 23 12:56:27 2022 From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=) Date: Tue, 23 Aug 2022 17:56:27 +0000 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <905f6772-d4b0-1de9-e4b4-3383cc07cd0c@ix.netcom.com> References: <905f6772-d4b0-1de9-e4b4-3383cc07cd0c@ix.netcom.com> Message-ID: <1661276034376.3422620534.2897572593@gmail.com> On Tuesday, 23 August 2022, 13:29:42 (-04:00), Asmus Freytag wrote: No, it does have complex "defaults" in the second sense of default (value for unassigned code point) but not for the first sense of default ("omitted value in the listing"). Excuse the pedantry, but I don't see how. If as you said earlier, "This range-based concept of defaults is what's called "complex" defaults", then Vertical_Orientation isn't complex because it *doesn't* have range-based defaults. It has one default only, and a bunch of explicit ranges (including both assigned and unassigned code points). That's what's in the data file. 
You can say that *conceptually* unassigned code point ranges are given "default" values (that are actually explicit in the data file) but this invites confusion, as this whole thread indicates. If we are being given instructions on how to parse data, such descriptions are superfluous and make the programmer question what their responsibility is. Ordinary developers shouldn't need to understand every nuance and motivation of Unicode design, they just want to know how to Get It Right. To a programmer, a complex default is something that *cannot* just be a single "else" value. For VO, that's coming in the next version, but it isn't here yet. Including it in 4.2.9.1 was premature. -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Aug 23 13:06:02 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 23 Aug 2022 11:06:02 -0700 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <1661276034376.3422620534.2897572593@gmail.com> References: <905f6772-d4b0-1de9-e4b4-3383cc07cd0c@ix.netcom.com> <1661276034376.3422620534.2897572593@gmail.com> Message-ID: <2e27fff1-14d3-0f00-f61d-e164f9a42ef2@ix.netcom.com> On 8/23/2022 10:56 AM, Sławomir Osipiuk wrote: > If we are being given instructions on how to parse data, such > descriptions are superfluous and make the programmer question what > their responsibility is. Ordinary developers shouldn't need to > understand every nuance and motivation of Unicode design, they just > want to know how to Get It Right. Correct. Hence the move to @missing directives. A./ -------------- next part -------------- An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Tue Aug 23 13:28:57 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 23 Aug 2022 19:28:57 +0100 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> References: <20220822234538.05c26af6@JRWUBU2> <1661209991589.2502010960.1221423313@gmail.com> <796693839.1949146.1661249582514@mail.virginmedia.com> <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> Message-ID: <20220823192857.20a07cc7@JRWUBU2> On Tue, 23 Aug 2022 05:51:26 -0700 Asmus Freytag via Unicode wrote: > For unassigned code points, a "default" means the most likely future > value. In a few cases, that's not a single value across the entire > code space, but there may be regions set aside for encoding > characters that require different values than the default and where > it makes sense to "future proof" some algorithms by picking a > different value as the most likely one. > Whether the actual value will later correspond to the default value > is left open and there will be some exceptions, but generally these > values are chosen to minimize disruptions. Reality is a cheap version of this. And I'm not sure that choosing the most likely value minimises the disruption to be expected when new characters arrive before the new UCD; other values may do better. > Where multiple @missing lines are used, you will no longer see > explicit listing of default values for reserved code points. Which will *silently* damage some parsers' output. The damage should show as Unicode 16.0 comes out. Richard. 
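The silent damage described here can be made concrete with a small Python sketch of the layered scheme outlined earlier in the thread: the first "@missing" line sets the general default, later "@missing" lines override subranges, and explicit data lines override all defaults. The function resolve_legacy models a hypothetical older parser that honours only the first "@missing" line; it is an assumption for illustration, not a description of any real tool. All ranges and values below are invented for the example.

```python
def resolve(cp, missing, data):
    """Layered lookup. 'missing' and 'data' are lists of inclusive
    (start, end, value) triples in file order."""
    value = None
    for start, end, v in missing:  # each later @missing line resets its range
        if start <= cp <= end:
            value = v
    for start, end, v in data:     # explicit values override every default
        if start <= cp <= end:
            value = v
    return value


def resolve_legacy(cp, missing, data):
    """Hypothetical older parser: sees only the first @missing line."""
    return resolve(cp, missing[:1], data)


# Invented example: a general default plus one overriding subrange.
missing = [
    (0x0000, 0x10FFFF, "R"),
    (0x1F000, 0x1F0FF, "U"),
]
# Invented explicit data covering only the assigned part of that subrange.
data = [(0x1F000, 0x1F02B, "U")]

for cp in (0x0041, 0x1F000, 0x1F02D):
    print(f"{cp:04X}", resolve(cp, missing, data), resolve_legacy(cp, missing, data))
```

In this example the two lookups agree on every code point that has an explicit data line and differ only for a code point that relies on the second "@missing" line, which is why the discrepancy would tend to stay invisible until such a code point is actually assigned and used.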
From markus.icu at gmail.com Tue Aug 23 17:51:35 2022 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 23 Aug 2022 15:51:35 -0700 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <20220823192857.20a07cc7@JRWUBU2> References: <20220822234538.05c26af6@JRWUBU2> <1661209991589.2502010960.1221423313@gmail.com> <796693839.1949146.1661249582514@mail.virginmedia.com> <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> <20220823192857.20a07cc7@JRWUBU2> Message-ID: On Tue, Aug 23, 2022 at 11:32 AM Richard Wordingham via Unicode < unicode at corp.unicode.org> wrote: > > Where multiple @missing lines are used, you will no longer see > > explicit listing of default values for reserved code points. > > Which will *silently* damage some parsers' output. The damage should > show as Unicode 16.0 comes out. > Unicode *15*, in a few weeks. Depending on the parser, you might see it getting confused about multiple @missing lines, or getting incorrect property values for unassigned code points. We have had one prominent data file with multiple @missing lines since before the start of Unicode 15 beta, and we emphasized this on the beta review page . Starting with Version 15.0, some data files in the UCD may contain multiple @missing lines defined for the same property. This is currently the case for DerivedBidiClass.txt. UCD file parsers will need to be updated to treat the additional @missing lines like data lines. See UAX #44 Section 4.2.10, @missing Conventions for details. Some weeks later we added multiple @missing lines to a couple more data files. We are trying to innovate on Unicode data files in the gentlest way possible... Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Tue Aug 23 20:36:08 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 24 Aug 2022 02:36:08 +0100 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: References: <20220822234538.05c26af6@JRWUBU2> <1661209991589.2502010960.1221423313@gmail.com> <796693839.1949146.1661249582514@mail.virginmedia.com> <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> <20220823192857.20a07cc7@JRWUBU2> Message-ID: <20220824023608.628b8d17@JRWUBU2> On Tue, 23 Aug 2022 15:51:35 -0700 Markus Scherer via Unicode wrote: > On Tue, Aug 23, 2022 at 11:32 AM Richard Wordingham via Unicode < > unicode at corp.unicode.org> wrote: > > > > Where multiple @missing lines are used, you will no longer see > > > explicit listing of default values for reserved code points. > > > > Which will *silently* damage some parsers' output. The damage > > should show as Unicode 16.0 comes out. > > > > Unicode *15*, in a few weeks. Unicode 14 UCD files will have been parsed correctly. Out of date parsers should still handle characters that are assigned in Unicode 15.0. However, the complex default values for characters unassigned in Unicode 15.0 will not be loaded properly. When characters newly assigned in Unicode 16.0 start hitting applications supposed to be using the UCD of Unicode 15.0, then the mitigations that should be in place may not be there. The effect is horribly subtle. > Depending on the parser, you might see it getting confused about > multiple @missing lines, or getting incorrect property values for > unassigned code points. > > We have had one prominent data file with multiple @missing lines since > before the start of Unicode 15 beta, and we emphasized this on the > beta review page . > > Starting with Version 15.0, some data files in the UCD may contain > multiple @missing lines defined for the same property. This is > currently the case for DerivedBidiClass.txt. 
UCD file parsers will > need to be updated to treat the additional @missing lines like data > lines. See UAX #44 Section 4.2.10, @missing Conventions > > for details. Remember that the current draft of UAX #44 for Unicode 15.0 says that comment lines should not be parsed. The need to parse ostensible comment lines needs to be publicised. Richard. From asmusf at ix.netcom.com Wed Aug 24 11:53:39 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 24 Aug 2022 09:53:39 -0700 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <20220824023608.628b8d17@JRWUBU2> References: <20220822234538.05c26af6@JRWUBU2> <1661209991589.2502010960.1221423313@gmail.com> <796693839.1949146.1661249582514@mail.virginmedia.com> <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> <20220823192857.20a07cc7@JRWUBU2> <20220824023608.628b8d17@JRWUBU2> Message-ID: On 8/23/2022 6:36 PM, Richard Wordingham via Unicode wrote: > Remember that the current draft of UAX #44 for Unicode 15.0 says that > comment lines should not be parsed. The need to parse ostensible > comment lines needs to be publicised. This is no longer the case. The draft has been updated to clearly point to the @missing conventions. (Thanks to the participants in this discussion for identifying the oversight). The @missing conventions as such are not new, the only thing that is being changed, as result of a very deliberate UTC decision is to make the @missing convention correctly cover the few properties with complex defaults. VO, as noted, is being delayed by one version due to resource constraints. Parsers that rely on property values being listed explicitly for unassigned code points will not benefit from the change. Parsers that interpret @missing lines today, but can't handle multiple @missing lines for the same property will break visibly, and should have done so for the beta, or if limited to the VO file, will break visibly during the next beta.
Properties for assigned code points are unaffected. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Thu Aug 25 03:04:32 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 25 Aug 2022 09:04:32 +0100 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: References: <20220822234538.05c26af6@JRWUBU2> <1661209991589.2502010960.1221423313@gmail.com> <796693839.1949146.1661249582514@mail.virginmedia.com> <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> <20220823192857.20a07cc7@JRWUBU2> <20220824023608.628b8d17@JRWUBU2> Message-ID: <20220825090432.2a12e34c@JRWUBU2> On Wed, 24 Aug 2022 09:53:39 -0700 Asmus Freytag via Unicode wrote: > On 8/23/2022 6:36 PM, Richard Wordingham via Unicode wrote: > > Remember that the current draft of UAX #44 for Unicode 15.0 says > > that comment lines should not be parsed. The need to parse > > ostensible comment lines needs to be publicised. > > This is no longer the case. > > The draft has been updated to clearly point to the @missing > conventions. (Thanks to the participants in this discussion for > identifying the oversight). > > The @missing conventions as such are not new, the only thing that is > being changed, as result of a very deliberate UTC decision is to make > the @missing convention correctly cover the few properties with > complex defaults. Not yet in the outside world. https://www.unicode.org/reports/tr44/proposed.html is Draft 8, dated 4 August 2022. And clear authority to parse and interpret will be new once it is given. Richard. 
From asmusf at ix.netcom.com Thu Aug 25 03:34:17 2022 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 25 Aug 2022 01:34:17 -0700 Subject: Definition of Values of Property Vertical_Orientation In-Reply-To: <20220825090432.2a12e34c@JRWUBU2> References: <20220822234538.05c26af6@JRWUBU2> <1661209991589.2502010960.1221423313@gmail.com> <796693839.1949146.1661249582514@mail.virginmedia.com> <1775704a-8a1e-3f9f-8def-e7901f05b982@ix.netcom.com> <20220823192857.20a07cc7@JRWUBU2> <20220824023608.628b8d17@JRWUBU2> <20220825090432.2a12e34c@JRWUBU2> Message-ID: <7752994c-4341-34ce-b9ac-f804637164e3@ix.netcom.com> The updated draft will become public with the release of Unicode 15.0.0 in a few short days. For release management reasons, "proposed.html" is no longer being updated. There will be a rather extensive text describing possible migration issues on the main page for Version 15. VO is not affected this release; the best suggestion we have on the table is to add a DerivedVO file and have that one be like all the other DerivedXXX files that have added the multiple @missing lines. That would sidestep any issues for parsers written for the original VO file, and incidentally treat it on the same footing as EAW and LB. That would seem to take care of it for this round. Thanks for raising the issue. A./ On 8/25/2022 1:04 AM, Richard Wordingham via Unicode wrote: > On Wed, 24 Aug 2022 09:53:39 -0700 > Asmus Freytag via Unicode wrote: > >> On 8/23/2022 6:36 PM, Richard Wordingham via Unicode wrote: >>> Remember that the current draft of UAX #44 for Unicode 15.0 says >>> that comment lines should not be parsed. The need to parse >>> ostensible comment lines needs to be publicised. >> This is no longer the case. >> >> The draft has been updated to clearly point to the @missing >> conventions. (Thanks to the participants in this discussion for >> identifying the oversight).
>> >> The @missing conventions as such are not new, the only thing that is >> being changed, as result of a very deliberate UTC decision is to make >> the @missing convention correctly cover the few properties with >> complex defaults. > Not yet in the outside world. > https://www.unicode.org/reports/tr44/proposed.html is Draft 8, dated 4 > August 2022. > > And clear authority to parse and interpret will be new once it is given. > > Richard. From jshin1987 at gmail.com Fri Aug 26 03:28:55 2022 From: jshin1987 at gmail.com (Jungshik SHIN (신정식)) Date: Fri, 26 Aug 2022 01:28:55 -0700 Subject: Tai Tham Text Encoding In-Reply-To: <20220723171244.7fb392af@JRWUBU2> References: <20220723171244.7fb392af@JRWUBU2> Message-ID: On Sat, Jul 23, 2022 at 9:17 AM Richard Wordingham via Unicode < unicode at corp.unicode.org> wrote: > Most characters for writing words in the Tai Tham script in normal > texts have been encoded, though there are a few exceptions, of which > TAI THAM LETTER LAO LOW HA is the most prominent exception. (This is > mostly handled by repurposing TAI THAM LETTER LOW HA, which is not used > in Lao. Their relationship is like U+11034 BRAHMI LETTER LLA and > U+11075 BRAHMI LETTER OLD LETTER LLA.) On close reading of the TUS, > perhaps we also need to disunify U+1A58 TAI THAM SIGN MAI KANG LAI > depending on how it may be positioned relative to a following syllable > with a preposed vowel. (It was originally proposed as two separate > characters, distinguished by shape rather than positioning.) We may > need some monstrosities such as 'INVISIBLE MAI SAM' (though I'd rather > use CGJ). > > However, I am having a hard time persuading people that there is a > defined encoding for combinations of characters that rendering engines > should respect. 
What I regard as the basic definition of the encoding > of text is contained in the approved proposals, rather than in TUS or > any emanation thereof. > > What should I call the specification of the encoding of text, as > opposed to the encoding of characters? Would it be suitable to refer > to it as 'text encoding'? > How about "text representation"? See table 12-3 and the text around it (TUS chap 12. p.464). Or, would 'rendering rules' work better? Jungshik > I am trying to work out what in the way of Tai Tham text encoding is > laid down by the TUS and its emanations, such as the Unicode Character > Database. It is significant that the Indic syllabic category is > informative and by policy does not reflect sequencing requirements. > What I am left with is the general properties of marks, the principle > of canonical equivalence (which is still widely flouted) and the > specific text in the Tai Tham section. > > Now, extracting specifications are a bit tricky. For example, consider > "*Tone Marks*. Tai Tham has two combining tone marks, U+1A75 tai tham > sign tone-1 and U+1A76 tai tham sign tone-2, which are used in Tai Lue > and in Northern Thai. These are rendered above the vowel over the base > consonant." In modern Tai Khuen, what I take to be TONE-1 is rendered > to the right of the larger vowels over the base consonant, such as > VOWEL SIGN I. Should I therefore conclude that what I have taken to be > TONE-1 is something else? That would be ridiculous. We also have the > statement in TUS Section 2.11 that "all sequences of character codes > are permitted". > > I think I can extract some meaning from the text in the same section: > > "Tone marks are represented in logical order fol- > lowing the vowel over the base consonant or consonant stack. If there > is no vowel over a base consonant, then the tone is rendered directly > over the consonant; this is the same way tones are treated in the Thai > script." > > Consider the word ?????? 
TONE-1> in a typical Northern Thai style. The central stack, from top > to bottom, is TONE-1, SIGN I, HIGH KA, SIGN OA BELOW. If there were 'no > vowel over the base consonant', then TONE-1 would be rendered directly > over the base consonant, which is not how it is written. Therefore the > term 'vowel' refers to a vowel character rather than a complete > phonetic vowel. Therefore the logical order of the marks above and > below is either , as in the > proposals, or . The USE insists on SIGN OA, TONE-1>! (The USE order could be corrected by its override > method.) > > By contrast, there is some useful text on the position of U+1A7B TAI > THAM SIGN MAI SAM in character code sequences. > > In summary, my main two questions are: > > Is 'encoding of text' the correct phrase for the definition of the > correct arrangement? Is it appropriate to submit a proposal for the > standardisation of Tai Tham text encoding? > > Richard. From richard.wordingham at ntlworld.com Fri Aug 26 16:17:53 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 26 Aug 2022 22:17:53 +0100 Subject: Tai Tham Text Encoding In-Reply-To: References: <20220723171244.7fb392af@JRWUBU2> Message-ID: <20220826221753.207c5dff@JRWUBU2> On Fri, 26 Aug 2022 01:28:55 -0700 Jungshik SHIN (신정식) via Unicode wrote: > On Sat, Jul 23, 2022 at 9:17 AM Richard Wordingham via Unicode < > unicode at corp.unicode.org> wrote: > > What should I call the specification of the encoding of text, as > > opposed to the encoding of characters? Would it be suitable to > > refer to it as 'text encoding'? > How about "text representation"? See table 12-3 and the text > around it (TUS chap 12. p.464). > Or, would 'rendering rules' work better? Thanks, that's a useful suggestion. I think "encoded representation of Tai Tham text" would be a useful phrase. 
For Tai Tham, the debate is mostly not about how to encode glyphs, but about how to order their encodings. (There may be debates about how to handle regional mergers of characters, such as the vowel sign MAI SAT and the tone mark TONE-2, also widely called 'mai sat', and some of the writings of Pali 'jjh'.) Richard. From wunnakoko at gmail.com Sun Aug 28 17:33:50 2022 From: wunnakoko at gmail.com (Wunna Ko) Date: Sun, 28 Aug 2022 18:33:50 -0400 Subject: Burmese Rendering on Kindle Message-ID: I just noticed that the font installed on Kindle cannot render Burmese script correctly. I am wondering whether anyone on this mailing list is from Amazon and could help to set up the rendering correctly. I can be of assistance if needed. -- Wunna Ko From dchmelik at gmail.com Mon Aug 29 07:50:53 2022 From: dchmelik at gmail.com (David Chmelik) Date: Mon, 29 Aug 2022 12:50:53 -0000 (UTC) Subject: bold, italic, underline at once Message-ID: I know unicode does bold, italic, underline, but does it do them all at once? From doug at ewellic.org Mon Aug 29 09:56:32 2022 From: doug at ewellic.org (Doug Ewell) Date: Mon, 29 Aug 2022 14:56:32 +0000 Subject: bold, italic, underline at once In-Reply-To: References: Message-ID: David Chmelik wrote: > I know unicode does bold, italic, underline, but does it do them all > at once? It depends on the sense in which you mean that Unicode "does bold, italic, underline." If you're talking about ISO/IEC 6429 ("ANSI") SGR escape sequences: in principle you should be able to combine bold ("1"), italic ("3"), and underline ("4"). In practice it depends on the capabilities of your terminal or console emulator. Note that this has nothing to do with Unicode. If you're talking about Microsoft Word, OpenOffice Writer, or some other word-processing package: this is almost always supported. Note that this also has nothing to do with Unicode, or plain text at all. 
If you're talking about a private-use mechanism, such as the HTML-like Plane 14 tags supported by Andrew West's BabelPad editor: it depends on the specific mechanism. Andrew's approach does support combining these, plus strikethrough, based on font attributes. Note that no mechanism of this sort is authorized by the Unicode Standard. Mathematical Alphanumeric Symbols do not include underlining, do not support combinations of bold and italic except as explicitly encoded, are limited to a very small set of base characters, and are not intended for plain text styling in any event. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From lorna_evans at sil.org Mon Aug 29 09:57:41 2022 From: lorna_evans at sil.org (Lorna Evans) Date: Mon, 29 Aug 2022 09:57:41 -0500 Subject: bold, italic, underline at once In-Reply-To: References: Message-ID: <98b47e97-3d75-c70d-5faa-d6791f341ccd@sil.org> This is a formatting issue, not encoding. You can certainly do bold-italic at once IF you have a font that supports bold-italic. Underlining is a separate issue and is likely supported by most applications. Lorna On 8/29/2022 7:50 AM, David Chmelik via Unicode wrote: > I know unicode does bold, italic, underline, but does it do them all at > once? > > From harjitmoe at outlook.com Mon Aug 29 10:02:45 2022 From: harjitmoe at outlook.com (Harriet Riddle) Date: Mon, 29 Aug 2022 16:02:45 +0100 Subject: bold, italic, underline at once In-Reply-To: References: Message-ID: I'm not sure if I understand the question. Denoting and styling ranges of emphasis is not governed by Unicode itself, being a matter of higher-level protocols including but not limited to HTML/CSS, RTF, ECMA-48, IPTC 7901 and the myriad dialects of Markdown and Wikitext. In some contexts, stylised forms of a letter (e.g. 
blackletter) might be given a special meaning in a particular context; some such forms are included in Unicode in the Letterlike Symbols and Mathematical Alphanumeric Symbols blocks, and some (including blackletter) have HTML5 entities inherited from ISO 9573-13. Mathematical Alphanumeric Symbols likewise includes e.g. bold serif, italic serif and bold italic serif forms of the Basic Latin alphabet letters. While there might be a tendency in the wild to use these to stylise text where markup is unavailable, this is not really orthopraxic and (in more pragmatic terms) tends to be poorly supported by older devices and also assistive technology (although since they compatibility-decompose to the ASCII letters, it is theoretically /possible/ for assistive technology to support this, which cannot be said of some other novelty stylisations with lookalike characters). The underlining is available as a combining character (U+0332). This is essentially a nonspacing version of the ASCII underscore, and can be applied if an underlined version of a symbol distinct from the plain symbol is needed for some purpose. Underlining an entire block of text that way is likely to have subpar results, and should generally be done with higher level markup instead. And yes, this can be applied to Mathematical Alphanumeric Symbols characters if (say) a bold underlined sans-serif R is being used as a particular symbol for something, i.e. 𝗥̲. David Chmelik via Unicode wrote: > I know unicode does bold, italic, underline, but does it do them all at > once? > > 
From addisoni18n at gmail.com Sun Aug 28 18:35:56 2022 From: addisoni18n at gmail.com (Addison Phillips) Date: Sun, 28 Aug 2022 16:35:56 -0700 Subject: Burmese Rendering on Kindle In-Reply-To: References: Message-ID: I retired recently from Amazon and can help connect you (under separate cover) Addison On Sun, Aug 28, 2022, 15:37 Wunna Ko via Unicode wrote: > I just noticed that the font installed on Kindle cannot be rendered > Burmese script correctly. > > Wondering if anyone on this mailing list is from Amazon and help to set up > the rendering correctly? > > I can be of assistance if needed. > > -- > Wunna Ko > From wjgo_10009 at btinternet.com Mon Aug 29 10:47:13 2022 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 29 Aug 2022 16:47:13 +0100 (BST) Subject: bold, italic, underline at once In-Reply-To: References: Message-ID: <10678e76.2290f.182ea491542.Webtop.100@btinternet.com> David Chmelik wrote as follows. > I know unicode does bold, italic, underline, but does it do them all > at once? As explained by other posts in this thread that is not actually available in Unicode plain text at present. However, it could be. Some time ago I put forward a suggestion for using Variation Selector 14 to signal Italic for a character in plain text. https://www.unicode.org/L2/L2019/19063-italic-vs.pdf https://www.unicode.org/L2/L2019/19195-italic-cmt.pdf However my proposed enhancement to Unicode was rejected, indeed rejected at a formal decision not to encode level. https://www.unicode.org/alloc/nonapprovals.html However, I consider the way that that ruling is expressed is unfortunate, as it has "explicitly not" about something that I did not suggest. And it uses "inherently" which is only because the people there years ago decided that. 
Another issue is that they will not encode characters for a span of characters to be all italic because that is stateful, yet when a way to achieve the effect in plain text without being stateful is suggested they won't do that either. My suggestion could be extended to use Variation Selector 13 to signal Bold. I am unsure if Variation Selectors could be cascaded for Bold Italic or whether a single Variation Selector specifically for Bold Italic would be needed. But it is all Unicode politics. William Overington Monday 29 August 2022 From sosipiuk at gmail.com Mon Aug 29 11:36:50 2022 From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=) Date: Mon, 29 Aug 2022 16:36:50 +0000 Subject: bold, italic, underline at once In-Reply-To: References: Message-ID: <1661790332976.4278863416.1876827610@gmail.com> Technically, Unicode does NOT do bold and italic. Any bold or italic characters you find are encoded separately for mathematical use, where they can represent a different variable in a formula than their corresponding "plain" character, for example. The same applies to "double-struck" and similar characters. Using these characters to make "fancy text" is technically a misuse, though no one will punish you for doing it. If you want more such characters, you're likely out of luck. Applying formatting to regular text is outside the scope of Unicode. There are many standards for doing this. You can use HTML, Markdown, BBcode, ISO6429 escape sequences, RTF, various Office formats, and many many others. However, your software must support them. You do not get this "for free" with Unicode. On Monday, 29 August 2022, 08:50:53 (-04:00), David Chmelik via Unicode wrote: > I know unicode does bold, italic, underline, but does it do them all at > once? 
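[Editorial sketch] The point made above and earlier in the thread, that the bold and italic letters are separately encoded mathematical characters and that underlining exists only as the combining character U+0332, can be seen in a few lines of Python. This is only an illustration under stated assumptions: the function names math_bold_italic and underline are hypothetical, and the mapping covers Basic Latin letters only, which is exactly the "very small set of base characters" limitation mentioned in the thread.

```python
import unicodedata

# First code points of the MATHEMATICAL BOLD ITALIC capital and small letters.
BOLD_ITALIC_CAPITAL_A = 0x1D468
BOLD_ITALIC_SMALL_A = 0x1D482
COMBINING_LOW_LINE = "\u0332"


def math_bold_italic(text):
    """Map A-Z and a-z to MATHEMATICAL BOLD ITALIC letters; leave the rest alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, accented letters, other scripts: no equivalent here
    return "".join(out)


def underline(text):
    """Append U+0332 COMBINING LOW LINE after every character."""
    return "".join(ch + COMBINING_LOW_LINE for ch in text)


def plain(text):
    """Recover unstyled text: drop U+0332, then apply NFKC compatibility folding."""
    return unicodedata.normalize("NFKC", text.replace(COMBINING_LOW_LINE, ""))
```

A string such as underline(math_bold_italic("Hi")) is a different sequence of code points from "Hi", so searching, sorting and screen readers see different text unless they fold it back as plain() does. That is the practical cost of using these characters for styling rather than a higher-level protocol.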
From kent.b.karlsson at bahnhof.se Mon Aug 29 13:07:38 2022 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Mon, 29 Aug 2022 20:07:38 +0200 Subject: bold, italic, underline at once In-Reply-To: References: Message-ID: <5BDA1360-DD8D-4602-BD72-E18AF2166453@bahnhof.se> > On 29 Aug 2022, at 16:56, Doug Ewell via Unicode wrote: > > David Chmelik wrote: > >> I know unicode does bold, italic, underline, but does it do them all >> at once? > > It depends on the sense in which you mean that Unicode "does bold, italic, underline." > > If you're talking about ISO/IEC 6429 ("ANSI") Nit: it is actually ANSI X3.64; but the best way to refer to it (international, original, easy to remember) is ECMA-48. > SGR escape sequences: Another nit: they are control sequences (I will not delve into the details here). > in principle you should be able to combine bold ("1"), italic ("3"), and underline ("4"). Formally, it depends on the GRCM - GRAPHIC RENDITION COMBINATION MODE setting: REPLACING (**really** bad idea) or CUMULATIVE (combine). However, I know of no implementation of any of the ECMA-48 "modes" (and all of those "modes" are a bad idea anyway). > In practice it depends on the capabilities of your terminal or console emulator. A rather important note: ECMA-48 SGR is in no way at all limited to terminal emulators, though ECMA-48 is popular there since such things as RTF, markdown, HTML are all non-starters for terminals. ECMA-48 SGR control sequences are perfectly well applicable to text files (though there is a lack of implementations) and "WYSIWYG" text editors (nowadays called just GUI text editors or text editor windows). Only CSI 0m is specifically targeted to terminals, for use at the beginning(!) of prompts, doing a "general style reset" from an unknown style setting. > Note that this has nothing to do with Unicode. Not entirely true... ISO/IEC 6429 (i.e. ECMA-48) is referenced from both Unicode and ISO/IEC 10646. 
/Kent K > > If you're talking about Microsoft Word, OpenOffice Writer, or some other word-processing package: this is almost always supported. Note that this also has nothing to do with Unicode, or plain text at all. > > If you're talking about a private-use mechanism, such as the HTML-like Plane 14 tags supported by Andrew West's BabelPad editor: it depends on the specific mechanism. Andrew's approach does support combining these, plus strikethrough, based on font attributes. Note that no mechanism of this sort is authorized by the Unicode Standard. > > Mathematical Alphanumeric Symbols do not include underlining, do not support combinations of bold and italic except as explicitly encoded, are limited to a very small set of base characters, and are not intended for plain text styling in any event. > > -- > Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org > > From doug at ewellic.org Mon Aug 29 13:28:24 2022 From: doug at ewellic.org (Doug Ewell) Date: Mon, 29 Aug 2022 18:28:24 +0000 Subject: bold, italic, underline at once In-Reply-To: <5BDA1360-DD8D-4602-BD72-E18AF2166453@bahnhof.se> References: <5BDA1360-DD8D-4602-BD72-E18AF2166453@bahnhof.se> Message-ID: I was simply trying to find out what direction the OP was headed (ECMA-48 vs. something else), not focusing on the precise terminology or arcane details like GRCM. Based on the OP's post from August 14, I was guessing he was headed in the direction of "let's encode some new stuff in Unicode to support this," but I wanted to find out. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org > Nit: it is actually ANSI X3.64; but the best way to refer to it (international, original, easy to remember) is ECMA-48. > ... 
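[Editorial sketch] For the ECMA-48 reading of the question, combining bold ("1"), italic ("3") and underline ("4") amounts to putting several parameters into one SGR control sequence. The Python sketch below builds such a sequence. The helper names sgr and styled are hypothetical, and whether anything visible happens depends, as noted in the thread, on the terminal or application interpreting the sequence. The sketch switches each attribute off individually (22, 23, 24) rather than using the general reset CSI 0m.

```python
CSI = "\x1b["  # CONTROL SEQUENCE INTRODUCER in its 7-bit form, ESC [

BOLD, ITALIC, UNDERLINE = 1, 3, 4

# SGR parameters that cancel each attribute individually.
OFF = {BOLD: 22, ITALIC: 23, UNDERLINE: 24}


def sgr(*params):
    """Build one SGR control sequence carrying all the given parameters."""
    return CSI + ";".join(str(p) for p in params) + "m"


def styled(text, *params):
    """Wrap text so the given attributes are switched on before it and off after it."""
    off = [OFF[p] for p in params]
    return sgr(*params) + text + sgr(*off)


if __name__ == "__main__":
    # On a terminal that supports all three attributes, this prints the
    # word in bold italic with an underline.
    print(styled("emphasis", BOLD, ITALIC, UNDERLINE))
```

The resulting string is ordinary text with control characters in it; nothing in it depends on which Unicode characters are being styled.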
From pgcon6 at msn.com Mon Aug 29 15:36:40 2022 From: pgcon6 at msn.com (Peter Constable) Date: Mon, 29 Aug 2022 20:36:40 +0000 Subject: Unicode goes AWKward Message-ID: FYI: Unix legend, who owes us nothing, keeps fixing foundational AWK code | Ars Technica Peter From jameskass at code2001.com Mon Aug 29 18:23:28 2022 From: jameskass at code2001.com (James Kass) Date: Mon, 29 Aug 2022 23:23:28 +0000 Subject: Burmese Rendering on Kindle In-Reply-To: References: Message-ID: Speaking of Myanmar, are the glyphs for the two following characters supposed to be identical? ? U+1051 MYANMAR LETTER SSA ? U+A9FD MYANMAR LETTER TAI LAING BA On 2022-08-28 11:35 PM, Addison Phillips via Unicode wrote: > I retired recently from Amazon and can help connect you (under separate > cover) > > Addison > > On Sun, Aug 28, 2022, 15:37 Wunna Ko via Unicode > wrote: > >> I just noticed that the font installed on Kindle cannot be rendered >> Burmese script correctly. >> >> Wondering if anyone on this mailing list is from Amazon and help to set up >> the rendering correctly? >> >> I can be of assistance if needed. >> >> -- >> Wunna Ko >> From mark at kli.org Mon Aug 29 19:39:29 2022 From: mark at kli.org (Mark E. Shoulson) Date: Mon, 29 Aug 2022 20:39:29 -0400 Subject: Burmese Rendering on Kindle In-Reply-To: References: Message-ID: <5195eb09-6d80-ed86-f574-74feb908e0a9@shoulson.com> In my mailer, with whatever fonts I have, the first one has a little circle in the middle and the second one has a little solid dot in the middle. Or did you mean you thought they were supposed to be identical and wondered why they weren't? ~mark On 8/29/22 19:23, James Kass via Unicode wrote: > > Speaking of Myanmar, are the glyphs for the two following characters > supposed to be identical? > > ? U+1051 MYANMAR LETTER SSA > ? 
U+A9FD MYANMAR LETTER TAI LAING BA > > > On 2022-08-28 11:35 PM, Addison Phillips via Unicode wrote: >> I retired recently from Amazon and can help connect you (under separate >> cover) >> >> Addison >> >> On Sun, Aug 28, 2022, 15:37 Wunna Ko via Unicode >> >> wrote: >> >>> I just noticed that the font installed on Kindle cannot be rendered >>> Burmese script correctly. >>> >>> Wondering if anyone on this mailing list is from Amazon and help to >>> set up >>> the rendering correctly? >>> >>> I can be of assistance if needed. >>> >>> -- >>> Wunna Ko >>> From jameskass at code2001.com Mon Aug 29 19:49:05 2022 From: jameskass at code2001.com (James Kass) Date: Tue, 30 Aug 2022 00:49:05 +0000 Subject: Burmese Rendering on Kindle In-Reply-To: <5195eb09-6d80-ed86-f574-74feb908e0a9@shoulson.com> References: <5195eb09-6d80-ed86-f574-74feb908e0a9@shoulson.com> Message-ID: In the code charts, both are shown with a little solid dot in the middle. I was wondering if there should be a distinction between them. (Trying to update my fonts.) But I think you've answered my question. I'll just replace the solid dots in U+1050 and U+1051 with little circles in the middle. Hopefully that will work out... On 2022-08-30 12:39 AM, Mark E. Shoulson via Unicode wrote: > In my mailer, with whatever fonts I have, the first one has a little > circle in the middle and the second one has a little solid dot in the > middle. Or did you mean you thought they were supposed to be > identical and wondered why they weren't? > > ~mark > > On 8/29/22 19:23, James Kass via Unicode wrote: >> >> Speaking of Myanmar, are the glyphs for the two following characters >> supposed to be identical? >> >> ? U+1051 MYANMAR LETTER SSA >> ? 
U+A9FD MYANMAR LETTER TAI LAING BA >> >> >> On 2022-08-28 11:35 PM, Addison Phillips via Unicode wrote: >>> I retired recently from Amazon and can help connect you (under separate >>> cover) >>> >>> Addison >>> >>> On Sun, Aug 28, 2022, 15:37 Wunna Ko via Unicode >>> >>> wrote: >>> >>>> I just noticed that the font installed on Kindle cannot be rendered >>>> Burmese script correctly. >>>> >>>> Wondering if anyone on this mailing list is from Amazon and help to >>>> set up >>>> the rendering correctly? >>>> >>>> I can be of assistance if needed. >>>> >>>> -- >>>> Wunna Ko >>>> From richard.wordingham at ntlworld.com Tue Aug 30 03:15:39 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 30 Aug 2022 09:15:39 +0100 Subject: Burmese Rendering on Kindle In-Reply-To: References: <5195eb09-6d80-ed86-f574-74feb908e0a9@shoulson.com> Message-ID: <20220830091539.21e445c8@JRWUBU2> On Tue, 30 Aug 2022 00:49:05 +0000 James Kass via Unicode wrote: > In the code charts, both are shown with a little solid dot in the > middle.? I was wondering if there should be a distinction between > them. (Trying to update my fonts.) > > But I think you've answered my question.? I'll just replace the solid > dots in U+1050 and U+1051 with little circles in the middle. > Hopefully that will work out... That may be inauthentic, in so far as the Tai Laing character is 'authentic'. U+1050 and U+1051 are PA and GA with a diacritic, necessary since the sibilant's glyphs started merging with PA and GA and seen across many Indic scripts. U+A9FD is PA with a systematic diacritic, forging extra letters needed for Pali from an authentic Shan base alphabet. I'm not sure that there is any real need to distinguish the two diacritics. Richard. 
From jameskass at code2001.com Wed Aug 31 07:09:50 2022 From: jameskass at code2001.com (James Kass) Date: Wed, 31 Aug 2022 12:09:50 +0000 Subject: Burmese Rendering (dots and circles) In-Reply-To: <20220830091539.21e445c8@JRWUBU2> References: <5195eb09-6d80-ed86-f574-74feb908e0a9@shoulson.com> <20220830091539.21e445c8@JRWUBU2> Message-ID: <351a03f4-9bf7-e19d-e42f-67072f2f4810@code2001.com> On 2022-08-30 8:15 AM, Richard Wordingham via Unicode wrote: > That may be inauthentic, in so far as the Tai Laing character is > 'authentic'. U+1050 and U+1051 are PA and GA with a diacritic, > necessary since the sibilant's glyphs started merging with PA and GA > and seen across many Indic scripts. U+A9FD is PA with a > systematic diacritic, forging extra letters needed for Pali from an > authentic Shan base alphabet. I'm not sure that there is any real need > to distinguish the two diacritics. Authenticity should always be a concern. A few years back, many users didn't distinguish between "1" and "l", or "0" and "O". I prefer to make distinctions wherever feasible. But it isn't my rôle to foist my preferences on other user communities. In an effort to see what users are doing, I downloaded three OpenType Myanmar fonts. Two of them didn't cover anything from Myanmar Extended-A and -B. But the third, the Padauk font from SIL International, has the full repertoire. In Padauk the three following characters all use the inner circle: U+105C MYANMAR LETTER MON BBA ? U+1050 MYANMAR LETTER SHA ? U+1051 MYANMAR LETTER SSA ? ... and the appropriate characters in the extensions use dots. SIL International has a sterling reputation and has done wonderful work supporting non-Latin writing systems. In my opinion SIL International likely has sufficient contacts within the user community. I'm inclined to follow Padauk's lead. 
From richard.wordingham at ntlworld.com Wed Aug 31 21:53:00 2022 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 1 Sep 2022 03:53:00 +0100 Subject: Burmese Rendering (dots and circles) In-Reply-To: <351a03f4-9bf7-e19d-e42f-67072f2f4810@code2001.com> References: <5195eb09-6d80-ed86-f574-74feb908e0a9@shoulson.com> <20220830091539.21e445c8@JRWUBU2> <351a03f4-9bf7-e19d-e42f-67072f2f4810@code2001.com> Message-ID: <20220901035300.2f4af875@JRWUBU2> On Wed, 31 Aug 2022 12:09:50 +0000 James Kass via Unicode wrote: > On 2022-08-30 8:15 AM, Richard Wordingham via Unicode wrote: > > That may be inauthentic, in so far as the Tai Laing character is > > 'authentic'. U+1050 and U+1051 are PA and GA with a diacritic, > > necessary since the sibilant's glyphs started merging with PA and GA > > and seen across many Indic scripts. U+A9FD is PA with a > > systematic diacritic, forging extra letters needed for Pali from an > > authentic Shan base alphabet. I'm not sure that there is any real > > need to distinguish the two diacritics. > > Authenticity should always be a concern. > > A few years back, many users didn't distinguish between "1" and "l", > or "0" and "O". I prefer to make distinctions wherever feasible. But > it isn't my rôle to foist my preferences on other user communities. > > In an effort to see what users are doing, I downloaded three OpenType > Myanmar fonts. Two of them didn't cover anything from Myanmar > Extended-A and -B. But the third, the Padauk font from SIL > International, has the full repertoire. > > In Padauk the three following characters all use the inner circle: > U+105C MYANMAR LETTER MON BBA ? > U+1050 MYANMAR LETTER SHA ? > U+1051 MYANMAR LETTER SSA ? > ... and the appropriate characters in the extensions use dots. > > SIL International has a sterling reputation and has done wonderful > work supporting non-Latin writing systems. In my opinion SIL > International likely has sufficient contacts within the user > community. 
I'm inclined to follow Padauk's lead. I would wonder though about their contacts for Sanskrit, and for that matter, Pali, community. I seem to be the one who alerted them to the fact that they'd overlooked the DD.DDHA conjunct. Despite the statement in TUS and in the 2006 submissions to the UTC from Michael Everson and Martin Hosken, the Padauk font declines to support <U+1039 VIRAMA, U+101D WA>. The Sixth Council text of the Tipitaka, or at least, something declaring itself to be such, is printed using a triangular WA for the conjuncts, even in words like ??????????? anvāssaveyyuṃ (Verse 213 in the Dighanikaya, available at https://www.pali-text-images.net/cst/02-suttantapitaka/06-dighanikaya-1-cst.pdf), so I suspect negative feedback is limited, and besides, Padauk by default provides a round glyph for MEDIAL WA. Some of the minority characters, especially for Pali, do not seem to be robustly supported. (There's also the difficult question of whether a letter is actually a single character or a cluster.) Thus, the range of glyph variation will be unknown, and I strongly suspect that many of the regional Pali characters are actually recent inventions. Richard. From jameskass at code2001.com Wed Aug 31 23:51:05 2022 From: jameskass at code2001.com (James Kass) Date: Thu, 1 Sep 2022 04:51:05 +0000 Subject: Burmese Rendering (dots and circles) In-Reply-To: <20220901035300.2f4af875@JRWUBU2> References: <5195eb09-6d80-ed86-f574-74feb908e0a9@shoulson.com> <20220830091539.21e445c8@JRWUBU2> <351a03f4-9bf7-e19d-e42f-67072f2f4810@code2001.com> <20220901035300.2f4af875@JRWUBU2> Message-ID: On 2022-09-01 2:53 AM, Richard Wordingham via Unicode wrote: > I would wonder though about their contacts for Sanskrit, and for that > matter, Pali, community. I seem to be the one who alerted them to the > fact that they'd overlooked the DD.DDHA conjunct. 
Despite the > statement in TUS and in the 2006 submissions to the UTC from Michael > Everson and Martin Hosken, the Padauk font declines to support <U+1039 > VIRAMA, U+101D WA>. The Sixth Council text of the Tipitaka, or at > least, something declaring itself to be such, is printed using a > triangular WA for the conjuncts, even in words like ??????????? > anvāssaveyyuṃ (Verse 213 in the Dighanikaya, available at > https://www.pali-text-images.net/cst/02-suttantapitaka/06-dighanikaya-1-cst.pdf), > so I suspect negative feedback is limited, and besides, Padauk by > default provides a round glyph for MEDIAL WA. In N3043 the character U+103D MYANMAR CONSONANT SIGN MEDIAL WA was proposed to be a disunification from the sequence U+1039 MYANMAR SIGN VIRAMA plus U+101D MYANMAR LETTER WA. The glyph used in the proposal for U+103D looks mostly circular until it is magnified, at which point it appears vaguely teardrop shaped. In a subsequent Myanmar proposal, N3143, the glyph for U+103D (Medial Wa) is shown as a circle, picture attached. In the current code charts, Medial Wa is teardrop shaped. So : ? + ? ?? ? Since the new character was a disunification rather than a replacement, a Myanmar font should support both the sequence and the Medial Wa character. And the Medial Wa character should display as a teardrop shape rather than a circle. And the virama sequence glyph should display as a circle, which is simply a reduced letter Wa. Is this correct? As Richard has pointed out, the Padauk font does not support the Virama+Wa sequence. Picture attached. Here's the text used to make the Padauk exhibit screen capture: ???????????? - with medial wa ???????????? - with virama+wa -------------- next part -------------- A non-text attachment was scrubbed... Name: 20220901_Myanmar_U103D_N3143.JPG Type: image/jpeg Size: 17806 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 20220901_Padauk_ViramaWa.JPG Type: image/jpeg Size: 24016 bytes Desc: not available URL:
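[Editorial sketch] The two encodings contrasted in the last message, the single character U+103D and the sequence U+1039 plus U+101D, are distinct at the code point level, and normalization does not fold one into the other. A font therefore has to handle each separately. The Python sketch below illustrates this; the base letter U+1000 MYANMAR LETTER KA is chosen only as an example, the helper names are hypothetical, and the character names shown depend on the Unicode data shipped with the Python in use.

```python
import unicodedata

KA = "\u1000"                 # MYANMAR LETTER KA, used here only as a sample base
WITH_MEDIAL_WA = KA + "\u103D"        # base + MYANMAR CONSONANT SIGN MEDIAL WA
WITH_VIRAMA_WA = KA + "\u1039\u101D"  # base + MYANMAR SIGN VIRAMA + MYANMAR LETTER WA


def describe(text):
    """List each code point in text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def canonically_equivalent(a, b):
    """True if the two strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    for label, text in (("medial wa", WITH_MEDIAL_WA), ("virama+wa", WITH_VIRAMA_WA)):
        print(label, describe(text))
    print("canonically equivalent:", canonically_equivalent(WITH_MEDIAL_WA, WITH_VIRAMA_WA))
```

Because the two strings are not canonically equivalent, a renderer or font that supports only one of them will show the other as a visible virama or a fallback shape, which is the behaviour the attached Padauk screen capture is described as showing.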