From wjgo_10009 at btinternet.com Wed Oct 4 12:54:40 2023
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 4 Oct 2023 18:54:40 +0100 (BST)
Subject: Unicode encoding philosophy
Message-ID: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>

I have been reading the following.

https://www.unicode.org/L2/L2023/23212-quotes-svs-proposal.pdf

I am not an expert on this at all. It looks good and I hope it becomes implemented.

What puzzles me though, is that structurally the proposal seems to have much the same encoding philosophy as a suggestion proposed by me in that they both would allow a variation selector to be used so as to conserve in plain text information that is typically these days conserved in rich text and gets lost if plain text is used. In my proposal, using a variation selector to conserve in a plain text document information about the use of italics in some text.

My proposal was rejected, quite strongly.

So, deep down, what please is the Unicode encoding philosophy that allows variation selectors to be used to conserve some information, yet not other information, in plain text?

William Overington

Wednesday 4 October 2023

From junicode at jcbradfield.org Wed Oct 4 15:46:19 2023
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Wed, 4 Oct 2023 21:46:19 +0100 (BST)
Subject: Unicode encoding philosophy
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
Message-ID:

On 2023-10-04, William_J_G Overington via Unicode wrote:
> https://www.unicode.org/L2/L2023/23212-quotes-svs-proposal.pdf
....
> about the use of italics in some text.
>
> My proposal was rejected, quite strongly.
>
> So, deep down, what please is the Unicode encoding philosophy that
> allows variation selectors to be used to conserve some information, yet
> not other information, in plain text?

My understanding is: Variation selectors are not used to conserve meaningful information - they are used to allow specific allographs of a particular grapheme to be displayed, according to language- or region-specific conventions. Encoding italics is different, as the italics carry meaning.

From beckiergb at gmail.com Wed Oct 4 15:58:33 2023
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Wed, 4 Oct 2023 13:58:33 -0700
Subject: Unicode encoding philosophy
In-Reply-To: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
Message-ID:

The alignment of quotation marks in a CJK square is an issue affecting very few characters, with no easy mechanism in markup or rich text formatting, with precedent in the form of SVSes for other punctuation marks used in CJK text.

Italics applies to a large, open-ended set of characters (possibly the entire Unicode character set), has been implemented in just about every form of markup and formatting ever conceived, and has no precedent of implementation using VSes (other than the use of VS15/VS16 for text vs emoji presentation, which even the UTC has determined was a mistake).

-- Rebecca Bettencourt

On Wed, Oct 4, 2023 at 10:58 AM William_J_G Overington via Unicode <unicode at corp.unicode.org> wrote:
> I have been reading the following.
>
> https://www.unicode.org/L2/L2023/23212-quotes-svs-proposal.pdf
>
> I am not an expert on this at all. It looks good and I hope it becomes
> implemented.
>
> What puzzles me though, is that structurally the proposal seems to have
> much the same encoding philosophy as a suggestion proposed by me in that
> they both would allow a variation selector to be used so as to conserve
> in plain text information that is typically these days conserved in rich
> text and gets lost if plain text is used.
In my proposal, using a > variation selector to conserve in a plain text document information > about the use of italics in some text. > > My proposal was rejected, quite strongly. > > So, deep down, what please is the Unicode encoding philosophy that > allows variation selectors to be used to conserve some information, yet > not other information, in plain text? > > William Overington > > Wednesday 4 October 2023 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Thu Oct 5 01:37:49 2023 From: c933103 at gmail.com (Phake Nick) Date: Thu, 5 Oct 2023 14:37:49 +0800 Subject: Unicode encoding philosophy In-Reply-To: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> Message-ID: The differences in quotation mark position among various East Asian languages is not a sort of formatting or rich text elements. The differences in position are not contextually meaningful, however the marks are expected to appear in different positions in different languages, or in the same languages but for users from different regions. Currently, some softwares attempts to deal with this by identifying language and region of the text being input/display, and choosing an appropriate regional variation of the symbol position to display to users. The variation selector will allow the variation to be specified directly. I don't think this can be compared to italic. William_J_G Overington via Unicode ? 2023?10?5??? ??1:57??? > > I have been reading the following. > > https://www.unicode.org/L2/L2023/23212-quotes-svs-proposal.pdf > > I am not an expert on this at all. It looks good and I hope it becomes > implemented. 
> > What puzzles me though, is that structurally the proposal seems to have > much the same encoding philosophy as a suggestion proposed by me in that > they both would allow a variation selector to be used so as to conserve > in plain text information that is typically these days conserved in rich > text and gets lost if plain text is used. In my proposal, using a > variation selector to conserve in a plain text document information > about the use of italics in some text. > > My proposal was rejected, quite strongly. > > So, deep down, what please is the Unicode encoding philosophy that > allows variation selectors to be used to conserve some information, yet > not other information, in plain text? > > William Overington > > Wednesday 4 October 2023 > From cate at cateee.net Thu Oct 5 07:22:56 2023 From: cate at cateee.net (Giacomo Catenazzi) Date: Thu, 5 Oct 2023 14:22:56 +0200 Subject: Unicode encoding philosophy In-Reply-To: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> Message-ID: <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net> On 4 Oct 2023 19:54, William_J_G Overington via Unicode wrote: (...) > What puzzles me though, is that structurally the proposal seems to have > much the same encoding philosophy as a suggestion proposed by me in that > they both would allow a variation selector to be used so as to conserve > in plain text information that is typically these days conserved in rich > text and gets lost if plain text is used. In my proposal, using a > variation selector to conserve in a plain text document information > about the use of italics in some text. > > My proposal was rejected, quite strongly. > > So, deep down, what please is the Unicode encoding philosophy that > allows variation selectors to be used to conserve some information, yet > not other information, in plain text? 

Note: Unicode's philosophy changes over time (you can still see some obsolete formatting tags), partly because real-life problems shifted the target from the ideal to what can be implemented and used by real people.

In any case, Unicode is not magic: you do not simply ask for something and get it. In fact, rendering support is still far from ideal for many "minor" scripts. So: how would you implement your proposal across many operating systems and programs?

Unicode supports some stylistic formatting, but mostly at the character level, where it is handled by fonts, shaping, and rendering. Support at the character level (so with variation selectors) is easy to implement in fonts, and the same mechanism is needed anyway for other, non-Unicode uses (e.g. tabular numbers). Your proposal, by contrast, is disruptive to most layers of text rendering. It would divert the limited resources available for supporting more languages into making text handling more complex for most languages, mostly for the benefit of just a few Western languages.

And we can already format text in italics, e.g. with TeX, LaTeX, HTML, etc. There is also direct support in Unicode: we already have C0 (control block 0), and standards for expressing italics directly.

Note: formatting is important, but it should be done at a different level (we should not repeat the errors of the 1960s-1989s of mixing text and formatting, and of putting formatting in "binary"/code points: we need a verbose, human-readable syntax). IMHO HTML is not good enough for all formatting needs, but I do not think this should be done in Unicode (or at least not at the code point level, but at the UAX level or in something more "independent" like ICU).

Please consider how to implement things. Help to program proofs of concept. Features that will not be implemented would just cause trouble for Unicode (and so add obsolete features). And part of Unicode's success is that it has no flag days. (Think about how programs that know about italics would interact with programs that do not, and about the security implications.)

cate From kent.b.karlsson at bahnhof.se Thu Oct 5 12:31:44 2023 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Thu, 5 Oct 2023 19:31:44 +0200 Subject: Unicode encoding philosophy In-Reply-To: References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> Message-ID: <936488EB-FAC8-4F16-BDBB-D26093C6D26E@bahnhof.se> > 4 okt. 2023 kl. 22:58 skrev Rebecca Bettencourt via Unicode : > > The alignment of quotation marks in a CJK square is an issue affecting very few characters, with no easy mechanism in markup or rich text formatting, with precedent in the form of SVSes for other punctuation marks used in CJK text. > > Italics applies to a large, open-ended set of characters (possibly the entire Unicode character set), has been implemented in just about every form of markup and formatting ever conceived, and has no precedent of implementation using VSes (other than the use of VS15/VS16 for text vs emoji presentation, which even the UTC has determined was a mistake). I have missed that (busy with other things)? But I do not agree that text/emoji variation sequences were a mistake. Indeed it should be extended and systematized. However, the proposal in L2023/23212 would be a major mistake to accept. Most of the ?forms? given are completely different characters, especially those in VS2?Vertical?Hans are completely different from the supposed ?base? characters. (That they are used for more or less the same purpose is not a reason to coalesce them.) Variation sequences with FE00/FE01 are for (in some sense) minor typographical differences that are essentially indifferent (except if you care about typography). (The rotated hieroglyphs went too far?) /Kent K > > -- Rebecca Bettencourt > > > On Wed, Oct 4, 2023 at 10:58?AM William_J_G Overington via Unicode > wrote: > I have been reading the following. > > https://www.unicode.org/L2/L2023/23212-quotes-svs-proposal.pdf > > I am not an expert on this at all. It looks good and I hope it becomes > implemented. 
>
> What puzzles me though, is that structurally the proposal seems to have
> much the same encoding philosophy as a suggestion proposed by me in that
> they both would allow a variation selector to be used so as to conserve
> in plain text information that is typically these days conserved in rich
> text and gets lost if plain text is used. In my proposal, using a
> variation selector to conserve in a plain text document information
> about the use of italics in some text.
>
> My proposal was rejected, quite strongly.
>
> So, deep down, what please is the Unicode encoding philosophy that
> allows variation selectors to be used to conserve some information, yet
> not other information, in plain text?
>
> William Overington
>
> Wednesday 4 October 2023
>

From piotrunio-2004 at wp.pl Sat Oct 7 14:27:12 2023
From: piotrunio-2004 at wp.pl (piotrunio-2004 at wp.pl)
Date: Sat, 07 Oct 2023 21:27:12 +0200
Subject: Reserved character issue
Message-ID:

In https://www.unicode.org/Public/15.1.0/ucd/LineBreak.txt, the character range 20C1-20CF is included and marked as 'reserved':

20C1..20CF ; PR # Cn [15] <reserved-20C1>..<reserved-20CF>

But in https://www.unicode.org/charts/PDF/U20A0.pdf those 15 reserved characters are missing entirely.

Conversely, for U+1FB93, it is the other way around. In https://www.unicode.org/charts/PDF/U1FB00.pdf it is marked as a reserved character, but in https://www.unicode.org/Public/15.1.0/ucd/LineBreak.txt it is missing entirely.

The inclusion of 'reserved' character ranges as if they were normal characters, as well as the inconsistency of 'reserved' markings with the code charts, is making it very difficult to determine the exact set of characters that is in Unicode and also very difficult to determine the exact set of 'reserved' characters as marked by code charts.
In particular, LineBreak.txt gives the impression that there are 211806 characters (excluding control characters, surrogates, and private use) in Unicode 15.1, which is vastly different from the official count of 149,813 characters as announced in https://www.unicode.org/versions/Unicode15.1.0/ .

From junicode at jcbradfield.org Sat Oct 7 14:58:22 2023
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Sat, 7 Oct 2023 20:58:22 +0100 (BST)
Subject: Reserved character issue
References:
Message-ID:

On 2023-10-07, piotrunio-2004 at wp.pl via Unicode wrote:
In mail.unicode, you wrote:
> In https://www.unicode.org/Public/15.1.0/ucd/LineBreak.txt
> , the character range 20C1-20CF is included and marked as
> 'reserved': 20C1..20CF ; PR # Cn [15]
> <reserved-20C1>..<reserved-20CF> But in
> https://www.unicode.org/charts/PDF/U20A0.pdf those 15 reserved characters
> are missing entirely.

From TUS chapter 24, page 951:

Reserved Characters. Character codes that are marked "<reserved>" are unassigned and reserved for future encoding. Reserved codes are indicated by a ? glyph. To ensure readability, many instances of reserved characters have been suppressed from the names list.

From harjitmoe at outlook.com Sun Oct 8 03:02:05 2023
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Sun, 8 Oct 2023 09:02:05 +0100
Subject: kJa property in Unihan
Message-ID:

|Unihan_OtherMappings.txt|[1] contains only seven entries for the /kJa/[2] property (specifically, |U+382F|, |U+4105|, |U+42C6|, |U+459D|, |U+484E|, |U+4B3B| and |U+4C17|). But |JIS-X-0213-FromPrevious.txt|[3] contains 85 instances of IRG /JA/ sources which have been replaced with JIS X 0213 sources.

Is there a reason for this apparent discrepancy, e.g. is the scope of the /kJa/ property narrower than its definition would suggest?

--Har.

[1] https://unicode.org/Public/15.1.0/ucd/Unihan.zip
[2] https://www.unicode.org/reports/tr38/#kJa
[3] https://unicode.org/wg2/iso10646/edition6/data/JIS-X-0213-FromPrevious.txt

From kenwhistler at sonic.net Sun Oct 8 09:46:25 2023
From: kenwhistler at sonic.net (Ken Whistler)
Date: Sun, 8 Oct 2023 07:46:25 -0700
Subject: Reserved character issue
In-Reply-To:
References:
Message-ID:

Julian has provided the explanation. The code charts are produced by tooling that has logic for suppressing the display of overly long ranges of reserved code points in the code charts that would serve no point for display.

When trying to get accurate character counts of any particular type, one should always depend on the data files in the UCD directly, rather than attempting to deconstruct values from the code charts. For counts by General_Category values, including gc=Cn, see

https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

Note that gc=Cn also is not quite the same as "reserved", because that particular gc value combines both reserved code points and noncharacter code points.

For most public purposes other than detailed implementations, there are also somewhat simplified, but handy character count statistics available for every version of the Unicode Standard:

https://www.unicode.org/versions/stats/

Those statistics can be used to answer the general questions such as: "How many characters are in Unicode?"

--Ken

On 10/7/2023 12:58 PM, Julian Bradfield via Unicode wrote:
> From TUS chapter 24, page 951: Reserved Characters. Character codes
> that are marked "<reserved>" are unassigned and reserved for future
> encoding. Reserved codes are indicated by a ? glyph. To ensure
> readability, many instances of reserved characters have been suppressed
> from the names list.
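
Ken's advice to count from the data files rather than the charts can be sketched in code. The Python below is a minimal illustration, not an official tool. It assumes data lines shaped like the LineBreak.txt entry quoted earlier in this thread: a code point or range, a semicolon, a property value, then a comment that begins with the General_Category abbreviation. The function name, the regular expression and the small sample are assumptions made for this example, and totals from a real UCD file would of course differ.

```python
import re

# One data line: "XXXX" or "XXXX..YYYY", then "; value", then "# gc ...".
LINE_RE = re.compile(
    r"^(?P<start>[0-9A-F]{4,6})(?:\.\.(?P<end>[0-9A-F]{4,6}))?"
    r"\s*;\s*(?P<value>\S+)\s*#\s*(?P<gc>\S+)"
)


def count_code_points(lines):
    """Return (total, unassigned) code point counts for the given lines.

    'unassigned' counts ranges whose comment says gc=Cn, which covers
    reserved code points and noncharacters alike.
    """
    total = 0
    unassigned = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # comment lines and blank lines
        start = int(match["start"], 16)
        end = int(match["end"] or match["start"], 16)
        size = end - start + 1
        total += size
        if match["gc"] == "Cn":
            unassigned += size
    return total, unassigned


# Illustrative sample only: the second data line is the one quoted in the
# thread, the first is an entry written in the same style.
SAMPLE = [
    "# sample lines in the style of LineBreak.txt",
    "0024          ; PR # Sc       DOLLAR SIGN",
    "20C1..20CF    ; PR # Cn  [15] <reserved-20C1>..<reserved-20CF>",
]

total, unassigned = count_code_points(SAMPLE)
print(total, unassigned, total - unassigned)  # prints: 16 15 1
```

Counting every listed range, as in `total`, is what inflates the figure; subtracting the gc=Cn ranges gets closer to a count of assigned code points.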

From duerst at it.aoyama.ac.jp Sun Oct 8 19:38:05 2023
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Mon, 9 Oct 2023 09:38:05 +0900
Subject: Unicode encoding philosophy
In-Reply-To: <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net>
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net>
Message-ID: <1d6f95ea-a50d-008b-eae1-8139d59fff93@it.aoyama.ac.jp>

On 2023-10-05 21:22, Giacomo Catenazzi via Unicode wrote:

> Note: formatting is important, but it should be done at different level
> (we should not repeat errors of 1960s-1989s on mixing text and
> formatting, and putting formatting in "binary"/codepoints: we need
> verbose and human readable syntax). IMHO HTML is not good enough for all
> formatting things, but I do not think it should be done at Unicode (or
> at least, not at codepoints, but at UAX level or with more "independent"
> like ICU.

Please note that formatting (and in particular saying what's bold or italic) isn't really the business of HTML, but CSS.

Regards, Martin.

From public at khwilliamson.com Tue Oct 10 11:24:15 2023
From: public at khwilliamson.com (Karl Williamson)
Date: Tue, 10 Oct 2023 10:24:15 -0600
Subject: I'm trying to understand this Word Break test
Message-ID:

In the 15.1 UCD files in the auxiliary folder, there is the WordBreakTest.txt file. It contains the following line:

÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3]

I don't understand how UAX #29 leads to a break between the COMBINING DIAERESIS and the SPACE.

The two relevant rules, I believe, are

Keep horizontal whitespace together.

WB3d WSegSpace × WSegSpace

Ignore Format and Extend characters, except after sot, CR, LF, and Newline. (See Section 6.2, Replacing Ignore Rules.) This also has the effect of: Any × (Format | Extend | ZWJ)

WB4 X (Extend | Format | ZWJ)* → X

Looking at the boundary I mentioned, we have "Extend" followed immediately by "WSegSpace".

The higher priority rules don't involve these classes, so don't apply.

Rule 3d doesn't apply, but Rule 4 does. It says to pretend that the Extend doesn't exist except after certain classes. The character preceding the Extend one is a WSegSpace character, so we get

X Extend → X
WSegSpace Extend → WSegSpace

That means that we are to pretend that the boundary is between

WSegSpace WSegSpace

Rule 3d does apply to this, and says that no break is to happen. But the test says instead Rule 999.0 applies and a break should occur.

Please explain.

From ashpilkin at gmail.com Tue Oct 10 12:01:20 2023
From: ashpilkin at gmail.com (ashpilkin at gmail.com)
Date: Tue, 10 Oct 2023 20:01:20 +0300
Subject: I'm trying to understand this Word Break test
In-Reply-To:
References:
Message-ID:

On Tue, 2023-10-10 at 10:24 -0600, Karl Williamson wrote:
> The two relevant rules, I believe are
>
> Keep horizontal whitespace together.
>
> WB3d WSegSpace × WSegSpace
>
> Ignore Format and Extend characters, except after sot, CR, LF, and
> Newline. (See Section 6.2, Replacing Ignore Rules.) This also has the
> effect of: Any × (Format | Extend | ZWJ)
>
> WB4 X (Extend | Format | ZWJ)* → X
>
> [Rule 4] says to pretend that the
> Extend doesn't exist except after certain classes. The character
> preceding the Extend one is a WSegSpace character, so we get
>
> X Extend → X
> WSegSpace Extend → WSegSpace

Not quite. Section 6.2, referenced in the comment, says that the ignore rule 4 means two things:

- First, don't break before (Extend | Format | ZWJ) unless a preceding (higher-priority) rule mandates that;

- Second, in every *subsequent* (lower-priority) rule, replace every boundary property X by X (Extend | Format | ZWJ)* .

As rule 3d precedes rule 4, we don't get to "pretend" that the combining diaeresis doesn't exist for the purposes of rule 3d, as you say, only for rules 5, ..., 999.
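
The two effects of rule 4 described above can be sketched in code. This is a deliberately tiny model in Python, not an implementation of UAX #29. It knows only three property values, assigned by a hypothetical helper (word_break_property) that recognises just U+0020 and U+0308. It implements only WB1, WB2, WB3d, the first effect of WB4, and WB999, and it does not model the CR/LF/Newline exception. The point is only the ordering: WB3d is tested on the raw neighbours before WB4 gets a chance to hide the Extend character.

```python
IGNORED_BY_WB4 = {"Extend", "Format", "ZWJ"}


def word_break_property(ch):
    """Hypothetical stand-in for a real Word_Break property lookup."""
    if ch == "\u0020":
        return "WSegSpace"
    if ch == "\u0308":
        return "Extend"
    return "Other"


def break_positions(text):
    """Return the offsets at which this simplified model places a break."""
    props = [word_break_property(ch) for ch in text]
    if not props:
        return []
    breaks = [0]  # WB1: break at start of text
    for i in range(1, len(props)):
        left, right = props[i - 1], props[i]
        # WB3d comes first and sees the real neighbours, Extend included.
        if left == "WSegSpace" and right == "WSegSpace":
            continue
        # WB4, first effect: never break before Extend, Format or ZWJ.
        if right in IGNORED_BY_WB4:
            continue
        # WB4, second effect (skipping ignored characters) would apply to
        # rules 5 and later; none of those are modelled here.
        # WB999: otherwise, break everywhere.
        breaks.append(i)
    breaks.append(len(props))  # WB2: break at end of text
    return breaks


print(break_positions("\u0020\u0308\u0020"))  # prints: [0, 2, 3]
print(break_positions("\u0020\u0020"))        # prints: [0, 2]
```

The first result matches the test line: no break between the space and the diaeresis, a break between the diaeresis and the second space. Two adjacent spaces, by contrast, are kept together by WB3d.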

Thus rule 3d does not apply anywhere, then rule 4 applies between the first space and the combining diaeresis, then 999 applies between the diaeresis and the second space.

(And IIUC this makes some sense: putting a combining accent on a space is a way to typeset that combining accent by itself that doesn't require its standalone form to be encoded separately.)

--
Good luck,
Alex

From ecm.unicode at gmail.com Tue Oct 10 19:39:55 2023
From: ecm.unicode at gmail.com (Erik Carvalhal Miller)
Date: Tue, 10 Oct 2023 20:39:55 -0400
Subject: Unicode encoding philosophy
In-Reply-To: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
Message-ID:

The Unicode Standard's core specification, in chapter 23 ("Special Areas and Format Characters"), §23.4 ("Variation Selectors"), is a little vague about variation sequences, stating that in "special circumstances" within "plain text contexts" they are used "for specifying a restriction on the set of glyphs that are used to represent a particular character". My take is that they are used in situations where the base character and its selectable variant(s) definitely represent the same abstract character identity yet, in some contexts at least, don't entirely share 100% exactly the same identity, a sort of have-your-cake-and-eat-it-too. For some such situations, Unicode might designate distinct code points instead; it seems that, when the UTC is making an encoding decision about the sort of situation that might call for variation selectors, the ultimate choice between variation selectors and distinct code points is influenced by questions of compatibility and preëxisting encodings. With the base characters at issue in L2/23-212 already well established in Unicode, it makes sense to consider variation sequences for distinguishing the desired behaviors instead of assigning new code points.
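
As a concrete illustration of "same abstract character, restricted set of glyphs", here is a small Python sketch using the text/emoji presentation selectors VS15 and VS16 mentioned earlier in the thread. The helper name strip_variation_selectors and the choice of U+2764 as the base character are assumptions made for the example. The selector ranges cover VS1 to VS256 only and leave out the Mongolian free variation selectors.

```python
import unicodedata

# VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF).
VARIATION_SELECTORS = frozenset(range(0xFE00, 0xFE10)) | frozenset(
    range(0xE0100, 0xE01F0)
)


def strip_variation_selectors(text):
    """Drop variation selectors, leaving only the base characters."""
    return "".join(ch for ch in text if ord(ch) not in VARIATION_SELECTORS)


BASE = "\u2764"                # HEAVY BLACK HEART
TEXT_STYLE = BASE + "\uFE0E"   # base + VS15: request text presentation
EMOJI_STYLE = BASE + "\uFE0F"  # base + VS16: request emoji presentation

for label, sequence in (("text", TEXT_STYLE), ("emoji", EMOJI_STYLE)):
    names = [unicodedata.name(ch) for ch in sequence]
    print(label, len(sequence), names)

# Both sequences carry the same base character; only the selector differs.
print(strip_variation_selectors(TEXT_STYLE) == strip_variation_selectors(EMOJI_STYLE))
```

Each sequence is two code points in plain text, and a process that ignores the selector still sees the same base character, which is the have-your-cake-and-eat-it-too property described above.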

It turns out that the Unicode Standard does formally and explicitly declare a set of encoding principles, in the core specification's chapter 2 "General Structure", §2.2 "Unicode Design Principles". One of them in particular, Plain Text, would appear to be key in the opposition to a scheme for folding italicization, a rich-text feature, into Unicode's character-encoding standard.

But, you may ask, what about the mathematical Latin and Greek alphanumeric symbols, rich in implied typographical styles including italic? There's a temptation to consider Unicode's acceptance of those pesky symbols, which the Consortium has emphasized is not a precedent for the inclusion of further plain-text italics within the Standard, as an exception to the rules; but actually I find the symbols' encoding quite consistent with overall Unicode logic. I trust you're on board with the notion of including them in the standard one way or another on the basis of distinct semantics, that for example an italic variable A can mean something distinct from a bold variable A. Let's consider an equation that you'll probably recognize, font support willing: ??=????. Thanks to the power of Unicode, we could use it in the same plain-text document as, say, ??=???? while keeping both equations distinct. So, you may be thinking, that's what you want with a more generalized italics scheme, via variation selectors; after all, in nontechnical text, styles such as italics convey meaning such as "emphasis" or "title of a work" or "section heading" or "foreign expression". But there is a fundamental difference! In the math symbols, the stylization helps define a character's distinct identity, so that we don't mix up variables ? and ?; in more general text usage, the styles don't change the character identity, but rather the styles themselves convey meaning independently of any characters appearing in the styled run.
This becomes more obvious with styles such as outlining and background color where none of the actual glyphs change and any spacing invisible characters (such as U+0020 SPACE) are clearly part of the style run. In normal italic styling, yes, the visible characters' glyphs do change, but they do so because those characters are in the midst of an italic run with a beginning and an end, not because a letter such as E in the midst of such a run has a different identity from that it has outside such a run. For the math symbols, choices such as italic or bold are character-by-character decisions; for example, in ??=????, the ? and the ?, though adjacent, are each independently italic (compare with the commutatively equivalent ??=????, where the superscript 2 remains upright), whereas in more general text usage, adjacent italics such as in an italicized word "emcee" are not a character-by-character decision but the result of a decision to italicize a whole span of text.

And that brings us to another of Unicode's design principles, Logical Order. For general text, italics span a run with a defined beginning and end, such as in the HTML representation annus mirabilis; to use a character-by-character representation such as annusmirabilis or &aital;&nital;&nital;&uital;&sital;&mital;&iital;&rital;&aital;&bital;&iital;&lital;&iital;&sital; or a&VS14;n&VS14;n&VS14;u&VS14;s&VS14;m&VS14;i&VS14;r&VS14;a&VS14;b&VS14;i&VS14;l&VS14;i&VS14;s&VS14; is counterintuitive and, I can now attest, tedious. Such runs of italics are inherently stateful in their conceptualization, and rich text implements them statefully. This statefulness applies even to pre-computer days of metal type: A compositor about to set a run of italic type would turn to a case of italics from which to pick out the next several glyphs, then turn back to a non-italic case when that run was complete, rather than serially start and finish using the italic case many times in a row.
To encode spans of italics character?by?character, whether with variation sequences or atomic characters, violates the logical order of general text. And then, besides Plain Text and Logical Order, there?s the Stability principle, which comes into play with canonical equivalence when you start to play with canonical composition and decomposition of the various accented characters to which italics should be applicable (as alluded to in the aforementioned chapter 23, ?23.4). But I urge you to give those design principles a look. On Wed, Oct 4, 2023 at 1:59?PM William_J_G Overington via Unicode wrote: > > I have been reading the following. > > https://www.unicode.org/L2/L2023/23212-quotes-svs-proposal.pdf > > I am not an expert on this at all. It looks good and I hope it becomes > implemented. > > What puzzles me though, is that structurally the proposal seems to have > much the same encoding philosophy as a suggestion proposed by me in that > they both would allow a variation selector to be used so as to conserve > in plain text information that is typically these days conserved in rich > text and gets lost if plain text is used. In my proposal, using a > variation selector to conserve in a plain text document information > about the use of italics in some text. > > My proposal was rejected, quite strongly. > > So, deep down, what please is the Unicode encoding philosophy that > allows variation selectors to be used to conserve some information, yet > not other information, in plain text? 
> > William Overington > > Wednesday 4 October 2023 > From cate at cateee.net Wed Oct 11 03:02:55 2023 From: cate at cateee.net (Giacomo Catenazzi) Date: Wed, 11 Oct 2023 10:02:55 +0200 Subject: Unicode encoding philosophy In-Reply-To: <1d6f95ea-a50d-008b-eae1-8139d59fff93@it.aoyama.ac.jp> References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net> <1d6f95ea-a50d-008b-eae1-8139d59fff93@it.aoyama.ac.jp> Message-ID: <3853f461-7033-469a-a285-b107157d721c@cateee.net> On 9 Oct 2023 02:38, Martin J. D?rst via Unicode wrote: > On 2023-10-05 21:22, Giacomo Catenazzi via Unicode wrote: > >> Note: formatting is important, but it should be done at different >> level (we should not repeat errors of 1960s-1989s on mixing text and >> formatting, and putting formatting in "binary"/codepoints: we need >> verbose and human readable syntax). IMHO HTML is not good enough for >> all formatting things, but I do not think it should be done at Unicode >> (or at least, not at codepoints, but at UAX level or with more >> "independent" like ICU. > > Please note that formatting (and in particular saying what's bold or > italic) isn't really the business of HTML, but CSS. It is complex, and probably difficult to define "formatting". CSS for sure it is used for the *realisation* of the formatting. The rest is complex. CSS can do something autonomously (e.g. :first-child), but on most cases you should define formatting limits in HTML (tags, classes, id). As example I do not think is it appropriate to use class='red' in HTML to tell CSS how to justify a box. Additionally some HTML tags are about formatting

, etc. (paragraph, chapter title, etc.). We can argue about semantic and style separation in HTML and CSS. But for Unicode both are on the other side of the line. We lack CSS-like styling, but also the way to express semantic separation (but for some ruby). We may find that ASCII provide different level of separations (FS, GS, RS, US, but also EM, FF, CR/LF, and also SPACE), or with ECMA, more about style (but as I found in Wikipedia, each terminal has own interpretation of "red" and "highlight red", and used may redefine palettes black on white vs white on black), but that is also outside Unicode (just some support to do it transparently on a different layer). We have different definition of "formatting" compared to W3C, and sometime that cause big problems, e.g. the problematic , (on few contexts, it seems W3C consider subscript as text, so lack of supports e.g. units on drop down menus using best-practices of Unicode). Or overlap on some fields (e.g. HTML: text directions, ruby; CSS: Variant selector) So: you cannot use Unicode strings with CSS to get a nice formatted text. Personally: I feel in future we need a generic markup language (as an Unicode-like project, with a large intent: for every living language, or in this case also about writing rules: not just articles, but signs, wood inscriptions, etc. which sometime have different rules). But it is a huge task, and I think more complex than Unicode tasks, so not for "today". And not a think should do Unicode (but ev. in a side entity). But for italic: why not use just HTML/CSS (which has good support to Latin scripts (and Western scripts in general) which requires use it. Or just ECMA, until we get resources (and possibly after we solved also the font problems). But also this last fact may give us some hints: why we do not use ECMA anymore for such formatting? For sure Microsoft knew it very well e.g. 
for Microsoft Word (it originated as MS-DOS (so console) program and later it had various parallel versions with the GUI console). But also on other cases. I really suspect such formatting is in the wrong layer, so it will not easy to program and to develop file formats. Also with such proposal: it is not enough expressive for all cases, and so it would be a special case so just making code more complex for what gain? ciao cate From pkar at ieee.org Wed Oct 11 03:28:27 2023 From: pkar at ieee.org (Piotr Karocki) Date: Wed, 11 Oct 2023 10:28:27 +0200 Subject: Unicode encoding philosophy In-Reply-To: <3853f461-7033-469a-a285-b107157d721c@cateee.net> References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net> <1d6f95ea-a50d-008b-eae1-8139d59fff93@it.aoyama.ac.jp> <3853f461-7033-469a-a285-b107157d721c@cateee.net> Message-ID: <5ec55d60fe6f4b40a8f98f52d5c7449f@mail.gmail.com> > Additionally some HTML tags are about formatting

, etc. I disagree. HTML is about text structure, not about formatting/rendering.

is rendered differently for different output devices: monitor, printer, Braille 'display', narrator (text to voice), etc. Unicode should be used to specify character/glyph/sign, HTML to add text structure, and CSS used only to force rendering (so it should be used very rarely). Maybe also semantic web (RDF), level between Unicode and HTML. From wikipedia:
Paul Schuster was born in Dresden.
---8<--- Piotr Karocki From kent.b.karlsson at bahnhof.se Wed Oct 11 03:51:20 2023 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 11 Oct 2023 10:51:20 +0200 Subject: Unicode encoding philosophy In-Reply-To: <5ec55d60fe6f4b40a8f98f52d5c7449f@mail.gmail.com> References: <5ec55d60fe6f4b40a8f98f52d5c7449f@mail.gmail.com> Message-ID: <4291D527-0400-4792-9BAF-7B9E326066E4@bahnhof.se> > 11 okt. 2023 kl. 10:30 skrev Piotr Karocki via Unicode : > > ? >> >> Additionally some HTML tags are about formatting

, etc. > I disagree. > HTML is about text structure, not about formatting/rendering. >

is rendered differently for different output devices: monitor, printer,
> Braille 'display',

Braille is not a format. To “display” in Braille you need a text transformation. And that transformation is language dependent.

> narrator (text to voice), etc.
>
> Unicode should be used to specify character/glyph/sign, HTML to add text
> structure, and CSS used only to force rendering (so it should be used very
> rarely).

In the real world, CSS is used VERY much.

/k

> Maybe also semantic web (RDF), level between Unicode and HTML. From
> wikipedia:
>
> Paul Schuster was born in
> href="https://www.wikidata.org/entity/Q1731">
> Dresden.
>
> ---8<---
> Piotr Karocki

From pkar at ieee.org Wed Oct 11 05:13:13 2023
From: pkar at ieee.org (Piotr Karocki)
Date: Wed, 11 Oct 2023 12:13:13 +0200
Subject: Unicode encoding philosophy
In-Reply-To: <4291D527-0400-4792-9BAF-7B9E326066E4@bahnhof.se>
References: <5ec55d60fe6f4b40a8f98f52d5c7449f@mail.gmail.com> <4291D527-0400-4792-9BAF-7B9E326066E4@bahnhof.se>
Message-ID:

>> and CSS used only to force rendering (so it should be used very
>> rarely).
> In the real world, CSS is used VERY much.

Yes, it is used very much, because many want to transfer not data (text), but visual effects :)

From cate at cateee.net Wed Oct 11 05:41:49 2023
From: cate at cateee.net (Giacomo Catenazzi)
Date: Wed, 11 Oct 2023 12:41:49 +0200
Subject: Unicode encoding philosophy
In-Reply-To: <5ec55d60fe6f4b40a8f98f52d5c7449f@mail.gmail.com>
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net> <1d6f95ea-a50d-008b-eae1-8139d59fff93@it.aoyama.ac.jp> <3853f461-7033-469a-a285-b107157d721c@cateee.net> <5ec55d60fe6f4b40a8f98f52d5c7449f@mail.gmail.com>
Message-ID: <2fee5a33-d4e0-412a-907d-47954bdb174c@cateee.net>

On 11 Oct 2023 10:28, Piotr Karocki via Unicode wrote:
>> Additionally some HTML tags are about formatting

, etc. > I disagree. > HTML is about text structure, not about formatting/rendering. >

is rendered differently for different output devices: monitor, printer,
> Braille 'display', narrator (text to voice), etc.

Which font size? Which "version" of Braille? Which voice (male/female, accent)? Etc.

In any case, you have a different definition of "formatting". Maybe we should stop using that word and use instead "plain text", "structure", "style", "rendering" (with a lower risk of misinterpretation). You consider only the last two to be "formatting". I consider everything above "plain text" to be "formatting". Two empty lines in an email mean /new paragraph/ and should be displayed so. Should I really distinguish that from italic or bold (so using slash, or asterisk)? So depending on the application, we have different terminologies. Seldom must we distinguish so many steps. This group is one where the distinction is important.

Note: Unicode category Cf (Other, format) includes various "structure" characters (so like HTML and not CSS "features").

Note: On Microsoft Windows, "Paste without Formatting" is mostly plain text, and some structure (new lines, lists) but not much more. So that is a third definition of "formatting".

> Unicode should be used to specify character/glyph/sign, HTML to add text
> structure, and CSS used only to force rendering (so it should be used very
> rarely).

I think that is Manichaean. It may be the aim, but human languages are too diverse and complex to allow a perfect split of domains. Thematically, too, it is difficult (and artificial) to split in such a manner.

Final rendering requires an additional step after "CSS", usually done by different engines: layout/shaping/font rendering. And the Unicode Standard also touches this part. Interaction of characters is an important topic in the Unicode Standard: when to form ligatures and graphemes (and grapheme clusters), how to avoid them (ZWJ/ZWNJ, variation selectors, etc.). Such rendering decisions are intrinsic to how scripts and languages are written (the topic of the Unicode Standard).
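
The remarks above about category Cf, the join controls and the separators can be checked with a few lines of Python. This is only a small sketch using the standard `unicodedata` module; the `SAMPLES` list and the `categories` helper are hand-picked names for this illustration, not anything defined by the Unicode Standard.

```python
import unicodedata

# Hand-picked sample of characters mentioned in the thread (an assumption
# made for this sketch, not a list taken from the Unicode Standard).
SAMPLES = [
    ("\u200d", "ZERO WIDTH JOINER"),
    ("\u200c", "ZERO WIDTH NON-JOINER"),
    ("\u00ad", "SOFT HYPHEN"),
    ("\ufe0e", "VARIATION SELECTOR-15"),
    ("\u2028", "LINE SEPARATOR"),
    ("\u2029", "PARAGRAPH SEPARATOR"),
    ("\u000a", "LINE FEED"),
    ("\u0020", "SPACE"),
]


def categories(samples):
    """Map each label to the General_Category of its character."""
    return {label: unicodedata.category(char) for char, label in samples}


if __name__ == "__main__":
    for label, category in categories(SAMPLES).items():
        print(f"{category}  {label}")
```

In this sample the join controls and the soft hyphen report Cf, the variation selector reports Mn, LINE SEPARATOR and PARAGRAPH SEPARATOR report Zl and Zp, LINE FEED reports Cc and SPACE reports Zs, so "structure-like" characters are spread over several categories rather than sitting in one layer.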
So, some styling decisions are made at the level of Unicode. On the other hand, some are not made at the Unicode level (ligatures: you may get them in a roman font but not in a typewriter font, and obviously there are more, and different ones, in cursive fonts).

Maybe we should see Unicode as the last step: HTML (structure), CSS (rendering), and Unicode as glyph selection. That is more in line with reality: words are just black boxes until rendering, we cannot format an accent in red with a black base character: the formatting stage, also in HTML, happens before getting to the Unicode "Combining" category. (Unicode doesn't mandate a glyph, but it describes possible ligatures and real-world cases; the decision is off-loaded to font designers, but the infrastructure is in Unicode.)

Note: I see what you want to tell us. I just think HTML/CSS cannot be a generic (for all languages/uses) markup language/style (and if we expand it for such a task, the outcome will become ugly). But again: a task for the future.

ciao
cate

Appendix: some special cases about strict layering.

Unicode has "forms" (as blocks and with variation selectors). Is it wrong to have them? (They should be moved to CSS, but CSS has no idea about glyphs.)

Spaces and new lines are considered control characters (OK, "spaces" may have a double meaning). So that is already a strange case, but we can just interpret it as an "escape-like" "sequence" at a lower layer.

Some box characters, and many technical symbols, require some formatting (alignment of lines when other symbols are nearby [right/left/above/below and possibly diagonals]). In such cases the semantics of the character put strong requirements on structure and rendering. (And you can change the charmap, but then you can also have a different font rendering/engine, even with very specific CSS.)

SHY (soft hyphen, U+00AD or HTML &shy;): is it structure? Style? Glyph semantics?

And semantics (which you just uncovered with your last paragraph) is also a problem. For now, no Unicode/HTML/CSS can style currency or numbers in my personal way: CSS doesn't know what currencies are (that part of the text), HTML doesn't mandate tagging them differently, and Unicode may only help by providing a small space character (but that is also not so useful).

From wjgo_10009 at btinternet.com Wed Oct 11 13:51:08 2023
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 11 Oct 2023 19:51:08 +0100 (BST)
Subject: Unicode encoding philosophy
In-Reply-To:
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
Message-ID: <79f1562f.1668.18b2014137c.Webtop.83@btinternet.com>

Erik Carvalhal Miller wrote as follows.

> But, you may ask, what about the mathematical Latin and Greek
> alphanumeric symbols, rich in implied typographical styles including
> italic?

Thank you for replying. Yet I did not. As a mathematician I understood the difference of approach. I never had any intention of seeking to try to use the encoding of those characters as a precedent. That concept got refuted as a possible precedent notwithstanding that I had not contemplated trying to do that.

> This statefulness applies even to pre-computer days of metal type: A
> compositor about to set a run of italic type would turn to a case of
> italics from which to pick out the next several glyphs, then turn back
> to a non-italic case when that run was complete, rather than serially
> start and finish using the italic case many times in a row.

If the type were being handset, that may well be true in relation to the practical use of typecases, with the compositor perhaps needing to move to work at a different table where the typecase or typecases for italic glyphs had been placed, the compositor perhaps needing to use separate typecases for uppercase letters and punctuation, and for lowercase letters and punctuation. I do not know at present how the process worked if the compositor were using machine composition using a Linotype Machine or a Monotype machine.
I learned to handset metal type back in the 1960s as I was involved in private press printing as a family hobby. Metal type in use was not stateful regarding italics as each piece of metal type in a sequence of handset metal type was an independent unit and there was no state set anywhere that insisted that the next piece of type after an italic character would be an italic character as sometimes it would not be, such as after an italicized word had been completed, and indeed, except for a very few special founts such as Palace Script, the spaces used between words were the same both for the roman fount and for the italic fount. For the avoidance of doubt, there were space sorts available in the italic case, for convenience when setting a sequence of words in italic sorts, yet they were from the same purchase of spacing material from the type foundry and were fount-independent too in relation to founts of the same point size, except for a few founts such as Palace Script that had special angled space sorts. A modern computer desktop publishing text editor program could allow a compositor to switch from roman to italic for a sequence of characters and yet allow, as an option, the text to be output to a file as plain text with a VS14 character after each of the italicized characters, so in practice, if implemented, there need not be any tediousness in applying a VS14 character after each text character. I say that allowing the VS14 proposal to become encoded would be practical and would not make a string of Unicode characters stateful. In days gone by, suggesting a character to switch on italics and a character to switch off italics was rejected as it would have made Unicode stateful. So, as time went by and I learned, a method to achieve the result without making Unicode stateful was devised, tested and it worked great, but that was rejected too, because it was not stateful! 
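
The per-character marking described above can be sketched in a few lines of Python. This only illustrates the suggestion under discussion and is not anything defined by Unicode: U+FE0D is VARIATION SELECTOR-14, but no standardized variation sequence gives it an italic meaning, and the helper names `mark_italic`, `split_runs` and `strip_marks` are hypothetical. The sketch also ignores combining marks and other complications.

```python
VS14 = "\uFE0D"  # VARIATION SELECTOR-14; the "italic" meaning is the proposal's assumption


def mark_italic(text: str) -> str:
    """Append VS14 after every character of `text` (what an editor could do on export)."""
    return "".join(char + VS14 for char in text)


def split_runs(encoded: str) -> list[tuple[str, bool]]:
    """Recover (run, is_italic) pairs.

    Each character is classified only by looking at the code point that
    immediately follows it, so no italic-on/italic-off state is carried
    through the string.
    """
    runs: list[tuple[str, bool]] = []
    index = 0
    while index < len(encoded):
        char = encoded[index]
        italic = index + 1 < len(encoded) and encoded[index + 1] == VS14
        index += 2 if italic else 1
        if runs and runs[-1][1] == italic:
            # Same style as the previous character: extend the current run.
            runs[-1] = (runs[-1][0] + char, italic)
        else:
            runs.append((char, italic))
    return runs


def strip_marks(encoded: str) -> str:
    """Drop every VS14, leaving the unmarked text."""
    return encoded.replace(VS14, "")


if __name__ == "__main__":
    line = "The " + mark_italic("Times") + " today"
    print(split_runs(line))
    print(strip_marks(line))
```

The cost is one extra code point per italic character, which is the trade-off the thread argues about: nothing stateful, but the marking is spread across every character of the run.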
Actually, my main reason for wanting to be able to encode italics in plain text was to be able to transcribe historical texts into plain text on a computer, including such things as the title pages of printed books.

I consider that, alas, an opportunity for progress has been dismissed due to adherence to concepts from long ago that are not relevant in some modern usage situations. The capabilities of plain text could be improved if that were allowed.

William Overington

Wednesday 11 October 2023

From kent.b.karlsson at bahnhof.se Wed Oct 11 16:48:14 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 11 Oct 2023 23:48:14 +0200
Subject: Unicode encoding philosophy
In-Reply-To:
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
Message-ID:

> On 11 Oct 2023, at 02:39, Erik Carvalhal Miller via Unicode wrote:
> Let's consider an equation that you'll probably recognize, font support
> willing: ??=????. Thanks to the power of Unicode, we could use it
> in the same plain-text document as, say, ??=???? while keeping both

That's not really a proper way of representing math expressions. For one thing, compatibility normalisation would ruin them (true, one is not supposed to apply that, which I agree with, but it sometimes is applied anyway). Another thing is that these "mathematical" characters were added because the (semantically, in some sense of that word, significant) style differences would be too verbose to express "properly" in MathML. (Still, MathML does not really use them, it seems.) And, without any kind of structural coding, only very limited (and very linear) math expressions can be given like that.
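
The effect of compatibility normalisation on such characters can be seen with Python's standard `unicodedata` module. The sample string below is a hypothetical one chosen for this sketch (Mathematical Italic letters plus SUPERSCRIPT TWO); it is not necessarily the equation from the quoted message, which did not survive in the archive.

```python
import unicodedata

# Hypothetical sample: MATHEMATICAL ITALIC CAPITAL E, "=", MATHEMATICAL ITALIC
# SMALL M, MATHEMATICAL ITALIC SMALL C, SUPERSCRIPT TWO.
STYLED = "\U0001D438 = \U0001D45A\U0001D450\u00B2"


def normalise_all(text: str) -> dict[str, str]:
    """Return the text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


if __name__ == "__main__":
    for form, result in normalise_all(STYLED).items():
        changed = "changed" if result != STYLED else "unchanged"
        print(f"{form}: {result!r} ({changed})")
```

The canonical forms (NFC, NFD) leave the sample untouched, while the compatibility forms (NFKC, NFKD) fold it to plain ASCII letters and a digit 2, which is the loss of distinction being discussed.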
If you are interested in representing math expressions much more generally in (otherwise) plain text (or in the context of ECMA-48 formatting, or even in an HTML or SVG context), I did make a proposal for that: see https://github.com/kent-karlsson/control/blob/main/math-layout-controls-2023-B.pdf .

/Kent K

From kent.b.karlsson at bahnhof.se Wed Oct 11 16:48:16 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 11 Oct 2023 23:48:16 +0200
Subject: Unicode encoding philosophy
In-Reply-To: <3853f461-7033-469a-a285-b107157d721c@cateee.net>
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net> <1d6f95ea-a50d-008b-eae1-8139d59fff93@it.aoyama.ac.jp> <3853f461-7033-469a-a285-b107157d721c@cateee.net>
Message-ID: <0D9669E9-AF61-4284-897C-B717DF5DC2FA@bahnhof.se>

> On 11 Oct 2023, at 10:02, Giacomo Catenazzi via Unicode wrote:
>
> We may find that ASCII provide different level of separations (FS, GS, RS, US,

As far as I know, NOBODY is using these anymore. But I may be wrong; really old applications do not count, nor do EBCDIC ones (which also fall in the "really old" category). Note however that Unicode does not really have these; Unicode (referencing ECMA-48; in the ISO/IEC version) has IS1, IS2, IS3, IS4, and they have no pre-defined hierarchy. A hierarchy, if any (there need not be one), is defined by the application. So Unicode was wrong in equating them as aliases.

> but also EM, FF, CR/LF,

Nobody (I hope) is using EM. But FF, CR, LF are of course commonly used. LS and PS (Unicode replacements) never gained popularity, for compatibility reasons; they will likely never gain popularity.

> and also SPACE), or with ECMA, more about style (but as I found in Wikipedia, each terminal has own interpretation of "red" and "highlight red",

Yes, that is annoying...
See https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2023B.pdf , esp. page 31. 35 or so years ago there were technical limitations, but for many displays we do not have those today.

> But also this last fact may give us some hints: why we do not use ECMA anymore for such formatting?

By ECMA, I assume you mean ECMA-48 (as I assume above). ECMA-48 *is* very commonly used. Unfortunately only in terminal emulators. There is no need for that limitation. The formatting part (modernised) may well be used in text files as well. See https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2023B.pdf . (ECMA-48 has other stuff as well, in particular for keyboard input, as well as "terminal screen editing" (and those are used for terminal emulators). These are of course not suitable for text files with formatting. But ECMA-48 is a mix of stuff.)

> For sure Microsoft knew it very well e.g. for Microsoft Word

1) ECMA-48 (even with the modernisation proposed in https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2023B.pdf ) is *far* from sufficient for such things as (full-fledged) document formatting, spreadsheets, and so on. But one does not always need full-fledged document formatting (such as HTML/CSS, Word, etc.). A much more light-weight formatting system is often useful.

2) ECMA-48 has not been updated for over 30 years. I think that is a pity. It is not at all a bad standard. (It even has support for Ruby, which you mentioned.) But with an update I think it may well be used for "light-weight formatted" text. I think there is a gap to fill between (full-fledged) document format apps (including HTML/CSS, which has lots and lots of capabilities and is hard to implement in full) and plain text apps, which do not even allow italics or bold, or the slightest font size change (for a heading, for instance), and that gap may well be filled by using ECMA-48 in modernised form.
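
For readers who have not met ECMA-48 styling outside a terminal, here is a minimal Python sketch of what italics look like in that scheme. It uses only the existing SGR parameters 3 (italicized) and 23 (not italicized) in the 7-bit `ESC [` form of CSI; it does not implement anything from the modernisation proposal linked above, and the helper names `italic` and `strip_csi` are hypothetical.

```python
import re

CSI = "\x1b["             # 7-bit form of CONTROL SEQUENCE INTRODUCER (ESC [)
ITALIC_ON = CSI + "3m"    # SGR 3: italicized
ITALIC_OFF = CSI + "23m"  # SGR 23: not italicized, not fraktur

# A CSI control sequence: parameter bytes 0x30-0x3F, then intermediate bytes
# 0x20-0x2F, then one final byte 0x40-0x7E.
CSI_SEQUENCE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def italic(text: str) -> str:
    """Wrap `text` in SGR italic-on / italic-off control sequences."""
    return f"{ITALIC_ON}{text}{ITALIC_OFF}"


def strip_csi(text: str) -> str:
    """Remove all CSI control sequences, leaving the unstyled text."""
    return CSI_SEQUENCE.sub("", text)


if __name__ == "__main__":
    line = "The " + italic("Times") + " today"
    print(line)             # shown in italics by terminal emulators that support SGR 3
    print(strip_csi(line))
```

Unlike a per-character marker, this is a stateful scheme: the italic style stays in effect until a later control sequence turns it off.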
(RTF is not all that attractive...) And ECMA-48 (or rather its ISO equivalent) is referenced by Unicode and ISO/IEC 10646.

/Kent K

From doug at ewellic.org Wed Oct 11 17:37:16 2023
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 11 Oct 2023 22:37:16 +0000
Subject: Compatibility normalization (was: RE: Unicode encoding philosophy)
In-Reply-To:
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
Message-ID:

Kent Karlsson wrote:

>> Let's consider an equation that you'll probably recognize, font
>> support willing: ??=????. Thanks to the power of Unicode, we could
>> use it in the same plain-text document as, say, ??=???? while
>> keeping both
>
> That's not really a proper way of representing math expressions.
> For one thing, compatibility normalisation would ruin them (true,
> one is not supposed to apply that, which I agree with, but it
> sometimes is anyway).

I see this claim from time to time, and not only from Kent: we must not use character (sequence) X, or must not use it in contrast with character (sequence) Y which is compatibility-equivalent to X, because some random, unknown process might surreptitiously apply NFKC or NFKD to the text, obliterating the distinction.

Can Kent, or anyone else, please identify a *specific* program or process that does this?

If there are no attested, real-world examples of processes actually applying NFKC or NFKD behind the user's back (which would indeed be evil), I'm likely to write this off as an urban myth.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From graphity at adelaide.on.net Wed Oct 11 18:14:38 2023
From: graphity at adelaide.on.net (Kevin Brown)
Date: Thu, 12 Oct 2023 09:44:38 +1030
Subject: My List delivery options seemed to have spontaneously reset.
Message-ID: <73fe5b1a-ba76-1d7c-e847-f055725514fb@adelaide.on.net>

Hello Unicode List

I've started getting emails from the mailing list every time someone posts.
I've had about 5 today so far. I just used to get occasional digests, which is all I want. I've tried to reset to my former options but can't see how; it seems extremely clunky. Please tell me how I can go back to my previous settings, or reset them for me.

Many thanks
Kevin Brown

--
*************************************************
Kevin Brown's G R A P H I T Y !
DIGITAL TYPE SPECIALIST * GRAPHIC DESIGN
180 Marian Road, Glynde S.A. 5070 AUSTRALIA
Phone: +61 (0)8 8365 6544
A.B.N. 23 260 697 910
DUNS 743862206
Email: graphity at adelaide.on.net
Est.1979
Member: The Unicode Consortium
Australian Graphic Design Association (AGDA)
www.australianschoolfonts.com.au
*************************************************

From jamessmn46 at gmail.com Thu Oct 12 03:28:50 2023
From: jamessmn46 at gmail.com (James Simeon)
Date: Thu, 12 Oct 2023 04:28:50 -0400
Subject: Address
Message-ID:

795 abercorn dr Atlanta ga 30331 y'all sent my package to the wrong house

-------------- next part -------------- An HTML attachment was scrubbed...
From orenwatson at tutanota.com Thu Oct 12 04:33:32 2023
From: orenwatson at tutanota.com (orenwatson at tutanota.com)
Date: Thu, 12 Oct 2023 11:33:32 +0200 (CEST)
Subject: Unicode encoding philosophy
In-Reply-To:
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net> <1d6f95ea-a50d-008b-eae1-8139d59fff93@it.aoyama.ac.jp> <3853f461-7033-469a-a285-b107157d721c@cateee.net> <5ec55d60fe6f4b40a8f98f52d5c7449f@mail.gmail.com> <2fee5a33-d4e0-412a-907d-47954bdb174c@cateee.net>
Message-ID:

replied off list, whoops.

> words are just black boxes until rendering, we cannot format an accent in red with a black base character

This depends on the layout system. XeLaTeX can absolutely do this. for example here in greek. the unicode text

?{\color{red}?} ??{\color{red}?}??? ???????{\color{red}?}?????? ??{\color{red}?}{\color{red}?}?????, ?{\color{red}?} ????????{\color{red}?}?, ??{\color{red}?}{\color{red}?} ??{\color{red}?}? ??{\color{red}?}??{\color{red}?} ?{\color{red}?}{\color{red}?}? ???{\color{red}?}???,

Even though formatting and diacritics are interspersed, XeLaTeX has no problem.

HTML renderers don't support this, but apparently this is done in Arabic printing sometimes to separate different kinds of diacritics.

---
Oren Watson (he/him)
orenwatson at tutanota.com

From dziewon at xs4all.nl Thu Oct 12 05:34:11 2023
From: dziewon at xs4all.nl (dziewon)
Date: Thu, 12 Oct 2023 12:34:11 +0200
Subject: Unsubscribe from mailing list
Message-ID:

Dear Sirs,

Would you please unsubscribe my husband Reinier C. Bakhuizen van den Brink from your mailing list. My husband passed away 2 years ago.

Yours faithfully,
E.M. Bakhuizen vd Brink

Sent from my Galaxy

From jamessmn46 at gmail.com Thu Oct 12 06:09:46 2023
From: jamessmn46 at gmail.com (slim hustla)
Date: Thu, 12 Oct 2023 07:09:46 -0400
Subject: Unicode Digest, Vol 116, Issue 16
In-Reply-To:
References:
Message-ID: <236A3F93-7718-42A1-829C-D7CB64C71214@gmail.com>

Don't unsubscribe

Sent from j

> On Oct 12, 2023, at 6:36 AM, unicode-request at corp.unicode.org wrote:
>
> Today's Topics:
>
> 1. Unsubscribe from mailing list (dziewon)

-------------- next part -------------- An embedded message was scrubbed...
From kent.b.karlsson at bahnhof.se Thu Oct 12 06:32:18 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 12 Oct 2023 13:32:18 +0200
Subject: Compatibility normalization (was: RE: Unicode encoding philosophy)
In-Reply-To:
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com>
Message-ID: <7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se>

> On 12 Oct 2023, at 00:37, Doug Ewell via Unicode wrote:
>
> Kent Karlsson wrote:
>
>>> Let's consider an equation that you'll probably recognize, font
>>> support willing: ??=????. Thanks to the power of Unicode, we could
>>> use it in the same plain-text document as, say, ??=???? while
>>> keeping both
>>
>> That's not really a proper way of representing math expressions.
>> For one thing, compatibility normalisation would ruin them (true,
>> one is not supposed to apply that, which I agree with, but it
>> sometimes is anyway).
>
> I see this claim from time to time, and not only from Kent: we must not use character (sequence) X, or must not use it in contrast with character (sequence) Y which is compatibility-equivalent to X, because some random, unknown process might surreptitiously apply NFKC or NFKD to the text, obliterating the distinction.
>
> Can Kent, or anyone else, please identify a *specific* program or process that does this?
>
> If there are no attested, real-world examples of processes actually applying NFKC or NFKD behind the user's back (which would indeed be evil), I'm likely to write this off as an urban myth.

It would be absolutely wonderful if it could (now) be written off, perhaps not as urban myth, but as old bugs. There have been even worse cases, removing “accents” on e.g. ??? (ICU even has support for such a mapping).

Just today, I saw a brand new(!) message where a single apostrophe (not the ASCII one) had somehow been automatically replaced by three(!) question marks, and likewise for some bullet point character (I don't know which one it was originally). So, while not NFKD/NFKC, that kind of “downgrading” change to text still happens.

Now, I do not have access to, let alone the ability to test, all the software in the world...

/Kent K

> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From kent.b.karlsson at bahnhof.se Thu Oct 12 06:42:34 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 12 Oct 2023 13:42:34 +0200
Subject: Unicode encoding philosophy
In-Reply-To:
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <5d0d2996-08c7-40c0-8732-5acf4325d57b@cateee.net> <1d6f95ea-a50d-008b-eae1-8139d59fff93@it.aoyama.ac.jp> <3853f461-7033-469a-a285-b107157d721c@cateee.net> <5ec55d60fe6f4b40a8f98f52d5c7449f@mail.gmail.com> <2fee5a33-d4e0-412a-907d-47954bdb174c@cateee.net>
Message-ID:

> On 12 Oct 2023, at 11:33, Oren Watson via Unicode wrote:
>
> replied off list, whoops.
>
> > words are just black boxes until rendering, we cannot format an accent in red with a black base character
>
> This depends on the layout system. XeLaTeX can absolutely do this. for example here in greek. the unicode text
> ?{\color{red}?} ??{\color{red}?}??? ???????{\color{red}?}?????? ??{\color{red}?}{\color{red}?}?????, ?{\color{red}?} ????????{\color{red}?}?, ??{\color{red}?}{\color{red}?} ??{\color{red}?}? ??{\color{red}?}??{\color{red}?} ?{\color{red}?}{\color{red}?}? ???{\color{red}?}???,

This is very much a grey area. That a character is, or can be split so as to have, a combining character does not necessarily mean that the combining character should be separately stylable.
To be nitpicking, and extremely strict, the combining marks in the text above combine with } as base character, not anything else, and } with a combining mark is not the end meta-bracket in (?)TeX? (You are fortunate that } does not canonically combine with any combining character; that is not the case with > (as used in HTML, XML), which does combine with a certain combining character?) /Kent K > Even though formatting and diacritics are interspersed, XeLaTeX has no problem. > > HTML renderers don't support this, but apparently this is done in Arabic printing sometimes to separate different kinds of diacritics. > > --- > Oren Watson (he/him) > orenwatson at tutanota.com > > > > > 11 Oct 2023, 11:51 by unicode at corp.unicode.org: > On 11 Oct 2023 10:28, Piotr Karocki via Unicode wrote: > Additionally some HTML tags are about formatting

, etc. > I disagree. > HTML is about text structure, not about formatting/rendering. >

is rendered differently for different output devices: monitor, printer,
> Braille 'display', narrator (text to voice), etc.
>
> Which font size? Which "version" of Braille? Which voice (male/female, accent)?, etc.
>
> In any case, you have a different definition of "formatting". Maybe we should stop using such a word, and use instead "plain text", "structure", "style", "rendering" (with lower risk of misinterpretation). You consider only the last two to be "formatting". I consider everything above "plain text" as "formatting". Two empty lines in an email mean /new paragraph/ and should be displayed as such. Should I really distinguish that from italic or bold (so using slash, or asterisk)? So depending on the application, we have different terminologies. Seldom must we distinguish it in so many steps. This group is one where the distinction is important.
>
> Note: Unicode Category Cf (Other, format) includes various "structure" characters (so as HTML and not CSS "features").
>
> Note: On Microsoft Windows, "Paste without Formatting" is mostly plain text, and some structure (new lines, lists) but not much more. Also a third definition of "formatting".
>
> Unicode should be used to specify character/glyph/sign, HTML to add text
> structure, and CSS used only to force rendering (so it should be used very
> rarely).
>
> I think that is Manichaean. It may be the aim, but human languages are too diverse and complex to create a perfect split of domains. But also thematically it is difficult (and artificial) to split in such a manner.
>
> Final rendering requires an additional step after "CSS", usually done by different engines: layout/shaping/font-rendering. And the Unicode Standard also touches this part. Interaction of characters is an important topic in the Unicode Standard: when to form ligatures and graphemes (and grapheme clusters), how to avoid them (ZWJ/ZWNJ, variation selectors etc.). Such rendering decisions are intrinsic to how scripts and languages are written (the topic of the Unicode Standard).
So, some styling decisions are made at the level of Unicode. On the other hand, some are not made at the Unicode level. (Ligatures: you may get them in a Roman font, but not in a typewriter font, and obviously we have more, and different ones, in cursive fonts.)
>
> Maybe we should see Unicode as the last step, so HTML (structure), CSS (rendering) and Unicode as glyph selection. Which is more in line with reality: words are just black boxes until rendering, we cannot format an accent in red with a black base character: the formatting stage, also in HTML, happens before the Unicode "Combining" category comes into play. (Unicode doesn't mandate a glyph, but it describes possible ligatures and real-world cases; the decision is off-loaded to font designers, but the infrastructure is in Unicode.)
>
> Note: I see what you want to tell us. I just think HTML/CSS cannot be a generic (for all languages/uses) markup language/style (and if we expand it for such a task, the outcome will become ugly). But again: a task for the future.
>
> ciao
> cate
>
> Appendix: some special cases about strict layering.
>
> Unicode has "forms" (as blocks and with variation selectors). Is it wrong to have them? (should be moved to CSS, but they do not have ideas about glyphs).
>
> Spaces and new lines are considered control characters (ok, "spaces" may have a double meaning). So already a strange case, but we can just interpret it as an "escape-like" "sequence" at a lower layer.
>
> Some box characters, and many technical symbols, require some formatting (alignment of lines in case of other symbols nearby [right/left/above/below and possible diagonals]). In such cases the semantics of the character impose strong requirements on structure and rendering. (And you can change the charmap, but you can likewise have a different font rendering/engine, also with very specific CSS.)
>
> SHY (Soft hyphen, U+00AD or HTML &shy;): is it structure? style? Glyph semantics?
>
> And semantics (which you just uncovered with your last paragraph) are also a problem.
For now, neither Unicode nor HTML nor CSS can style currency or numbers in my personal way: CSS doesn't know what currencies are (which part of the text). HTML doesn't mandate tagging them differently, and Unicode may just help by providing a small space character (but that is not so useful either).
>

From harjitmoe at outlook.com Thu Oct 12 12:53:26 2023
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Thu, 12 Oct 2023 18:53:26 +0100
Subject: Compatibility normalization
In-Reply-To: <7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se>
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se>
Message-ID:

On 12/10/2023 12:32, Kent Karlsson via Unicode wrote:
> It would be absolutely wonderful if it could (now) be written off,
> perhaps not as urban myth, but as old bugs. There have been even worse
> cases, removing "accents" on e.g. ??? (ICU even has support for such a
> mapping).

I believe that's a "best-fit" mapping, such as those used by Microsoft Windows.[1] The format of the files in that directory is a bit idiosyncratic and doesn't match any of the usual formats for legacy-encoding-to-Unicode files (particularly evident for the CJK ones); I'm inclined to presume that Microsoft basically supplied the source files which the Windows code pages themselves are built from. ICU's UCM format has built-in support for one-way mappings in either direction (Unicode-to-legacy or legacy-to-Unicode); the ICU project has UCMs generated for all of the Windows code pages[2], including those not included in the MAPPINGS/VENDORS collection on unicode.org.

To be clear, best-fit conversion mappings have nothing to do with NFKD (or NFKC) normalisation /per se/, although NFKD normalisation in particular can certainly be used to aid in generating them.
Note also that /any/ Unicode character not supported by the legacy encoding in question will either be best-fitted or substituted (with e.g. a question mark, katakana interpunct, geta mark, etc), irrespective of whether it has a compatibility decomposition.

(As a sidenote, however: it is also worth noting that, if one /must/ map some text with diacritics onto text in ISO Basic Latin letters (ASCII letters) for purposes beyond just fuzzy matching, it is usually better to use (with awareness of the language in use) an appropriate transcription scheme rather than just removing all diacritics; see German DIN 91379 for European languages[3], Vietnamese Telex[4], Gwoyeu Romatzyh for Mandarin tones[5], Revised Romanisation for Korean vowels[6], etc.)

[1] https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/
[2] https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
[3] https://en.wikipedia.org/wiki/DIN_91379#Normative_mapping_of_Latin_letters_to_basic_letters_(search_form)
[4] https://en.wikipedia.org/wiki/Telex_(input_method)
[5] https://en.wikipedia.org/wiki/Gwoyeu_Romatzyh
[6] https://en.wikipedia.org/wiki/Revised_Romanization_of_Korean

> Just today, I saw a brand new(!) message where single apostrophe (not
> the ASCII one) somehow had been automatically replaced by three(!)
> question marks, likewise for some bullet point character (don't know
> which one it was originally). So, while not NFKD/NFKC, that kind
> of "downgrading" changes to text still happen.

U+2019 is three bytes (0xE2 0x80 0x99) in UTF-8; again, nothing to do with normalisation, and something which would impact any non-ASCII character regardless of whether it has a compatibility decomposition.

--Har.
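A minimal Python sketch of the distinction drawn in the message above, using only the standard library. The sample characters are chosen for illustration, and the behaviour shown is that of Python's built-in codecs, not of ICU or the Windows best-fit tables; it is only one plausible way three question marks could arise from one apostrophe.

```python
import unicodedata

apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK, a common non-ASCII apostrophe

# U+2019 has no decomposition of any kind, so NFKD leaves it untouched.
assert unicodedata.decomposition(apostrophe) == ""
assert unicodedata.normalize("NFKD", apostrophe) == apostrophe

# It occupies three bytes in UTF-8.
utf8_bytes = apostrophe.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x99'

# Software that treats each byte as a separate invalid character
# produces three substitutes for the one apostrophe.
per_byte = utf8_bytes.decode("ascii", errors="replace")
print(len(per_byte))  # 3

# Substituting at the character level gives a single question mark.
per_char = apostrophe.encode("ascii", errors="replace")
print(per_char)  # b'?'

# Substitution also hits characters that do have a decomposition.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
print(e_acute.encode("ascii", errors="replace"))  # b'?'

# Dropping the diacritic is a separate, deliberate step:
# decompose first, then discard what ASCII cannot hold.
stripped = unicodedata.normalize("NFKD", e_acute).encode("ascii", errors="ignore")
print(stripped)  # b'e'
```

Nothing in the substitution paths consults normalisation data; only the last step does, and only because it was asked to.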
From pkar at ieee.org Thu Oct 12 13:29:30 2023
From: pkar at ieee.org (Piotr Karocki)
Date: Thu, 12 Oct 2023 20:29:30 +0200
Subject: Compatibility normalization
In-Reply-To:
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se>
Message-ID:

> if one must map some text with diacritics onto text in ISO Basic Latin
> letters (ASCII letters) for purposes beyond just fuzzy matching, it is
> usually better to use (with awareness of the language in use) an
> appropriate transcription scheme rather than just removing all diacritics;
> see German DIN 91379 for European languages[3], Vietnamese Telex[4],
> Gwoyeu Romatzyh for Mandarin tones[5], Revised Romanisation for Korean
> vowels[6], etc.)

Such a mapping is useful for creating filenames compatible with the POSIX portable filename character set. And that character set is meant for creating filenames that can be used in any existing file system.

From kent.b.karlsson at bahnhof.se Thu Oct 12 15:47:01 2023
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 12 Oct 2023 22:47:01 +0200
Subject: Compatibility normalization
In-Reply-To:
References: <42360a5b.29c6.18afbd3db38.Webtop.112@btinternet.com> <7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se>
Message-ID:

> On 12 Oct 2023, at 19:53, Harriet Riddle via Unicode wrote:
> ... Note also that any Unicode character not supported by the legacy encoding in question will either be best-fitted or substituted (with e.g. a question mark, katakana interpunct, geta mark, etc), irrespective of whether it has a compatibility decomposition.

There is a special-purpose character JUST for that case: SUBSTITUTE. It is available in just about all (not too ancient, computer-wise) encodings (even EBCDIC) except the most crazy ones. For unclear reasons, Unicode has a duplicate of that character: REPLACEMENT CHARACTER, with the disadvantage that that copy is only available in Unicode encodings.
SUBSTITUTE should display in a way that makes it clear that it is the SUBSTITUTE character, and not as some "ordinary" character, nor be left undisplayed (though ECMA-48 does not say so explicitly).

8.3.148 SUB - SUBSTITUTE
Notation: (C0)
Representation: 01/10
SUB is used in the place of a character that has been found to be invalid or in error. SUB is intended to be introduced by automatic means.

"Invalid" here would include "not available in the target encoding".

"Best-fit" is something that is very much in the eye of the beholder. If a programmer ("system", if you want to be person-neutral) thinks one fallback mapping is great, that greatness might not hold for the users (readers of the resulting text).

> (As a sidenote, however: it is also worth noting that, if one must map some text with diacritics onto text in ISO Basic Latin letters (ASCII letters) for purposes beyond just fuzzy matching,
>

I would hesitate to say that such a mapping is appropriate for fuzzy matching. "Fuzzy matching" should, in an ideal world, cover common spelling variants and common spelling mistakes (in the language in question). That often does not include "diacritics removal", which may easily result in a semantic change (or at least a horrible misspelling).

> U+2019 is three bytes (0xE2 0x80 0x99) in UTF-8...
>

Yes... But it is not at all clear that that is why I saw that result (I do not know what maneuvers the author did with that piece of text). Replacing a single apostrophe (likely introduced "by automatic means") with a single question mark (still not great) is something I have seen plenty of times.

/Kent K
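As a rough illustration of the SUB-versus-question-mark point, here is a small Python sketch. It assumes an ASCII target encoding, and the handler name "sub_c0" is hypothetical, invented for this example; Python's codecs do not ship such a handler.

```python
import codecs

# ECMA-48 writes code positions as column/row; 01/10 is column 1, row 10.
SUB = chr((1 << 4) | 10)  # U+001A SUBSTITUTE

def substitute_with_sub(error):
    """Emit one SUB for each character the target encoding cannot represent."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    return SUB * (error.end - error.start), error.end

# "sub_c0" is a hypothetical handler name chosen for this sketch.
codecs.register_error("sub_c0", substitute_with_sub)

text = "don\u2019t"
with_sub = text.encode("ascii", errors="sub_c0")
with_question_mark = text.encode("ascii", errors="replace")

print(with_sub)            # b'don\x1at'
print(with_question_mark)  # b'don?t'
```

With SUB, a later reader of the bytes can still tell a substitution from a question mark the author actually typed; with "?", that information is gone.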
From pkar at ieee.org Thu Oct 26 11:05:54 2023
From: pkar at ieee.org (Piotr Karocki)
Date: Thu, 26 Oct 2023 18:05:54 +0200
Subject: Unicode philosophy - technical symbols
Message-ID: <02600a0753453f4c8567c41ff927c488@mail.gmail.com>

I have another question, related to Unicode philosophy (not "Unicode encoding philosophy", but more general "Unicode philosophy" :) ).

We already adopted e.g. Symbols for Legacy Computing (1FB00-1FBFF), Emoji, hieroglyphs; why not include all ISO 7000 and IEC 60417 symbols?

ISO 7000 / IEC 60417 Graphical symbols for use on equipment.
https://www.iso.org/obp/ui#iso:pub:PUB400008:en

---8<---
Piotr Karocki

From sosipiuk at gmail.com Thu Oct 26 12:16:19 2023
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Thu, 26 Oct 2023 17:16:19 +0000
Subject: Unicode philosophy - technical symbols
In-Reply-To: <02600a0753453f4c8567c41ff927c488@mail.gmail.com>
References: <02600a0753453f4c8567c41ff927c488@mail.gmail.com>
Message-ID: <1698339925410.1815319897.725157918@gmail.com>

I believe this is because those symbols are not meant to be included in runs of "text" but are designed to be "placed on equipment". This MIGHT be a valid argument if we didn't already have things like the pause and play buttons in Unicode.

There's a long history of "anti-precedent" in Unicode decisions, where some set of characters/symbols get included, then a similar set is denied and is deemed to be different for hyperspecific, microscopic reasons, or the previous inclusion is outright dismissed as a historical mistake which "unfortunately" cannot be remedied. Then the same thing happens again, and again, and it feels like all the fancy words and reasoning are there just to hide the fluctuating mood of the proposal reviewers on any specific day.

You should make sure your proposal gets reviewed immediately after lunch. Studies have shown people are more generous then.
;-) On Thursday, 26 October 2023, 12:05:54 (-04:00), Piotr Karocki via Unicode wrote: > I have another question, related to Unicode philosophy (not "Unicode > encoding philosophy", but more general "Unicode philosophy" :) ). > > We already adopted e.g. Symbols for Legacy Computing (1FB00-1FBFF), Emoji, > hieroglyphs; why not include all ISO 7000 and IEC 60417 symbols? > > ISO 7000 / IEC 60417 Graphical symbols for use on equipment. > https://www.iso.org/obp/ui#iso:pub:PUB400008:en > > > ---8<--- > Piotr Karocki > From doug at ewellic.org Thu Oct 26 12:25:12 2023 From: doug at ewellic.org (Doug Ewell) Date: Thu, 26 Oct 2023 17:25:12 +0000 Subject: Unicode philosophy - technical symbols In-Reply-To: <02600a0753453f4c8567c41ff927c488@mail.gmail.com> References: <02600a0753453f4c8567c41ff927c488@mail.gmail.com> Message-ID: Piotr Karocki wrote: > We already adopted e.g. Symbols for Legacy Computing (1FB00-1FBFF), > Emoji, hieroglyphs; why not include all ISO 7000 and IEC 60417 > symbols? (unofficially) First, ISO 7000/IEC 60417 is not freely available. Although the symbols themselves are available on the OBP, one must pay ISO or a member body for the standard itself. Second, Unicode is not generally a symbol encoding standard. It has traditionally been a requirement that symbols proposed for Unicode be shown to occur embedded in plain text. Although many symbols already in Unicode, especially those encoded in the early days (e.g. Dingbats), do not seem to meet this criterion, the requirement exists now and proposals today are bound by it. Many of the ISO 7000/IEC 60417 symbols do not satisfy this requirement. By contrast, all of the Symbols for Legacy Computing had already appeared (by definition) in computing environments as part of plain text. The clearly expressed need was to transcode text between legacy computing environments and Unicode. Legacy Computing Symbols are not, and were never, intended as a precedent for all manner of standalone symbols to be encoded. 
The proposers, Script Ad Hoc, and Unicode Technical Committee all agreed to that. (And if you think the Legacy Computing Symbols proposals sailed through the committees, with little or no opposition, think again.)

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From doug at ewellic.org Thu Oct 26 12:40:32 2023
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 26 Oct 2023 17:40:32 +0000
Subject: Unicode philosophy - technical symbols
In-Reply-To:
References: <02600a0753453f4c8567c41ff927c488@mail.gmail.com>
Message-ID:

Additionally, emoji, for all my longstanding concerns about them, are very deliberately intended to appear alongside, and as an integral part of, plain text.

Egyptian hieroglyphs were very much a natural-language writing system, every bit as much as Latin or Greek letters are today. Frankly, I'm troubled that anyone would cite them as a precedent for encoding technical symbols.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

-----Original Message-----
From: Unicode On Behalf Of Doug Ewell via Unicode
Sent: Thursday, October 26, 2023 11:25
To: Piotr Karocki ; unicode at corp.unicode.org
Subject: RE: Unicode philosophy - technical symbols

Piotr Karocki wrote:

> We already adopted e.g. Symbols for Legacy Computing (1FB00-1FBFF),
> Emoji, hieroglyphs; why not include all ISO 7000 and IEC 60417
> symbols?

From pkar at ieee.org Thu Oct 26 13:33:31 2023
From: pkar at ieee.org (Piotr Karocki)
Date: Thu, 26 Oct 2023 20:33:31 +0200
Subject: Unicode philosophy - technical symbols
In-Reply-To:
References: <02600a0753453f4c8567c41ff927c488@mail.gmail.com>
Message-ID: <2e0001e7a43d5b266689017201a3e1b6@mail.gmail.com>

Ad first.
ISO 7000 is not freely available, but Unicode also is ISO/IEC standard 10646, for 208 CHF (it is not exact equivalent, but nevertheless...).
https://www.iso.org/standard/76835.html

Ad second.
Many of these symbols appear in text - e.g. clothing labels, instructions/user guides for technical equipment, cargo manifests, etc.
Probably not all symbols, of course, but it seems that including all symbols is much easier than making selection of symbols to be included :) -----Original Message----- From: Doug Ewell [mailto:doug at ewellic.org] Sent: Thursday, 26 October 2023 19:25 To: Piotr Karocki; unicode at corp.unicode.org Subject: RE: Unicode philosophy - technical symbols Piotr Karocki wrote: > We already adopted e.g. Symbols for Legacy Computing (1FB00-1FBFF), > Emoji, hieroglyphs; why not include all ISO 7000 and IEC 60417 > symbols? (unofficially) First, ISO 7000/IEC 60417 is not freely available. Although the symbols themselves are available on the OBP, one must pay ISO or a member body for the standard itself. Second, Unicode is not generally a symbol encoding standard. It has traditionally been a requirement that symbols proposed for Unicode be shown to occur embedded in plain text. Although many symbols already in Unicode, especially those encoded in the early days (e.g. Dingbats), do not seem to meet this criterion, the requirement exists now and proposals today are bound by it. Many of the ISO 7000/IEC 60417 symbols do not satisfy this requirement. By contrast, all of the Symbols for Legacy Computing had already appeared (by definition) in computing environments as part of plain text. The clearly expressed need was to transcode text between legacy computing environments and Unicode. Legacy Computing Symbols are not, and were never, intended as a precedent for all manner of standalone symbols to be encoded. The proposers, Script Ad Hoc, and Unicode Technical Committee all agreed to that. (And if you think the Legacy Computing Symbols proposals sailed through the committees, with little or no opposition, think again.) 
--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From doug at ewellic.org Thu Oct 26 13:42:55 2023
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 26 Oct 2023 18:42:55 +0000
Subject: Unicode philosophy - technical symbols
In-Reply-To: <2e0001e7a43d5b266689017201a3e1b6@mail.gmail.com>
References: <02600a0753453f4c8567c41ff927c488@mail.gmail.com> <2e0001e7a43d5b266689017201a3e1b6@mail.gmail.com>
Message-ID:

Piotr Karocki wrote:

> Ad first.
> ISO 7000 is not freely available, but Unicode also is ISO/IEC standard
> 10646, for 208 CHF (it is not exact equivalent, but nevertheless...).
> https://www.iso.org/standard/76835.html

Try searching for "10646" here:
https://standards.iso.org/ittf/PubliclyAvailableStandards/

> Ad second.
> Many of these symbols appear in text - e.g. clothing labels,
> instructions/user guides for technical equipment, cargo manifests,
> etc.
> Probably not all symbols, of course, but it seems that including
> all symbols is much easier than making selection of symbols to be
> included :)

The encoding process is not necessarily oriented toward simplifying the job of the proposer. (Not to suggest it is oriented toward complicating their job, either.)

If you want to propose these symbols, and you think you can persuade the committees, go ahead and propose them. But please do not cite Legacy Computing Symbols or Egyptian hieroglyphs as if they were analogous. They are not.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From asmusf at ix.netcom.com Thu Oct 26 16:13:38 2023
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 26 Oct 2023 21:13:38 +0000
Subject: Unicode philosophy - technical symbols
Message-ID: <8a456ff6-5a1d-0e75-2a49-48046d3a924e@ix.netcom.com>

The first thing the people reviewing a proposal will ask for is examples showing the proposed characters in running text. It's useless to submit anything without such samples.
Even with clear evidence it can be an uphill battle to convince the reviewers that certain symbols are characters that should be encoded instead of embedded graphical elements. The more diverse and "text-like" the samples are, the higher the chance of making your case with success. But without samples, there's no need even trying. Simple as that. A./ From wjgo_10009 at btinternet.com Fri Oct 27 02:59:30 2023 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 27 Oct 2023 08:59:30 +0100 (BST) Subject: Unicode philosophy - technical symbols In-Reply-To: <8a456ff6-5a1d-0e75-2a49-48046d3a924e@ix.netcom.com> References: <8a456ff6-5a1d-0e75-2a49-48046d3a924e@ix.netcom.com> Message-ID: <3ed3361e.2fa8.18b70253d7a.Webtop.83@btinternet.com> Asmus Freytag wrote as follows. > But without samples, there's no need even trying. Perhaps that is the issue that needs addressing. Could all the symbols in the document ISO 7000 / IEC 60417 Graphical symbols for use on equipment be encoded in the order given, in plane 5? No samples needed, the fact that the symbols are published in an ISO / IEC document being enough evidence for encoding. Using a new policy made in 2023 as a result of discussions in this discussion thread. If not, why not? Please discuss if you choose to do so. For an application example, suppose that someone is designing the artwork for the manufacture of the front panel of a piece of equipment. Best regards, William Overington Friday 27 October 2023 -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Fri Oct 27 08:06:05 2023 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 27 Oct 2023 13:06:05 +0000 Subject: Unicode philosophy - technical symbols Message-ID: At the risk of being repetitive. The fact that a symbol is cataloged in some list is itself not sufficient reason to consider it a text element in plain text. Which would be a necessary requirement for encoding. 
Many things have been invented over time without having seen use as part of writing (in running text). My personal view is that lists (or standards) are useful tools for two related purposes. They can help in identifying the symbol and in distinguishing it from similar symbols. Something that may be difficult from textual evidence alone. The other relates to small subsets of symbols that constitute a series covering a range or pair of values. Such as on/off, fast/slow or open/close. In such cases, I would support the use of a list to motivate the encoding of both even if one tends to be less common to the point that locating an actual instance in text is not equally successful for the full range (or both items in pair). The presumption in such a case is that the inability to locate an actual text sample is accidental and not a reflection of the need to treat them differently for encoding. In neither of these scenarios does a listing substitute for demonstrating that these symbols are used in text. A./ -----Original Message----- From: William_J_G Overington Sent: Oct 27, 2023 10:06 AM To: Subject: RE: Unicode philosophy - technical symbols Asmus Freytag wrote as follows. > But without samples, there's no need even trying. Perhaps that is the issue that needs addressing. Could all the symbols in the document ISO 7000 / IEC 60417 Graphical symbols for use on equipment (https://www.iso.org/obp/ui#iso:pub:PUB400008:en) be encoded in the order given, in plane 5? No samples needed, the fact that the symbols are published in an ISO / IEC document being enough evidence for encoding. Using a new policy made in 2023 as a result of discussions in this discussion thread. If not, why not? Please discuss if you choose to do so. For an application example, suppose that someone is designing the artwork for the manufacture of the front panel of a piece of equipment. 
Best regards,

William Overington

Friday 27 October 2023

From wjgo_10009 at btinternet.com Sat Oct 28 02:14:34 2023
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 28 Oct 2023 08:14:34 +0100 (BST)
Subject: Unicode philosophy - technical symbols
In-Reply-To:
References:
Message-ID: <5cf1eecb.35b0.18b752276e7.Webtop.83@btinternet.com>

Asmus Freytag wrote as follows.

> The fact that a symbol is cataloged in some list is itself not
> sufficient reason to consider it a text element in plain text. Which
> would be a necessary requirement for encoding.

Yet it is not just "some list", it is an ISO/IEC list.

Yet why is considering a symbol as a text element in plain text a necessary requirement for encoding? Apart from that rule being the existing rule that was made at some time in the past, possibly under different circumstances than those that exist now.

Is that rule limiting progress?

Suppose please, for example, that someone is using a desktop publishing program to produce a document, an instruction manual for a piece of equipment, the document initially stored in a proprietary file format, with the person intending to export the text in a PDF document.

One frameful of text may perhaps start with "Please consider the symbol in Figure 1 ..." and another frameful of text may show the symbol together with a text caption and text stating that it is Figure 1.

Is it reasonable that the symbol is encoded into Unicode as a character, notwithstanding that it is not actually in a run of text characters? Plane 5 is currently empty, why not use it?

William Overington

Saturday 28 October 2023
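For readers unsure where "plane 5" sits relative to the blocks named in this thread, a short Python sketch of the arithmetic follows. The helper names plane_of and plane_range are made up for this illustration, and the final check only reflects whatever Unicode version is bundled with the Python interpreter running it.

```python
import unicodedata

PLANE_SIZE = 0x10000  # each plane holds 65,536 code points

def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) containing a code point."""
    return code_point // PLANE_SIZE

def plane_range(plane: int) -> tuple[int, int]:
    """Return the first and last code point of a plane."""
    start = plane * PLANE_SIZE
    return start, start + PLANE_SIZE - 1

# Symbols for Legacy Computing (1FB00-1FBFF) lives in plane 1.
print(plane_of(0x1FB00))  # 1

first, last = plane_range(5)
print(f"U+{first:X}..U+{last:X}")  # U+50000..U+5FFFF

# "Cn" means unassigned in the Unicode data this Python was built with.
print(unicodedata.unidata_version, unicodedata.category(chr(first)))
```

This only shows that code space is available; it says nothing about whether the encoding criteria discussed in the thread are met.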
From list+unicode at jdlh.com Sat Oct 28 04:26:21 2023
From: list+unicode at jdlh.com (Jim DeLaHunt)
Date: Sat, 28 Oct 2023 02:26:21 -0700
Subject: Unicode philosophy - technical symbols
In-Reply-To: <5cf1eecb.35b0.18b752276e7.Webtop.83@btinternet.com>
References: <5cf1eecb.35b0.18b752276e7.Webtop.83@btinternet.com>
Message-ID:

On 2023-10-28 00:14, William_J_G Overington via Unicode wrote:

> Asmus Freytag wrote as follows.
>
> > The fact that a symbol is cataloged in some list is itself not
> > sufficient reason to consider it a text element in plain text. Which
> > would be a necessary requirement for encoding.
>
> Yet it is not just "some list", it is an ISO/IEC list.
>
> Yet why is considering a symbol as a text element in plain text a
> necessary requirement for encoding? Apart from that rule being the
> existing rule that was made at sometime in the past, possibly under
> different circumstances than those that exist now.

It makes perfect sense to me that the Unicode standard exercises restraint in its role in the ecosystem. And being a "plain text encoding" seems like a very helpful kind of restraint.

> Is that rule limiting progress?

Most such decisions are tradeoffs. Most such decisions occur in the context of an ecosystem which includes fonts, text layout software, shaping engines, input methods, operating systems, user comprehension, and more. Most decisions impose various costs on various parts of the ecosystem. So the question to ask is, will such an addition, in context, lead to benefits which outweigh the costs, and have an advantage over alternatives?

> Suppose please, for example, that someone is using a desktop
> publishing program to produce a document, an instruction manual for a
> piece of equipment, the document initially stored in a proprietary
> file format, with the person intending to export the text in a PDF
> document.
>
> One frameful of text may perhaps start with "Please consider the symbol in Figure 1 ..."
and another frameful of text may show the symbol together with a text caption and text stating that it is Figure 1.

I think that this is a weak example, because a desktop publishing program has an alternative way to display the symbol in Figure 1: as a graphic. And the PDF format has comprehensive ways to represent graphics as well as text.

In North America, there used to be a brand of hand tools called Craftsman. They had a lifetime unconditional replacement guarantee. The joke used to be that "any Craftsman tool can be used as a hammer". If you broke your Craftsman screwdriver while banging nails in with the handle, get it replaced. By the same token, these discussions make me think that some people believe that any mark on a page should be made using character and text mechanisms, rather than graphics or other mechanisms which might be more appropriate.

> Is it reasonable that the symbol is encoded into Unicode as a
> character, notwithstanding that it is not actually in a run of text
> characters? Plane 5 is currently empty, why not use it?

IMHO, no, it is not reasonable. It is a small benefit, given the easy alternative of representing symbols as graphics, and that good layout tools can embed graphics in runs of text. Lack of Unicode scalar values is not the constraint. The discussion of encoding this symbol, and all the infinity of other symbols which can be justified the same way, has an opportunity cost in UTC decision bandwidth. The benefit of encoding the symbol is not unlocked until font makers add the symbol to their fonts, and users update the fonts in their systems. There is a burden on font makers to add the symbol, on shaping engines to handle the symbol, on input methods to find a way to input that symbol, on users to learn that this symbol exists, and so on. And many users will not learn that the symbol exists, so will get no benefit.

Put down your Craftsman screwdriver, and learn to use the hammer to drive nails.
>
> William Overington
>
> Saturday 28 October 2023

--
. --Jim DeLaHunt, jdlh at jdlh.com http://blog.jdlh.com/ (http://jdlh.com/)
multilingual websites consultant, Vancouver, B.C., Canada

From jameskass at code2001.com Sat Oct 28 06:01:17 2023
From: jameskass at code2001.com (James Kass)
Date: Sat, 28 Oct 2023 11:01:17 +0000
Subject: Unicode philosophy - technical symbols
In-Reply-To:
References: <5cf1eecb.35b0.18b752276e7.Webtop.83@btinternet.com>
Message-ID: <70793c5a-4a26-4742-a750-059bd1dfb5b0@code2001.com>

On 2023-10-28 9:26 AM, Jim DeLaHunt via Unicode wrote:
> Put down your Craftsman screwdriver, and learn to use the hammer to
> drive nails.

Craftsman also makes nail guns.

https://www.iso.org/obp/ui#iso:pub:PUB400008:en

Apparently nobody has made any effort to make a cross-mapping between these symbols and Unicode characters. Looking at those symbols, this isn't surprising.

Anyone contemplating submitting an encoding proposal for this set in spite of the kindly advice offered in this thread would need to inspect each of the symbols to determine if any of them can be unified with existing Unicode characters.

The symbol shown at 0100 is unifiable with U+267F. Now you're off to a good start.

(The recently updated Mozilla Thunderbird substituted an emoji for the U+267F character which was pasted into this message. So I deleted it.)
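A small Python sketch of the two points in this last message: starting a cross-mapping table, and asking for text rather than emoji presentation of U+267F. The dictionary name and its single entry are illustrative only (the entry comes from the message above). Whether a given mail client honours the variation selector is an assumption that would need testing, and the sequence itself is worth checking against the Unicode emoji variation sequence data before relying on it.

```python
import unicodedata

wheelchair = "\u267F"
print(unicodedata.name(wheelchair))  # WHEELCHAIR SYMBOL

# VS15 requests text presentation; VS16 requests emoji presentation.
VS15 = "\uFE0E"  # VARIATION SELECTOR-15
VS16 = "\uFE0F"  # VARIATION SELECTOR-16

# Appending VS15 asks renderers for the text-style glyph.
text_style = wheelchair + VS15
print([f"U+{ord(c):04X}" for c in text_style])  # ['U+267F', 'U+FE0E']

# Hypothetical start of a cross-mapping table, keyed by the symbol
# number as shown in the ISO online browsing platform listing.
iso7000_to_unicode = {
    "0100": wheelchair,
}
for number, char in iso7000_to_unicode.items():
    print(number, f"U+{ord(char):04X}", unicodedata.name(char))
```

Every further entry would need the kind of symbol-by-symbol inspection the message describes.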