From wjgo_10009 at btinternet.com Fri Dec 4 06:30:32 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 4 Dec 2020 12:30:32 +0000 (GMT)
Subject: A workaround for using colour fonts in some application programs that do not support colour fonts
Message-ID: <36a93647.41c.1762dbb82d0.Webtop.218@btinternet.com>

Hi

Some readers might like to know of a workaround that I have devised for using colour fonts in some application programs that do not support colour fonts.

The technique works because the Unicode code point for each character is exactly the same whether the character is displayed in a colour font or in a monochrome font.

The technique is to compose the design using the application program, the characters appearing in plain monochrome form, then export as an svg file without selecting the option to convert the text to curves. The svg file is then displayed using an application program that does support colour fonts.

This works simply because the application program places the Unicode character code points in the svg file and those code points are successfully used by the application program that supports colour fonts.

For example, I started with Serif Affinity Publisher, which at present does not support colour fonts, produced an svg file without converting the text to curves, displayed the svg file using Microsoft Edge, made a 'print screen' image, then trimmed out the browser window parts using Microsoft Paint and saved the result as a png file.

The technique has been found to work with Affinity Publisher, Affinity Designer and two legacy Serif products, PagePlus and CraftArtist2.

Please find attached a graphic made by me using Affinity Publisher, Microsoft Edge, Microsoft Paint and the Playbox colour font designed and kindly supplied free with a licence by Matt Lyon.
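The export step described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names `make_svg` and `text_runs` are hypothetical, the font-family value "Playbox" is assumed to match the installed font's family name, and the markup is a hand-built minimal svg, not what Affinity Publisher actually writes. It shows that text kept as text (not converted to curves) sits in the file as ordinary code points, which any colour-font-capable viewer can then render.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_svg(text, font_family):
    """Build a tiny svg whose text is stored as characters, not outlines."""
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": "600", "height": "120", "viewBox": "0 0 600 120"},
    )
    node = ET.SubElement(
        svg,
        f"{{{SVG_NS}}}text",
        {"x": "20", "y": "80", "font-family": font_family, "font-size": "64"},
    )
    node.text = text
    return ET.tostring(svg, encoding="unicode")


def text_runs(svg_source):
    """Return the character content of every <text> element in the svg."""
    root = ET.fromstring(svg_source)
    return [el.text or "" for el in root.iter(f"{{{SVG_NS}}}text")]


svg = make_svg("PLAYBOX", "Playbox")  # "Playbox" is an assumed family name
print(svg)
# The same code points are present whichever font ends up drawing them.
print([f"U+{ord(c):04X}" for c in text_runs(svg)[0]])
```

Saving the printed markup to a file and opening it in a browser with the colour font installed would be the equivalent of the "display the svg file" step.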
https://forum.affinity.serif.com/index.php?/topic/128285-colour-fonts-and-affinity-products/

William Overington

Friday 4 December 2020
-------------- next part --------------
A non-text attachment was scrubbed...
Name: playbox_in_publisher.png
Type: image/png
Size: 40979 bytes
Desc: not available

From christian.kleineidam at gmail.com Fri Dec 11 06:57:23 2020
From: christian.kleineidam at gmail.com (Christian Kleineidam)
Date: Fri, 11 Dec 2020 13:57:23 +0100
Subject: Italics get used to express important semantic meaning, so unicode should support them

In the FAQ on Ligatures it's written "The mathematical letters and digits are meant to be used only in mathematics, where the distinction between a plain and a bold letter is fundamentally semantic rather than stylistic." This suggests that the spirit of Unicode includes the intention to be able to represent semantic meaning.

On Wikidata, we have the open problem of what to do with academic articles that have italics in their official title. We store for example the paper https://www.wikidata.org/wiki/Q33988883 which, according to what the publisher writes on http://www.biochemsoctrans.org/content/33/4/582, has italics as part of its proper name. In Wikidata we want to be able to store the semantic meaning.

This gives us the choice to either list the paper as "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT haplotype to Homo sapiens" or as "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠", which uses the mathematical characters against recommendations, while the website lists it as "Evidence suggesting that _Homo neanderthalensis_ contributed the H2 _MAPT_ haplotype to _Homo sapiens_". In scientific articles like that, the ability to represent italics is needed to express all of the semantic meaning that's contained in the title.
In contrast to properties like font-size, italics start to be used in the real world to express semantic meaning. For a project like Wikidata that cares about storing the semantic meaning of the title of an academic paper, that makes unicode problematic, as it leads us to lose information.

You might say that if unicode doesn't serve the needs of Wikidata to store the semantic content of the texts we care about, we should add additional formatting on top of unicode. Between RTF, Markdown, SGML, HTML, XML and Wikitext there are multiple different formats we could use on Wikidata to potentially represent italics. If we chose any one of them, however, that would make it harder for data-reusers who use another format to interact with our data, as they would need to run a parser over the data, which increases their code complexity.

Official style guidelines like the Chicago Manual of Style (17th edition) specify that certain italics should be used to express certain semantic meaning:

22.1.3 Other Types of Names
Other types of names also follow specific patterns for capitalization, and some require italics.

22.2.1 Foreign-Language Terms
Italicize isolated words and phrases in foreign languages likely to be unfamiliar to readers of English, and capitalize them as in their language.

22.3.2.1 ITALICS.
Italicize the titles of most longer works, including the types listed here. An initial the should be roman and lowercase before titles of periodicals, or when it is not considered part of the title. For parts of these works and shorter works of the same type, see 22.3.2.2.

The inability to follow the recommendations of the Chicago Manual of Style to express semantic meaning in italics means that unicode fails in its mission to be able to express all semantic distinctions. This means that it's technically impossible to follow the Chicago Manual of Style in code comments of programming code that are in unicode.
Outside of specialized needs like those of Wikidata and programmers who might want to follow the Chicago Manual of Style, the inability of unicode to represent italics and bold text makes life harder for average users as well. Web browsers can't offer their users the ability to format a part of the text as italics or bold. As a result many users don't know how to italicize or bold text when they write online, as different websites use different standards. Many online systems break WYSIWYG for italics and bold, which makes it harder for non-technical users to use them to express themselves.

If Unicode supported italics and bold, the browser could make it easy for users to have italics or boldness. Even smartphones would have the option to offer a user to italicize or bold a text in the menu that currently allows copying and pasting.

Websites like https://yaytext.com/bold-italic/ get used by users to express themselves in italics and bold on platforms like Facebook and Twitter that use Unicode without additional formatting. Having to use the unofficial workaround of mathematical letters is undesirable because it means that software like screen readers is less likely to interact well with the resulting text.

Proposal of a solution:

In today's usage italics often have semantic meaning. There are many cases where it's desirable that a user can express such meaning but where there's no intention to give the user control over features such as font size that the user gets when HTML or RTF is used as format.

With the symbol for Right-to-Left text there's a precedent in unicode for having signs that manipulate multiple following characters. At the time of the design italics weren't used for expressing fundamentally semantic meaning such as "Homo neanderthalensis" referring to a species as it's used in the title of the above paper.
Create a new unicode character for begin/end italic formatting and begin/end bold formatting that works like the unicode character for the Right-to-Left switch.

From kenwhistler at sonic.net Fri Dec 11 12:42:52 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Fri, 11 Dec 2020 10:42:52 -0800
Subject: Italics get used to express important semantic meaning, so unicode should support them

On 12/11/2020 4:57 AM, Christian Kleineidam via Unicode wrote:
> Create a new unicode character for begin/end italic formatting and
> begin/end bold formatting that works like the unicode character for
> the Right-to-Left switch.

<i> ... </i> and <b> ... </b>

Yeah, they are sequences of 3 (or 4) existing characters, and not single code points, but they accomplish what you are asking for and they work everywhere on the web already.

Nobody would thank you for introducing yet *another* form of scoped markup for the same effects that would take years to be picked up (inconsistently) in thousands of implementations, and which would introduce yet more possibilities for conflicts in dueling schemes for markup in text.

--Ken

From kilobyte at angband.pl Fri Dec 11 13:42:01 2020
From: kilobyte at angband.pl (Adam Borowski)
Date: Fri, 11 Dec 2020 20:42:01 +0100
Subject: Italics get used to express important semantic meaning, so unicode should support them
Message-ID: <20201211194201.GA5630@angband.pl>

On Fri, Dec 11, 2020 at 10:42:52AM -0800, Ken Whistler via Unicode wrote:
> On 12/11/2020 4:57 AM, Christian Kleineidam via Unicode wrote:
> > Create a new unicode character for begin/end italic formatting and
> > begin/end bold formatting that works like the unicode character for the
> > Right-to-Left switch.
>
> <i> ... </i> and <b> ... </b>
> > Yeah, they are sequences of 3 (or 4) existing characters, and not single > code points, but they accomplish what you are asking for and they work > everywhere on the web already. > > Nobody would thank you for introducing yet *another* form of scoped markup > for the same effects that would take years to be picked up (inconsistently) > in thousands of implementations, and which would introduce yet more > possibilities for conflicts in dueling schemes for markup in text. And, despite the original recommendation, enough people use math characters for that, so even Google considers them equivalent to basic ASCII. So just: echo 'Homo sapiens'|tran italic and 'ere you go. ?! -- ??????? Latin: meow 4 characters, 4 columns, 4 bytes ??????? Greek: ???? 4 characters, 4 columns, 8 bytes ??????? Runes: ???? 4 characters, 4 columns, 12 bytes ??????? Chinese: ? 1 character, 2 columns, 3 bytes <-- best! From doug at ewellic.org Fri Dec 11 16:38:07 2020 From: doug at ewellic.org (Doug Ewell) Date: Fri, 11 Dec 2020 15:38:07 -0700 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: Message-ID: <000301d6d00e$4d244330$e76cc990$@ewellic.org> Christian Kleineidam wrote: > "Evidence suggesting that ???? ???????????????? contributed the H2 > ???? haplotype to ???? ???????" "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT haplotype to Homo sapiens" This title is completely meaningful in plain text. The convention to style the names of species and haplotypes in italics is just that, a styling convention. > Between RTF, Markdown, SGML, HTML, XML and Wikitext there are multiple > different formats we could use on Wikidata to potentially represent > italics. 
If we would however choose any one of them that would make it > harder for data-reusers who use another format to interact with our > data as they would need to run a parser over the data which increases > their code complexity and makes it harder to interact with our data. https://xkcd.com/927/ > The inability to follow the recommendations of the Chicago Manual of > Style to express semantic meaning in italics means that unicode fails > in it's mission to be able to express all semantic distinctions. This > means that it's technically impossible to follow the Chicago Manual of > Style in code comments of programming code that are in unicode. Style guides such as Chicago and AP and MLA cover many stylistic realms beyond this. They tell the writer how to indent certain passages and what sort of contrastive font faces and sizes should be used for quotations and how tables should be laid out. None of this is within the scope of a plain-text encoding standard either. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From richard.wordingham at ntlworld.com Fri Dec 11 17:19:08 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 11 Dec 2020 23:19:08 +0000 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: Message-ID: <20201211231908.29035298@JRWUBU2> On Fri, 11 Dec 2020 13:57:23 +0100 Christian Kleineidam via Unicode wrote: > At the time of the design italics weren't used for > expressing fundamentally semantic meaning such as "Homo > neanderthalensis" referring to a a species as it's used in the title > of the above paper. I just looked in a 1969 reprint of a school biology textbook published in 1966. It consistently italicises generic names such as _Drosophila_ within sentences, so I find your claim hard to credit. Of course, typewritten materials had to resort to underlining to indicate italicisation in such cases. 
I think I've seen such usage, but my memory may not be reliable.

Richard.

From jameskass at code2001.com Fri Dec 11 19:41:31 2020
From: jameskass at code2001.com (James Kass)
Date: Sat, 12 Dec 2020 01:41:31 +0000
Subject: Italics get used to express important semantic meaning, so unicode should support them

On 2020-12-11 12:57 PM, Christian Kleineidam via Unicode wrote:
> This suggests that the spirit of Unicode includes the intention to be able
> to represent semantic meaning.

That's the spirit!

The topic of italics in Unicode was last discussed extensively on this list in January of 2019, bleeding into February.
https://unicode.org/mail-arch/unicode-ml/y2019-m01/

As Adam Borowski points out, enough people are using the math alphanumerics that we have a 𝑑𝑒 𝑓𝑎𝑐𝑡𝑜 method.

From indolering at gmail.com Fri Dec 11 22:14:08 2020
From: indolering at gmail.com (Zach Lym)
Date: Fri, 11 Dec 2020 20:14:08 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)

I have been tracking down the rationale behind the normalization choices in filesystems. One trouble spot for implementers is interpreting strict logician terminology paired with imprecise pseudo code. Take the definition of Unicode's caseless matching algorithm [D145]:

> A string X is a canonical caseless match for a string Y if and only if:
> NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))

The W3C Canonical Case Fold Normalization algorithm claims to be compatible with [D145], but uses NFC in the last step [w3c-charmod-norm], leading to an apparent contradiction. Even though Unicode explains that "case folding is closed under canonical normalization" it took me a long time to find that passage and convince myself that the W3C and Unicode matching algorithms are equivalent. I am not alone: *Linux kernel hackers couldn't figure it out either* [linux-norm]!
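The equivalence discussed above can be tried out with a small Python sketch. It is illustrative only: `canonical_caseless_key` and `canonical_caseless_match` are hypothetical helper names, Python's `str.casefold()` is used as a stand-in for toCasefold, and the results depend on the Unicode data bundled with the interpreter. The final normalization step is a parameter, so the D145 form (NFD last) and the W3C-style form (NFC last) can be compared on the same inputs.

```python
import unicodedata


def canonical_caseless_key(s, final_form="NFD"):
    """Compute NFx(toCasefold(NFD(s))), with NFx chosen by final_form."""
    decomposed = unicodedata.normalize("NFD", s)
    folded = decomposed.casefold()  # full case folding, e.g. ß -> ss
    return unicodedata.normalize(final_form, folded)


def canonical_caseless_match(x, y, final_form="NFD"):
    """True if x and y match caselessly under the chosen final form."""
    return canonical_caseless_key(x, final_form) == canonical_caseless_key(
        y, final_form
    )


pairs = [
    ("\u00C5ngstr\u00F6m", "A\u030Angstro\u0308m"),  # precomposed vs. decomposed
    ("\u212B", "\u00E5"),                            # ANGSTROM SIGN vs. å
    ("Stra\u00DFe", "STRASSE"),                      # ß folds to ss
    ("resume", "r\u00E9sum\u00E9"),                  # genuinely different
]

for x, y in pairs:
    with_nfd = canonical_caseless_match(x, y, "NFD")
    with_nfc = canonical_caseless_match(x, y, "NFC")
    # Both variants should agree, since the same form is applied to both sides.
    print(ascii(x), ascii(y), with_nfd, with_nfc)
```

A handful of sample pairs is evidence, not a proof; the point is only that swapping the last step between NFD and NFC does not change which pairs match.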
I was originally going to propose additions to D145's textual description, cross-references to the implementation section, and adding discussion of W3C charmod-norm. However, I don't think this would help as the text is already quite dense and most people will just ignore everything outside the example anyway [minimalist-manual].

I would instead like to propose normalization form generics for use in pseudo code definitions:

    NFx = NFD|NFC
    NFKx = NFKD|NFKC
    NFxy = NFD|NFC|NFKD|NFKC

Freestanding `X`/`Y` variables should probably be replaced to disambiguate them from the `NFx` nomenclature. `s1`/`s2` would work but `foo`/`bar` is less dense:

    NFx(caseFold(NFD(foo))) = NFx(caseFold(NFD(bar)))

`NFx` does not currently appear within the Unicode standard itself, but is used in the normalization annex [UAX15]. However, **UAX15 defines `NFx` twice**, first as NFD|NFC|NFKD|NFKC and later on as NFD|NFC. I think the proposed convention gets the most mileage out of the nomenclature and is how I have seen `NFx` used in the real world [linus].

Thank you!
-Zach Lym [w3c-charmod-norm]: https://w3c.github.io/charmod-norm/#CanonicalFoldNormalizationStep [linux-norm]: https://lwn.net/ml/linux-fsdevel/20190318202745.5200-10-krisman%40collabora.com [minimalist-manual]: https://dl.acm.org/doi/10.1207/s15327051hci0302_2 [UAX15]: https://unicode.org/reports/tr15/ [linus]: https://lore.kernel.org/linux-fsdevel/CAHk-=wiFtZL5rK3T-HQPm0oG4vekDJEKS47P8BbzHSXt_6SHuA at mail.gmail.com/ From sosipiuk at gmail.com Fri Dec 11 23:58:41 2020 From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=) Date: Sat, 12 Dec 2020 00:58:41 -0500 Subject: Normalization Generics (NFx, NFKx, NFxy) In-Reply-To: References: Message-ID: On Fri, Dec 11, 2020 at 11:49 PM Zach Lym via Unicode wrote: > > > A string X is a canonical caseless match for a string Y if and only if: > > NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y))) > > The W3C Canonical Case Fold Normalization algorithm claims to be > compatible with [D145], but uses NFC in the last step > [w3c-charmod-norm], leading to an apparent contradiction. Even though > Unicode explains that "case folding is closed under canonical > normalization" it took me a long time to find that passage and > convince myself that the W3C and Unicode matching algorithms are > equivalent. The more general rule is that: NFC(X) = NFC(Y) if and only if NFD(X) = NFD(Y). I.e. you can always replace one canonical form with the other in equivalence comparisons. (As long as you apply the same one to both sides, of course, but which one is up to you.) > I would instead like to propose normalization form generics for use in > pseudo code definitions: > > NFx = NFD|NFC > NFKx = NFKD|NFKC > NFxy = NFD|NFC|NFKD|NFKC I would prefer the last one to be: NF(K)x = NFD|NFC|NFKD|NFKC; or perhaps NF[K]x = NFD|NFC|NFKD|NFKC; to look a bit more like ABNF. 
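The rule stated above follows if NFD(NFC(s)) = NFD(s) and NFC(NFD(s)) = NFC(s) for every string s: then NFC(X) = NFC(Y) gives NFD(X) = NFD(NFC(X)) = NFD(NFC(Y)) = NFD(Y), and the same argument runs in the other direction. A rough Python sketch of a spot check follows. The function name `round_trip_failures` is hypothetical, only single code points in the BMP are tested (so this is a sanity check, not a proof for arbitrary strings), and the outcome depends on the Unicode data shipped with the interpreter.

```python
import unicodedata


def round_trip_failures(code_points):
    """Return code points where converting between NFC and NFD loses information."""
    failures = []
    for cp in code_points:
        ch = chr(cp)
        nfc = unicodedata.normalize("NFC", ch)
        nfd = unicodedata.normalize("NFD", ch)
        # Going through the other canonical form must land on the same result.
        if unicodedata.normalize("NFD", nfc) != nfd:
            failures.append(cp)
        elif unicodedata.normalize("NFC", nfd) != nfc:
            failures.append(cp)
    return failures


# Skip the surrogate range, which does not hold characters.
bmp_without_surrogates = [
    cp for cp in range(0x10000) if not 0xD800 <= cp <= 0xDFFF
]
print(len(round_trip_failures(bmp_without_surrogates)))
```

An empty failure list is what the standard's definitions lead one to expect; a non-empty one would point at the implementation or its data rather than at the rule.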
Sławomir Osipiuk

From wjgo_10009 at btinternet.com Sat Dec 12 09:39:28 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 12 Dec 2020 15:39:28 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <79fb5335.d72.17657954ac1.Webtop.223@btinternet.com>
References: <79fb5335.d72.17657954ac1.Webtop.223@btinternet.com>
Message-ID: <35cce9c0.d7f.176579b5f0a.Webtop.223@btinternet.com>

Hi

You might find the following links of interest.

The proposal was not successful and was dismissed strongly, indeed using italics for emphasis.

For the avoidance of doubt I did not advocate regarding the encoding of the mathematical italic characters as a precedent for what I proposed.

It is somewhat ironic that the refusal uses italics for emphasis and could, in my opinion, be reasonably regarded as supporting evidence for the case of what you are wanting encoding, as that emphasis cannot at present be expressed in plain text. If it is not a semantic difference then it seems to me that there is no reason whatsoever to use italics at all in that refusal notice. So has Unicode Inc. in fact shown in its refusal the very need that it is refusing to encode?

https://www.unicode.org/L2/L2019/19063-italic-vs.pdf
https://www.unicode.org/L2/L2019/19195-italic-cmt.pdf
https://forum.high-logic.com/viewtopic.php?f=10&t=7831
https://www.unicode.org/alloc/nonapprovals.html

However, such dismissals are not absolute because sometimes there is a U-turn later, for example with the encoding of emoji. Look at where emoji encoding is now, no longer about just backwards compatibility yet pushing forward with new designs.

For the avoidance of doubt I am pleased that emoji are being encoded. I wish that they would not insist that my proposals for encoding a futuristic idea of mine are out of scope and refuse to allow them to be discussed in this mailing list or put to The Unicode Technical Committee.
I note that you mention a QID item. There is an ongoing public review about encoding what are being called QID emoji.

https://www.unicode.org/review/pri408/

Although the page currently shows a closing date that has passed, the public review has, in fact, been reopened as listed on the following page.

https://www.unicode.org/review/

Best regards,

William Overington

Saturday 12 December 2020

http://www.users.globalnet.co.uk/~ngo/

My website is safe to use, it is not hosted on my own computer, but is hosted on a server run by Plusnet PLC, a United Kingdom company.

From christian.kleineidam at gmail.com Sat Dec 12 13:01:05 2020
From: christian.kleineidam at gmail.com (Christian Kleineidam)
Date: Sat, 12 Dec 2020 20:01:05 +0100
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <000301d6d00e$4d244330$e76cc990$@ewellic.org>
References: <000301d6d00e$4d244330$e76cc990$@ewellic.org>

On Fri, Dec 11, 2020 at 11:38 PM Doug Ewell wrote:

> Christian Kleineidam wrote:
>
> > "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
> > 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"
>
> "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
> haplotype to Homo sapiens"
>
> This title is completely meaningful in plain text. The convention to style
> the names of species and haplotypes in italics is just that, a styling
> convention.

Would you also say there's no semantic difference between "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT haplotype to Homo sapiens" and "EVIDENCE SUGGESTING THAT HOMO NEANDERTHALENSIS CONTRIBUTED THE H2 MAPT HAPLOTYPE TO HOMO SAPIENS"? If so, why does unicode allow those to be formatted differently?

I think that capitalization generally gets used to express semantic meaning. Capitalizing the first character of a sentence is a way to semantically mark the start of the sentence. Capitalizing Homo is a way to express semantics.
Homo gets capitalized here for the same reasons as it gets italicized. In both cases it's because the semantics of a species name dictate it if you follow official recommendations. -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sat Dec 12 16:32:33 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 12 Dec 2020 14:32:33 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> Message-ID: <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Sat Dec 12 19:25:06 2020 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=) Date: Sun, 13 Dec 2020 10:25:06 +0900 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> Message-ID: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> Asmus gives a lot of good reasons below. Here are some more: Children learn to write with upper case and lower case letters in school, and most people continue to use both as adults. (There are exceptions of course, some people write only with lower case, and some only with upper case.) On the other hand, people who distinguish upright and italic in handwriting are extremely rare (maybe limited to editors of certain journals?). Also, case is important in names. It's Ludwig van Beethoven, not Ludwig Van Beethoven, and LeBron James, not Lebron James. Italics don't come into consideration here at all. For all these reasons, the upper/lower case distinction was and is also available on typewriters and keyboards. Again not so for italic. Regards, Martin. 
On 13/12/2020 07:32, Asmus Freytag via Unicode wrote: > On 12/12/2020 11:01 AM, Christian Kleineidam via Unicode wrote: >> On Fri, Dec 11, 2020 at 11:38 PM Doug Ewell > > wrote: >> >> Christian Kleineidam wrote: >> >> > "Evidence suggesting that ???? ???????????????? >> contributed the H2 >> > ???? haplotype to ???? ???????" >> >> "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT >> haplotype to Homo sapiens" >> >> This title is completely meaningful in plain text. The convention to style >> the names of species and haplotypes in italics is just that, a styling >> convention. >> >> Would you also say there's no semantic difference between "Evidence suggesting >> that Homo neanderthalensis contributed the H2 MAPT haplotype to Homo sapiens" >> and EVIDENCE SUGGESTING THAT HOMO NEANDERTHALENSIS CONTRIBUTED THE H2 MAPT >> HAPLOTYPE TO HOMO SAPIENS"? If so, why does unicode allow those to be >> formatted differently? >> >> I think that capitalization generally gets used to express semantic meaning. >> Capitalizing the first character of a sentence is a way to semantically mark >> the start of the sentence. Capitalizing Homo is a way to express semantics. >> Homo gets capitalized here for the same reasons as it gets italicized. In both >> cases it's because the semantics of a species name dictate it if you follow >> official recommendations. > > There are significant differences in usage as well as implication. > > A style, like "italics" can be applied to nearly the entire set of Unicode > characters, while case is limited to a comparatively tiny subset. If Unicode > wanted to encode styles like it does for case, it would mean multiplying the > number of characters. > > But Mathalphabetics, you say. Well, in mathematical notation, certain styles are > applied to very limited subsets. In effect, you could argue that in those > contexts, certain stylistic variants work like case in ordinary orthographies. 
> (Mathematical use of letter shapes is special, as it is almost exclusively > using letter shapes as individual symbols, not part of words). > > Styles, commonly, are applied in runs, not to isolated code points. For case, > the default is the other way around. In both cases, the exceptions prove the > underlying rule. > > ALL UPPER CASE, as well as SMALL CAPS are more like a style than normal casing. > As shown by the way they are supported like styles in feature-rich word > processing apps.(The latter are not encoded: extending the arguments for > encoding italics would force adding support for small caps as well). > > Styles, unlike case when applied to selected letters, tends to not have > orthographic use. Even if it carries meaning that goes beyond being > "decorative". There are exceptions even here, that prove the rule. > > Finally, the guiding design principle for "plain text" is that it is stateless > (again, exceptions like bidi, are there to prove the rule). Styles, being > applied in runs, are inherently not stateless, so are best expressed in stateful > ways (that is, in one or the other rich-text protocols). > > The use case comes from lack of support of stateful text protocols (even limited > ones) in places such as social media. There is no inherent reason why Twitter, > Facebook and the like could not support "markdown" or similar protocols. > > On balance, all proposals for supporting some sort of "italics in Unicode" > ignore not only the interrelationship shown in these facts, but also the well > established historical division of "plain text" and "rich text" -- which Unicode > has no business upsetting. 
> > A./ > From prosfilaes at gmail.com Sat Dec 12 19:59:53 2020 From: prosfilaes at gmail.com (David Starner) Date: Sat, 12 Dec 2020 17:59:53 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> Message-ID: There's a lot of good answers, but I'd like to circle back to what I think is the core reason: we've had character sets for seven decades, virtually all of which supported English, and if any have supported italics, I've never heard of it. Unicode supports italics the most of any character set I've heard of. Whether in some sense italics should be encoded in plain text is not an open problem; it's been assigned to a level above plain text, and is well supported there. -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991) From indolering at gmail.com Sat Dec 12 20:23:23 2020 From: indolering at gmail.com (Zach Lym) Date: Sat, 12 Dec 2020 18:23:23 -0800 Subject: Normalization Generics (NFx, NFKx, NFxy) In-Reply-To: References: Message-ID: > The more general rule is that: > NFC(X) = NFC(Y) if and only if NFD(X) = NFD(Y). > I.e. you can always replace one canonical form with the other in > equivalence comparisons. (As long as you apply the same one to both > sides, of course, but which one is up to you.) Yes, and a careful reading of the standard will show that this is the case. But we don't live in a world where people have time to read the standard. Oh dear, I included the wrong link in my citation! 
It should have been: https://lwn.net/ml/linux-fsdevel/20190206084752.nwjkeiixjks34vao at pali/ At any rate, someone suggested using NFC, but this objection came up: >> Is there any case where >> NFC(x) == NFC(y) && NFD(x) != NFD(y) , or >> NFC(x) != NFC(y) && NFD(x) == NFD(y) > >This is good question. And I think we should get definite answer for it >prior inclusion of normalization into kernel. Which was simply never followed up on. This is a feature that was included after years of debate and developed in an open process. If even Linux can't get this one right, then we need to do a better job at explaining Unicode. > > I would instead like to propose normalization form generics for use in > > pseudo code definitions: > > > > NFx = NFD|NFC > > NFKx = NFKD|NFKC > > NFxy = NFD|NFC|NFKD|NFKC > > I would prefer the last one to be: > NF(K)x = NFD|NFC|NFKD|NFKC; or perhaps > NF[K]x = NFD|NFC|NFKD|NFKC; to look a bit more like ABNF. I don't care for NFxy either, but I strongly prefer sticking to C programming conventions. From mark at kli.org Sat Dec 12 20:48:54 2020 From: mark at kli.org (Mark E. Shoulson) Date: Sat, 12 Dec 2020 21:48:54 -0500 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> Message-ID: <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org> An HTML attachment was scrubbed... 
URL:

From doug at ewellic.org Sat Dec 12 21:20:01 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 12 Dec 2020 20:20:01 -0700
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
Message-ID: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>

Others have covered pretty much everything I was going to respond to Christian with.

David Starner wrote:

> I'd like to circle back to what I think is the core reason: we've had
> character sets for seven decades, virtually all of which supported
> English, and if any have supported italics, I've never heard of it.

The only conceivable exception might be ISO-IR-68, which represented APL and its distinctive italic uppercase letters. The registration¹ for ISO-IR-68 named the letters (e.g.) "CAPITAL APL LETTER A" and noted that they were "[u]sually printed in italics," revealing that this was merely a font preference specific to APL, as the corresponding roman (non-italic) letters were not also included. All mapping tables from ISO-IR-68, including Unicode's, map the italic APL letters to normal ASCII letters.

¹ https://www.itscj.ipsj.or.jp/iso-ir/068.pdf

Christian wrote:

> If so, why does unicode allow those [uppercase and lowercase letters]
> to be formatted differently?

For "formatted differently" I read "encoded separately"; Unicode doesn't dictate whether characters are displayed in an upright (roman) or italic style. If one uses George Douros's Akkadian font, for example, everything comes out in italics.

I wonder if the spelling "unicode" was meant here as a statement about the semantics of initial capitals.
Standard English orthography requires that trade names like "Unicode" be spelled with an initial capital, whereas no orthographic requirement exists to spell anything with italics.

We do understand that not every possible nuance of human communication, such as shades of emphasis, can be expressed in plain text. It seems that the nearly 30-year-old Unicode definition of "plain text" still has not caught on universally, since requests continue to emerge for UTC to encode things that are not plain text by that definition.

--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org

From pandey at umich.edu Sat Dec 12 21:33:08 2020
From: pandey at umich.edu (Anshuman Pandey)
Date: Sat, 12 Dec 2020 21:33:08 -0600
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
References: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
Message-ID: <64B08E21-8D4B-4AC1-85A9-C40D9E468178@umich.edu>

Doug basically covered everything I had to say.

> On Dec 12, 2020, at 9:20 PM, Doug Ewell via Unicode wrote:
>
> Others have covered pretty much everything I was going to respond to Christian with.
>
> David Starner wrote:
>
>> I'd like to circle back to what I think is the core reason: we've had
>> character sets for seven decades, virtually all of which supported
>> English, and if any have supported italics, I've never heard of it.
>
> The only conceivable exception might be ISO-IR-68, which represented APL and its distinctive italic uppercase letters. The registration[1] for ISO-IR-68 named the letters (e.g.) "CAPITAL APL LETTER A" and noted that they were "[u]sually printed in italics," revealing that this was merely a font preference specific to APL, as the corresponding roman (non-italic) letters were not also included. All mapping tables from ISO-IR-68, including Unicode's, map the italic APL letters to normal ASCII letters.
>
> [1]
https://www.itscj.ipsj.or.jp/iso-ir/068.pdf > > Christian wrote: > >> If so, why does unicode allow those [uppercase and lowercase letters] >> to be formatted differently? > > For "formatted differently" I read "encoded separately"; Unicode doesn't dictate whether characters are displayed in an upright (roman) or italic style. If one uses George Douros's Akkadian font, for example, everything comes out in italics. > > I wonder if the spelling "unicode" was meant here as a statement about the semantics of initial capitals. Standard English orthography requires that trade names like "Unicode" be spelled with an initial capital, whereas no orthographic requirement exists to spell anything with italics. > > We do understand that not every possible nuance of human communication, such as shades of emphasis, can be expressed in plain text. It seems that the nearly 30-year-old Unicode definition of "plain text" still has not caught on universally, since requests continue to emerge for UTC to encode things that are not plain text by that definition. > > -- > Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org > > > From asmusf at ix.netcom.com Sat Dec 12 22:03:58 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sat, 12 Dec 2020 20:03:58 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org> <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org> Message-ID: <8168fd17-31c3-5c23-d94d-7864bbd455a9@ix.netcom.com> On 12/12/2020 7:20 PM, Doug Ewell via Unicode wrote: > We do understand that not every possible nuance of human communication, such as shades of emphasis, can be expressed in plain text. 
It seems that the nearly 30-year-old Unicode definition of "plain text" still has not caught on universally, since requests continue to emerge for UTC to encode things that are not plain text by that definition.

If you go against an established method/truth/system/consensus/anything and win, you'll be famous. That's the lure that keeps people up at night trying to create a /perpetuum mobile/.

Problem is, the chances of winning are usually more than elusive. Doesn't prevent people from trying.

If conservation of energy, posited by Julius von Mayer in 1842 and well-tested in the over 150 years since then, does not prevent people trying the impossible, then why should 30 years of Unicode be sufficient :)

A./
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sosipiuk at gmail.com Sat Dec 12 23:28:56 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Sun, 13 Dec 2020 00:28:56 -0500
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To:
References:
Message-ID: <002401d6d110$db4ac280$91e04780$@gmail.com>

I mostly agree with the general consensus, though probably not as firmly.

However, I had a showerthought that, specifically in the case of Latin terms, marking them as such would be a legitimate use of the Unicode language tags. Indeed, an indication of “this is Latin text” would be more correct and future-proof than “this is italicized”, since the proper styling to indicate Latin text may change with the times, and because tags are default-ignorable, this approach would still be compatible with “plain text” programs. The wiki (or whatever software) could be made to italicize Latin-within-English text that is tagged as such.

I know the tags are officially deprecated, but I personally think they got a bad rap. If (and that is a big if)
a system for basic formatting (italic/bold/underlined/nonspecifically-emphasized) is ever implemented in Unicode, it should be via the default-ignorable tags.

Sławomir Osipiuk
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From asmusf at ix.netcom.com Sat Dec 12 23:45:54 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 12 Dec 2020 21:45:54 -0800
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <002401d6d110$db4ac280$91e04780$@gmail.com>
References: <002401d6d110$db4ac280$91e04780$@gmail.com>
Message-ID: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Sun Dec 13 18:51:56 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 14 Dec 2020 00:51:56 +0000
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To:
References:
Message-ID: <20201214005156.6125d895@JRWUBU2>

On Fri, 11 Dec 2020 20:14:08 -0800 Zach Lym via Unicode wrote:

> > A string X is a canonical caseless match for a string Y if and only
> > if: NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
> Even though
> Unicode explains that "case folding is closed under canonical
> normalization" it took me a long time to find that passage and
> convince myself that the W3C and Unicode matching algorithms are
> equivalent.

What does that quoted statement mean? I'm having a hard job working out what the meaning of full case folding is. I'm not having any doubts about the meaning of toCasefold(NFD(X)), so there is no issue for 'canonical caseless matching'.

Richard.

From indolering at gmail.com Sun Dec 13 22:08:08 2020
From: indolering at gmail.com (Zach Lym)
Date: Sun, 13 Dec 2020 20:08:08 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <20201214005156.6125d895@JRWUBU2>
References: <20201214005156.6125d895@JRWUBU2>
Message-ID:

> What does that quoted statement mean?
I'm having a hard job working
> out what the meaning of full case folding is. I'm not having any
> doubts about the meaning of toCasefold(NFD(X)), so there is no issue
> for 'canonical caseless matching'.

The "case folding is closed under canonical normalization" or the other part?

Closed as in closure: https://en.wikipedia.org/wiki/Closure_(mathematics)

Refer to page 240 of the standard, Chapter 5 "Implementation Guidelines" Section 18 "Case Mappings": http://www.unicode.org/versions/latest/ch05.pdf

From marius.spix at web.de Mon Dec 14 05:26:47 2020
From: marius.spix at web.de (Marius Spix)
Date: Mon, 14 Dec 2020 12:26:47 +0100
Subject: Aw: Re: Normalization Generics (NFx, NFKx, NFxy)
Message-ID:

An HTML attachment was scrubbed...
URL:

From harjitmoe at outlook.com Mon Dec 14 08:22:59 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 14 Dec 2020 14:22:59 +0000
Subject: Aw: Re: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To:
References:
Message-ID:

Marius Spix via Unicode wrote:
> I understand that:
> [:toCaseFold=s:] = [sS?]
> [:toCaseFold=?:] = [???]
> But can someone explain me the following?
> [:toCaseFold=?:] = [?]
> [:toCaseFold=i:] = [iI]
> [:toCaseFold=?:] = []
> Why is it not:
> [:toCaseFold=?:] = [iI?]
> [:toCaseFold=i:] = [iI?]
> [:toCaseFold=?:] = [??]
>

ß is often changed to SS in uppercase; the ẞ is a relatively new addition as an encoded character and is not consistently used. So PREUSSEN and Preußen are casings of the same word, for example. I think ẞ might have been added after ß's casefolding was already defined, but I'm not sure so don't quote me on that.

"I" cannot casefold to *both* "i" and "ı", it has to casefold to one of them. Not sure about "?" not casefolding the same as "I", but I don't suppose there really exists any "good" locale-independent solution for case insensitivity of "I".

--Har.
-------------- next part --------------
An HTML attachment was scrubbed...
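The folding behaviour discussed in the message above can be inspected directly. The following is only a small Python sketch: it assumes that `str.casefold()` applies the default, locale-independent full case folding, and the characters it tests (ſ, ß, ẞ, ı, İ) are a reading of what the archive's "?" marks originally stood for, so treat them as assumptions. The names `SAMPLES` and `folded_code_points` are made up for this illustration.

```python
# Characters assumed to be the ones under discussion (the archive shows "?").
SAMPLES = {
    "LATIN SMALL LETTER LONG S": "\u017f",
    "LATIN SMALL LETTER SHARP S": "\u00df",
    "LATIN CAPITAL LETTER SHARP S": "\u1e9e",
    "LATIN CAPITAL LETTER I": "I",
    "LATIN SMALL LETTER DOTLESS I": "\u0131",
    "LATIN CAPITAL LETTER I WITH DOT ABOVE": "\u0130",
}


def folded_code_points(ch: str) -> list[str]:
    """Return the code points of ch.casefold() as U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in ch.casefold()]


# Print each sample next to the code points it folds to.
for name, ch in SAMPLES.items():
    print(f"{name:40} -> {' '.join(folded_code_points(ch))}")
```

If the assumption about `str.casefold()` holds, both sharp s characters fold to "ss", dotless ı folds to itself, and İ folds to i followed by U+0307, so neither ı nor İ ends up in the same class as plain "i" unless a Turkic-specific folding is applied.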
URL:

From sosipiuk at gmail.com Mon Dec 14 11:02:26 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Mon, 14 Dec 2020 12:02:26 -0500
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
Message-ID: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>

On Sun, Dec 13, 2020 at 12:47 AM Asmus Freytag via Unicode wrote:
>
> Write a killer social media app that uses these in an integral fashion and requires them for interoperability and then sit back and watch how long they stay deprecated ...

That, or perhaps something like Wikidata could use it. ;)

I slept on it, and I'm leaning to the other side now. I think of the paper books I've read, and italics often appear within the text. Are the books "plain text"? Do the italics really fall into the category of typesetting and style, like the choice of overall font? Or are they a meaningful part of the text itself? Should it be possible to fit the content of a whole novel into a .txt file without losing any semantic meaning? The "spirit of Unicode" whispers that it should. Of course some books contain charts and graphics, and Unicode can't do everything, but if a solution can cover 95% of cases, it at least deserves consideration.

On Fri, Dec 11, 2020 at 1:13 PM Christian Kleineidam via Unicode wrote:
>
> Create a new unicode character for begin/end italic formatting and begin/end bold formatting that works like the unicode character for the Right-to-Left switch.
If you or someone else chooses to make a proposal, my own recommendation would be this:

- Assign a new character U+E0002 FORMAT TAG
- The syntax follows the specification for tagging (chapter 23.9)
- U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting.
- U+E0002 U+E007F CANCEL TAG to cancel all formatting
- Any use of U+E0002 overrides previous formatting (i.e. a "bold" tag alone cancels a previous "italic" tag), so format nesting must be done by combining all desired formats into a single tag.
- This method should only be used in cases where formatting is required without a higher-level protocol
- This method should not be used in instances where loss of formatting would greatly alter the meaning of the text or render it incomprehensible.
- Strikethrough and super/subscript are deliberately omitted for the above reason.

Advantages:

- Only a single new character needs definition.
- Uses an existing framework (tags)
- Formatting is ignorable, implementation is optional
- A viable method to preserve 95%+ of typical semantic formatting in plain-text
- IMO a stronger case to have this than either language tags or annotations (argument is to accurately preserve the lot of existing documents that include rudimentary formatting, rather than just invent new features).

Disadvantages:

https://xkcd.com/927/

Sławomir Osipiuk

From abrahamgross at disroot.org Mon Dec 14 11:15:34 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Mon, 14 Dec 2020 17:15:34 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:

Wait till u see signwriting.
now u can draw full on pictures in unicode

Dec 14, 2020 12:04:14 PM Sławomir Osipiuk via Unicode :

> Of course some books contain charts and graphics, and Unicode can't do everything,
>

From wjgo_10009 at btinternet.com Mon Dec 14 10:29:30 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Dec 2020 16:29:30 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
Message-ID: <6e3feae1.712.1766215e59a.Webtop.210@btinternet.com>

Asmus Freytag wrote:

> But you need to be successful first :)

Indeed.

An invention of mine, a container for which needs encoding into Unicode in order to achieve successful unambiguous interoperability, has been banned from being discussed in this mailing list and blocked from going before The Unicode Technical Committee because it has been deemed without explanation to be "out of scope".

Yet scope can change according to need, yet discussion of scope also has been blocked from being discussed in the mailing list. My posts have been placed on permanent moderated posts status so as to stop such discussion taking place.

So, at present, the bar is far too high for me to be able to achieve my goal of successful unambiguous interoperability for the invention. Unless discussion and fair consideration by The Unicode Technical Committee is allowed then that success will be impossible and a futuristic invention will never achieve its full potential.

Encoding into Unicode would also guarantee that the technique is applied in a non-proprietary manner.
William Overington

Monday 14 December 2020

From wjgo_10009 at btinternet.com Mon Dec 14 11:19:20 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Dec 2020 17:19:20 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID: <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>

Sławomir Osipiuk wrote:

> Of course some books contain charts and graphics, and Unicode can't do
> everything,

In my opinion, Unicode could include charts and graphics by encoding them within a plain text stream if people wanted that to be encoded.

William Overington

Monday 14 December 2020

From kenwhistler at sonic.net Mon Dec 14 13:19:12 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 14 Dec 2020 11:19:12 -0800
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
Message-ID: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>

On 12/14/2020 9:19 AM, William_J_G Overington via Unicode wrote:
> In my opinion, Unicode could include charts and graphics by encoding
> them within a plain text stream if people wanted that to be encoded.

You mean, as in the following sequence, shown here as a stream of plain text characters in email?

?

And interpreted in the following document, in context, as HTML:

https://www.unicode.org/reports/tr51/#Major_Sources

??

--Ken

P.S. for the nitpickers...
yeah, yeah, I realize that this email is delivered as HTML, so the "plain text" is itself using quoting conventions to embed in the HTML email. If you want this redelivered as actual plain text, I could accommodate.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Mon Dec 14 13:41:13 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 14 Dec 2020 11:41:13 -0800
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com> <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
Message-ID:

On Mon, Dec 14, 2020 at 11:25 AM Ken Whistler via Unicode < unicode at unicode.org> wrote:

> P.S. for the nitpickers... yeah, yeah, I realize that this email is
> delivered as HTML, so the "plain text" is itself using quoting conventions
> to embed in the HTML email. If you want this redelivered as actual plain
> text, I could accommodate.
>

No need. I can confirm that your email was sent as

*Content-Type: multipart/alternative*; boundary="------------7278B446390F3BC66C4D83C4"

And that the first part is in

*Content-Type: text/plain*; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

So we are all good.

(For Gmail users: Three-dot “More” menu on the specific message, select “Show original”)

Thanks,
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From costello at mitre.org Mon Dec 14 14:17:48 2020
From: costello at mitre.org (Roger L Costello)
Date: Mon, 14 Dec 2020 20:17:48 +0000
Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer?
Message-ID: Hi Folks, As I understand it, when the C programming language was created it just used ASCII. Programs written in C used ASCII digits. Nowadays C supports Unicode and Unicode contains more digits than just the ASCII digits. (I think) modern C programs can express numbers using strings of non-ASCII digits. Questions: 1. Is the algorithm for converting a string that contains non-ASCII digits different than the algorithm for converting a string containing ASCII digits? 2. The C function atoi() converts a string of digits to a number. I have seen the source code for atoi(). The source code that I saw was dated around the year 2000. Can you point me to the modern source code for atoi()? /Roger From wjgo_10009 at btinternet.com Mon Dec 14 15:48:07 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 14 Dec 2020 21:48:07 +0000 (GMT) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net> References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com> <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net> Message-ID: <31d43ca1.a6e.176633998ad.Webtop.216@btinternet.com> Ken Whistler wrote as follows. > You mean, as in the following sequence, shown here as a stream of > plain text characters in email? ? And interpreted in the following document, in context, as HTML: https://www.unicode.org/reports/tr51/#Major_Sources ?? Actually no, for bitmap images I was thinking of a tag character sequence method that was proposed in a document in The Unicode Technical Committee Document Register some time ago that would directly embed an image in a plain text file, no external link. It was not authored by me, I cannot find it at present. 
For vector graphics I was thinking of a tag character version of the eutographics system that I devised back in 2002. (Please note that that is eutographics, not eurographics, as which it has sometimes been incorrectly described.)

http://www.users.globalnet.co.uk/~ngo/ast03000.htm

It worked well locally using a Java applet in a web page.

So, if The Unicode Technical Committee were to include these ideas in Unicode, then Unicode could enable much more information to be communicated unambiguously and interoperably in a plain text file.

William Overington

Monday 14 December 2020

Please note that the email address used in the listings in the eutographics web page is not in regular use these days.

From harjitmoe at outlook.com Mon Dec 14 17:03:36 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 14 Dec 2020 23:03:36 +0000
Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To:
References:
Message-ID:

Roger L Costello via Unicode wrote:
> […]
> 2. The C function atoi() converts a string of digits to a number. I have seen the source code for atoi(). The source code that I saw was dated around the year 2000. Can you point me to the modern source code for atoi()?
>
> /Roger

Here is the implementation from the FreeBSD libc:

https://github.com/freebsd/freebsd/blob/master/lib/libc/stdlib/strtol.c

(|strtol| and |strtol_l| are defined in that source file. |atoi| and |atoi_l| just wrap them, passing |NULL| for |endptr| and |10| for |base|.)

--Har.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at kli.org Mon Dec 14 18:59:33 2020
From: mark at kli.org (Mark E.
Shoulson)
Date: Mon, 14 Dec 2020 19:59:33 -0500
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:

An HTML attachment was scrubbed...
URL:

From sosipiuk at gmail.com Mon Dec 14 22:36:03 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Mon, 14 Dec 2020 23:36:03 -0500
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To:
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:

On Mon, Dec 14, 2020 at 8:05 PM Mark E. Shoulson via Unicode wrote:
>
> All TAG symbols placed between a U+E003C TAG LESS-THAN SIGN and a U+E003E TAG GREATER-THAN SIGN, inclusive, are to be treated as if they were the corresponding ASCII characters, and run that through an HTML renderer. I guess if you wanted you could stipulate some reduced or restricted subset of HTML

I've been informed off-list that BabelPad uses this as a formatting option. So, it's been done.

This solution technically constitutes a higher-level protocol anyway. It's a markup language, just using unusual characters, but it's not in any fundamental way a Unicode feature, official or not.

> If this sounds disturbing and wrong to you,

Disturbing? No. Wrong? I'd say "not my first choice". There are plenty of things already approved that actually disturb me, but I won't go on that tangent now.

> then other pseudo-markup ideas probably should as well.

Pseudo-markup already exists in Unicode, in multiple, inconsistent ways. It exists because it was, at some point, by some people, deemed useful enough and compatible enough with the aims of Unicode to be included.
I'm boggled by how annotations got in. I'm well aware of scope creep and I'm not at all in favour of making Unicode a Turing-complete programming language. That's why I proposed something that fits into an already-established method that Unicode has already defined. It even includes a bit of syntactic salt in the way format nesting must be done that drives implementers to other protocols for anything beyond rudimentary effects. My guiding example is, "record fully the story text of a paperback novel". There are things that are irrelevant for this purpose, such as choice of font, or drop caps ("fancy first letters"), or page numbers, or sizing of chapter titles, etc., etc.. Even something like monospaced text is almost always used purely stylistically (to indicate in-story things like signage, computer output, telegrams.) and can be substituted with imagination by engaged readers. But italics or underlines are often a meaningful part of text and something is lost when that formatting is lost. Necessitating a higher-level protocol for something so simple, when it can be easily accommodated through an existing Unicode framework, is needlessly conservative. The thread-starter, Christian Kleineidam, gave a different use case but I think it's a valid one as well. I think this would be an easy win with not a whole lot of downside. Reading the room here, not many agree. C'est la vie. 
Cheers,
Sławomir Osipiuk

From beckiergb at gmail.com Tue Dec 15 00:47:33 2020
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Mon, 14 Dec 2020 22:47:33 -0800
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To:
References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID:

On Sat, Dec 12, 2020 at 6:03 PM David Starner via Unicode < unicode at unicode.org> wrote:

> we've had character sets for seven decades,
> virtually all of which supported English, and if any have supported
> italics, I've never heard of it.

ISCII 1991 had a mechanism called ATR Codes for applying styles and switching character sets (see Annex E of http://varamozhi.sourceforge.net/iscii91.pdf):

EF 30 - bold
EF 31 - italic
EF 32 - underline
EF 33 - expanded
EF 34 - highlight
EF 35 - outline
EF 36 - shadow
EF 37 - double height, top half
EF 38 - double height, bottom half
EF 39 - double height and width

Many character sets from 8-bit microcomputers had “inverse” or “reverse video” characters that were treated as distinct from their “normal video” counterparts. When we proposed encoding these, as atomic characters or using variation sequences or by any other means, the UTC shot down the idea completely.

The existence of existing character sets, even when one is a government standard, can't even get stylistic differences like italics or reverse video into Unicode.

-- Rebecca Bettencourt
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Tue Dec 15 11:31:36 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 15 Dec 2020 09:31:36 -0800
Subject: Italics get used to express important semantic meaning, so unicode should support them
In-Reply-To:
References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID:

On Mon, Dec 14, 2020 at 10:54 PM Rebecca Bettencourt via Unicode < unicode at unicode.org> wrote:

> Many character sets from 8-bit microcomputers had “inverse” or “reverse
> video” characters that were treated as distinct from their “normal video”
> counterparts. When we proposed encoding these, as atomic characters or
> using variation sequences or by any other means, the UTC shot down the idea
> completely.
>

Early computing systems conflated layers of processing where modern ones separate them. For example, a quarter of ASCII and of EBCDIC, respectively, was used for control codes which we inherited but which are now mostly unused because we use lower-level mechanisms instead that carry text purely as payload.

I think the plain text / rich text distinction has been quite successful.

I don't actually personally like the math-styled characters because they seem specific to a particular math tradition. When I was in high school, the vector-math teacher gave us a choice between the old style of using Fraktur/Sütterlin for vector variables vs. the new style of regular letters with an arrow on top. "Vector" markup with different style choices seems better for this kind of thing.

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Tue Dec 15 13:10:01 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 15 Dec 2020 19:10:01 +0000 Subject: Normalization Generics (NFx, NFKx, NFxy) In-Reply-To: References: <20201214005156.6125d895@JRWUBU2> Message-ID: <20201215191001.356f5795@JRWUBU2> On Sun, 13 Dec 2020 20:08:08 -0800 Zach Lym via Unicode wrote: > > What does that quoted statement mean? I'm having a hard job working > > out what the meaning of full case folding is. I'm not having any > > doubts about the meaning of toCasefold(NFD(X)), so there is no issue > > for 'canonical caseless matching'. > > The "case folding is closed under canonical normalization" or the > other part? That part. > Closed as in closure: > https://en.wikipedia.org/wiki/Closure_(mathematics) That only tells me what it means for a _set_ to be closed under an operation. What does it mean for a _function_ (or similar) to be closed under an operation? If I must use the definition for a set, then I can only conclude that for one operation to be closed under another operation, the result should be independent of the order in which they are applied. But for X = : NFD(toCasefold(X)) = toCasefold(NFD(X)) = NFC(toCasefold(X)) = toCasefold(NFC(X)) = So either "case folding is closed under canonical normalization" means something else, or it is simply not true. > Refer to page 240 of the standard, Chaper 5 "Implementation > Guidelines" Section 18 "Case Mappings": > > http://www.unicode.org/versions/latest/ch05.pdf Why? The trick is not to be deflecting by the opening paragraph in TUS Section 3.13, but to read on to find R4. Richard. 
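For anyone who wants to try the order-of-operations point from the message above, here is a minimal Python sketch. It assumes that `str.casefold()` implements the default full case folding (toCasefold) and uses `unicodedata.normalize` for NFD. The sample string <U+0345, U+0342> is chosen here purely for illustration and is not necessarily the string used in the message; the helper names `nfd` and `canonical_caseless_match` are likewise invented for this sketch.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def canonical_caseless_match(x: str, y: str) -> bool:
    """NFD(toCasefold(NFD(X))) == NFD(toCasefold(NFD(Y)))."""
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())


# Assumed sample: COMBINING GREEK YPOGEGRAMMENI (combining class 240)
# followed by COMBINING GREEK PERISPOMENI (combining class 230).
X = "\u0345\u0342"

# Fold first, then normalize: U+0345 becomes the base letter U+03B9,
# so the perispomeni stays after it.
fold_then_nfd = nfd(X.casefold())

# Normalize first, then fold: NFD reorders the two marks by combining
# class before U+0345 is folded to U+03B9.
nfd_then_fold = nfd(X).casefold()

print([f"U+{ord(c):04X}" for c in fold_then_nfd])
print([f"U+{ord(c):04X}" for c in nfd_then_fold])
print("same result:", fold_then_nfd == nfd_then_fold)

# Canonically equivalent inputs still match under the full definition,
# because the inner NFD is applied before folding on both sides.
print(canonical_caseless_match(X, nfd(X)))
```

Under these assumptions the two orders give different strings, which is why the definition of canonical caseless matching wraps the folding in NFD on both sides instead of relying on folding and normalization commuting.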
From indolering at gmail.com Tue Dec 15 14:04:00 2020 From: indolering at gmail.com (Zach Lym) Date: Tue, 15 Dec 2020 12:04:00 -0800 Subject: Normalization Generics (NFx, NFKx, NFxy) In-Reply-To: <20201215191001.356f5795@JRWUBU2> References: <20201214005156.6125d895@JRWUBU2> <20201215191001.356f5795@JRWUBU2> Message-ID: Okay, so points for pedantry ... but do you have any input on adding normalization generics to Unicode pseudocode? Or would you like to split this discussion out into a new topic? On Tue, Dec 15, 2020 at 11:21 AM Richard Wordingham via Unicode wrote: > > On Sun, 13 Dec 2020 20:08:08 -0800 > Zach Lym via Unicode wrote: > > > > What does that quoted statement mean? I'm having a hard job working > > > out what the meaning of full case folding is. I'm not having any > > > doubts about the meaning of toCasefold(NFD(X)), so there is no issue > > > for 'canonical caseless matching'. > > > > The "case folding is closed under canonical normalization" or the > > other part? > > That part. > > > Closed as in closure: > > https://en.wikipedia.org/wiki/Closure_(mathematics) > > That only tells me what it means for a _set_ to be closed under an > operation. What does it mean for a _function_ (or similar) to be > closed under an operation? > > If I must use the definition for a set, then I can only conclude that > for one operation to be closed under another operation, the result > should be independent of the order in which they are applied. > > But for X = COMBINING GREEK PERISPOMENI>: > > NFD(toCasefold(X)) = SMALL LETTER IOTA, U+0342> > > toCasefold(NFD(X)) = > > NFC(toCasefold(X)) = PERISPOMENI> > > toCasefold(NFC(X)) = U+03B9> > > So either "case folding is closed under canonical normalization" means > something else, or it is simply not true. > > > Refer to page 240 of the standard, Chaper 5 "Implementation > > Guidelines" Section 18 "Case Mappings": > > > > http://www.unicode.org/versions/latest/ch05.pdf > > Why? 
> > The trick is not to be deflected by the opening paragraph in TUS > Section 3.13, but to read on to find R4. > > Richard. From abrahamgross at disroot.org Tue Dec 15 14:52:35 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Tue, 15 Dec 2020 20:52:35 +0000 (UTC) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> Message-ID: <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> Unicode refused to encode Arabic letter variants (not counting compatibility chars), which are taught in school and which adults use, and it's how Arabic is written, so your argument here doesn't hold water. Dec 12, 2020 8:26:10 PM Martin J. Dürst via Unicode : > Children learn to write with upper case and lower case letters in school, and most people continue to use both as adults. (There are exceptions of course, some people write only with lower case, and some only with upper case.)
> From indolering at gmail.com Tue Dec 15 16:28:55 2020 From: indolering at gmail.com (Zach Lym) Date: Tue, 15 Dec 2020 14:28:55 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> Message-ID: > If you or someone else chooses to make a proposal, my own recommendation would be this: > > - Assign a new character U+E0002 FORMAT TAG > - The syntax follows the specification for tagging (chapter 23.9) > - U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting. How would one implement blink? I would consider that top priority, as it was explicitly designed for styling plaintext. From richard.wordingham at ntlworld.com Tue Dec 15 16:32:16 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 15 Dec 2020 22:32:16 +0000 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <000301d6d00e$4d244330$e76cc990$@ewellic.org> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> Message-ID: <20201215223216.339e3a0a@JRWUBU2> On Fri, 11 Dec 2020 15:38:07 -0700 Doug Ewell via Unicode wrote: > Christian Kleineidam wrote: > > > "Evidence suggesting that ???? ???????????????? contributed the H2 > > ???? haplotype to ???? ???????" > > "Evidence suggesting that Homo neanderthalensis contributed the H2 > MAPT haplotype to Homo sapiens" > > This title is completely meaningful in plain text. The convention to > style the names of species and haplotypes in italics is just that, a > styling convention. Yet there are cases where meaning is completely lost. 
There was a Latin script spelling for Pali and Sanskrit that used italicised guttural letters for palatals, and italicised letters where nowadays we normally have a dot below. I think this scheme was introduced by Max Mueller. Thus, a Sanskrit sequence meaning 'and this' is written not 'tacca' but 'ta??a'. (I naturally misread the latter as though it were 'takka'.) That naturally raises the question of how such italic letters are to be italicised! I've also seen phonetic respelling of English in the Thai script where italicised consonants are used for English consonants for which Thai has no equivalent. When documenting programs, there is a massive gain in readability when the lower case names of programs and variables are written out in a typewriter-style font like Courier. (Some monospace fonts lack the distinctiveness.) Richard. From kent.b.karlsson at bahnhof.se Tue Dec 15 17:07:05 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 16 Dec 2020 00:07:05 +0100 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> Message-ID: (Below) > On 14 Dec 2020, at 18:02, Sławomir Osipiuk via Unicode wrote: > If you or someone else chooses to make a proposal, my own recommendation would be this: > > - Assign a new character U+E0002 FORMAT TAG > - The syntax follows the specification for tagging (chapter 23.9) > - U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting. > - U+E0002 U+E007F CANCEL TAG to cancel all formatting > - Any use of U+E0002 overrides previous formatting (i.e.
a "bold" tag alone cancels a previous "italic" tag), so format nesting must be done by combining all desired formats into a single tag. > - This method should only be used in cases where formatting is required without a higher-level protocol > - This method should not be used in instances where loss of formatting would greatly alter the meaning of the text or render it incomprehensible. > - Strikethrough and super/subscript are deliberately omitted for the above reason. Now, where did I see something very much like this??? ... ... Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the "invisible by default" ("default ignorable") IF parsed correctly). And... ECMA-48 is already a standard. And... ECMA-48 is already successful, and still used every day by very many people. Though it is primarily used in terminal emulators. (Nit: ECMA-48 does have strikethrough... And more. As does HTML/CSS, and when doing "copy as plain text", that formatting also disappears.)
Your U+E0002 FORMAT TAG: ECMA-48 CSI ... m
Your U+E0062 (bold): ECMA-48 CSI 1m
Your U+E0065 (emphatic): don't know what you mean by that
Your U+E0069 (italic): ECMA-48 CSI 3m
Your U+E0079 (underlined): ECMA-48 CSI 4m
Your U+E007F CANCEL TAG: ECMA-48 CSI 0m
It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the "default ignorable" part of this a bit easier. Extra nit: Some markdowns (however did that name stick?) allow for strikethrough as well, as -stricken-. Though a bit intuitive, it way too often has an unexpected effect where no strikethrough was intended (try doing "ls -l" in your Linux terminal, and paste the result into some place that has that kind of markdown). "Math Italic" is a hack for MathML. If done right, MathML would not have needed them either. "Math Italic" for emphasis in running text (not MathML) only "works" (sort of, and partially) for English, nearly no other language.
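The ECMA-48 mapping listed above can be tried in most terminal emulators. What follows is a minimal Python sketch under stated assumptions: the helper name `sgr` and the `STYLES` table are hypothetical, CSI is taken to be ESC followed by `[`, and the parameters 1, 3, 4 and 0 are the ones quoted in the message. Whether italic is actually rendered depends on the terminal.

```python
# Control Sequence Introducer in its 7-bit form: ESC [
CSI = "\x1b["

# SGR parameters as quoted in the message above (hypothetical table name).
STYLES = {"bold": 1, "italic": 3, "underline": 4}
RESET = 0


def sgr(text, *styles):
    """Wrap text in an SGR sequence for the given styles, then reset.

    Several styles are combined into one sequence with ';' separators,
    which mirrors the "combine all formats into a single tag" rule
    in the quoted proposal.
    """
    params = ";".join(str(STYLES[name]) for name in styles)
    return f"{CSI}{params}m{text}{CSI}{RESET}m"


title = (
    "Evidence suggesting that "
    + sgr("Homo neanderthalensis", "italic")
    + " contributed the H2 MAPT haplotype to "
    + sgr("Homo sapiens", "italic")
)
print(title)
```

A process that does not interpret the sequences will show the escape bytes rather than ignore them, which is the "if parsed correctly" caveat raised in the thread.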
Please don't use the "Math italic/bold/etc" outside of MathML. /Kent Karlsson PS First edition of ECMA-48 came in 1976. About 44 years ago. > Advantages: > - Only a single new character needs definition. > - Uses an existing framework (tags) > - Formatting is ignorable, implementation is optional > - A viable method to preserve 95%+ of typical semantic formatting in plain-text > - IMO a stronger case to have this than either language tags or annotations (argument is to accurately preserve the lot of existing documents that include rudimentary formatting, rather than just invent new features). > > Disadvantages: > https://xkcd.com/927/ > > Sławomir Osipiuk > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From billposer2 at gmail.com Tue Dec 15 17:10:07 2020 From: billposer2 at gmail.com (Bill Poser) Date: Tue, 15 Dec 2020 15:10:07 -0800 Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer? In-Reply-To: References: Message-ID: What do you mean by "non-ASCII digits"? Things like superscript and subscript versions of the usual Western "Arabic" numbers? Or are you talking about numbers like those of Chinese, Roman numerals, Tamil, etc.? In the case of the former, once you map the digits to their standard forms, the algorithm is the same. In the case of the latter, no, in many cases very different algorithms are required. On Mon, Dec 14, 2020 at 12:28 PM Roger L Costello via Unicode < unicode at unicode.org> wrote: > Hi Folks, > > As I understand it, when the C programming language was created it just > used ASCII. Programs written in C used ASCII digits. > > Nowadays C supports Unicode and Unicode contains more digits than just the > ASCII digits. (I think) modern C programs can express numbers using strings > of non-ASCII digits. > > Questions: > > 1.
Is the algorithm for converting a string that contains non-ASCII digits > different than the algorithm for converting a string containing ASCII > digits? > > 2. The C function atoi() converts a string of digits to a number. I have > seen the source code for atoi(). The source code that I saw was dated > around the year 2000. Can you point me to the modern source code for atoi()? > > /Roger > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Tue Dec 15 17:26:31 2020 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 15 Dec 2020 18:26:31 -0500 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> Message-ID: <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org> An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Tue Dec 15 17:45:11 2020 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 15 Dec 2020 15:45:11 -0800 Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer? In-Reply-To: References: Message-ID: I suspect that Roger is just looking at decimal digits (property gc=Nd ). I believe that they can all be parsed like strings of ASCII digits (and you can call ICU or other libraries to get at the digit values and other properties). I suggest you double-check about the RTL digits (N'Ko & Adlam); please take a look at the relevant Unicode book chapters. What's more interesting is handling the grouping and decimal separators which differ by both language and region. Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sosipiuk at gmail.com Tue Dec 15 18:41:09 2020 From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=) Date: Tue, 15 Dec 2020 19:41:09 -0500 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org> References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org> Message-ID: On Tue, Dec 15, 2020 at 6:26 PM Mark E. Shoulson wrote: > > But how is that different from anything being proposed? If this idea were accepted as part of Unicode, then it *would* be a feature of Unicode, just as whatever is being proposed would be if it were accepted. How does it matter if italicizing something is marked by some new U+DEADBF characters or by existing tag characters? - Rather than a completely new method, it's "just" an extension of an existing feature. (Tag syntax, scope, and default ignorability are already defined in the Unicode standard) - The syntax "naturally" discourages complicated format nesting. Unicode may formally restrict format combos. > If you insist that Unicode-compliant text readers must show italics or bold when marked with such-and-such characters, Absolutely not! > Conversely, if you're okay with pseudo-markup, this should sound fine to you. Why doesn't it? "Not my first choice" is what I said. It's not bad, but its similarity to HTML is not a good thing in my eyes, because it raises the question "I can do this in HTML, why can't I do it in UnicodeML??" and push for more and more HTML features to be included. It encourages feature creep, which I said I'm against. Familiarity is not always a good thing. > (how would this markup interact with other markup, like HTML, I wonder?) 
(From the Unicode Standard, page 916, with [] additions by me; notice how little the text changes) "The rules for Unicode conformance for the tag characters are exactly the same as those for any other Unicode characters. A conformant process is not required to interpret the tag characters. If it does interpret them, it should interpret them according to the standard -- that is, as spelled-out tags. However, there is no requirement to provide a particular interpretation of the text because it is tagged with a given language [or formatting]. If an application does not interpret tag characters, it should leave their values undisturbed and do whatever it does with any other uninterpreted characters. [...] "Implementations of Unicode that already make use of out-of-band mechanisms for language [or format] tagging or "heavy-weight" in-band mechanisms such as XML or HTML will continue to do exactly what they are doing and will ignore the tag characters completely. They may even prohibit their use to prevent conflicts with the equivalent markup." Sławomir Osipiuk From richard.wordingham at ntlworld.com Tue Dec 15 18:42:29 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 16 Dec 2020 00:42:29 +0000 Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer? In-Reply-To: References: Message-ID: <20201216004229.51af1612@JRWUBU2> On Tue, 15 Dec 2020 15:45:11 -0800 Markus Scherer via Unicode wrote: > I suspect that Roger is just looking at decimal digits (property gc=Nd > > ). > I believe that they can all be parsed like strings of ASCII digits > (and you can call ICU or other libraries to get at the digit values > and other properties). > I suggest you double-check about the RTL digits (N'Ko & Adlam); > please take a look at the relevant Unicode book chapters. It looks as though the N'Ko section documents the significance by accident!
I thought a policy was going to be documented (2012 or slightly later) that decimal digits are stored most significant digit first, but that doesn't seem to have happened. Richard. From sosipiuk at gmail.com Tue Dec 15 19:14:42 2020 From: sosipiuk at gmail.com (Sławomir Osipiuk) Date: Tue, 15 Dec 2020 20:14:42 -0500 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> Message-ID: On Tue, Dec 15, 2020 at 6:07 PM Kent Karlsson wrote: > Now, where did I see something very much like this??? > Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the "invisible by default" ("default ignorable") IF parsed correctly). ECMA-48 aka ISO 6429 was on my mind the moment I read the OP. I didn't mention it because it's a bit outdated (even if I do have a fondness for it) and if you're using such a thing, why not a more modern HTML subset, or BBCode, or any number of other options in use or from the list the OP gave? There are, after all, so many to choose from. And if none of those satisfy, you can always make your own! But that "if parsed correctly" is quite the nit, isn't it? > It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the "default ignorable" part of this a bit easier. And this is just the BabelPad solution but applied to a different protocol. Replacing regular markup by corresponding characters from the tag block to gain ignorable-ness may seem like a cool idea at first, but it's just spinning yet another markup. (With no offense intended to BabelPad's author; it's not a bad idea except that it starts at the bottom of the mountain just like any other.) Tag syntax is already part of Unicode.
I'd rather use it than import something wholesale from another protocol. Finally, what I'm envisioning ? and I'm not sure how closely this matches Christian Kleineidam's intention (where did he go, anyway?) ? is not Yet Another Presentation Layer or a Shiny New Toy for people to use in their tweets, but more of a sombre hint that "in the original source document, this text had an alternative presentation; indicate this to the user in an appropriate way, if applicable". It's meant for preservation, not decoration. That's why I hear the "spirit of Unicode". S?awomir Osipiuk From copypaste at kittens.ph Tue Dec 15 19:58:57 2020 From: copypaste at kittens.ph (Fredrick Brennan) Date: Tue, 15 Dec 2020 20:58:57 -0500 Subject: =?UTF-8?B?Mcui4bWXLCAy4oG/4bWILCAzyrPhtYgsIDThtZfKsCDigKYgOeG1l8qw?= Message-ID: <9137826.KFeHLySHN7@laptop> Hello! With Unicode superscript lowercase letters, dates with superscript ordinal indicators in English can be written in plaintext, e.g.: 1?? of January, 2?? of February, 3?? of March, 4?? of April, and so on. The only problem I've encountered is in font fallback; fonts are more likely to contain ? than the other letters due to its use in Pe?h-?e-j? and IPA. So, ? often appears in a different style in the word 2?? for example. This can be somewhat avoided by using a font which supports all the letters, such as Gentium Plus, EB Garamond, etc. However, I have a feeling that this use is an abuse of the standard, but that brings up an interesting comparison with the ordinal indicators for Spanish, Portuguese (& other languages?), the masculine ? and the feminine ?. If anyone has time to answer, why is one an abuse and the other not, if indeed 1?? is an abuse as I think? If it's not an abuse, then that could perhaps be an argument for the necessity of encoding ????????? ???????? ?????? s???? ?, as ? is one of the few letters without a combining counterpart in Cyrillic Extended-A or Extended-B. 
(Of course, no-break spaces would need to be used to write Russian 2-? if this character were to be encoded, e.g. as U+32 U+A0 U+XXXX, while no-break spaces aren't needed for Latin.) Best, Fred Brennan -------------- next part -------------- An HTML attachment was scrubbed... URL: From copypaste at kittens.ph Tue Dec 15 20:04:55 2020 From: copypaste at kittens.ph (Fredrick Brennan) Date: Tue, 15 Dec 2020 21:04:55 -0500 Subject: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ In-Reply-To: <9137826.KFeHLySHN7@laptop> References: <9137826.KFeHLySHN7@laptop> Message-ID: <2171140.htiGsxgcq4@laptop> Oh dear, my email-client was erroneously configured to use SHIFT_JIS, which mangled my message. Corrections... On Tuesday, December 15, 2020 8:58:57 PM EST I wrote: > 1?? of January, 2?? of February, 3?? of March, 4?? of April, and so on. 1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so on. > to contain ? than the other letters due to its use in Pe?h-?e-j? and IPA. > So, ? often appears in a different style in the word 2?? for example. to contain ⁿ than the other letters due to its use in Pe̍h-ōe-jī and IPA. So, ⁿ often appears in a different style in the word 2ⁿᵈ for example. > the masculine ? and the feminine ?. the masculine º and the feminine ª. > if indeed 1?? is an abuse as I think? if indeed 1ˢᵗ is an abuse as I think? > necessity of encoding ????????? ???????? ?????? s???? ? necessity of encoding COMBINING CYRILLIC LETTER SHORT I Very ironic :) Best, Fred Brennan From mark at kli.org Tue Dec 15 20:36:07 2020 From: mark at kli.org (Mark E.
Shoulson) Date: Tue, 15 Dec 2020 21:36:07 -0500 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org> Message-ID: <0e31ec9e-490f-191a-c912-5a56d9abb602@kli.org> An HTML attachment was scrubbed... URL: From indolering at gmail.com Tue Dec 15 21:18:41 2020 From: indolering at gmail.com (Zach Lym) Date: Tue, 15 Dec 2020 19:18:41 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> Message-ID: > Finally, what I'm envisioning -- and I'm not sure how closely this > matches Christian Kleineidam's intention (where did he go, anyway?) -- > is not Yet Another Presentation Layer or a Shiny New Toy for people to > use in their tweets, but more of a sombre hint that "in the original > source document, this text had an alternative presentation; indicate > this to the user in an appropriate way, if applicable". It's meant for > preservation, not decoration. That's why I hear the "spirit of > Unicode". For those of us that can recall the exuberance of the XHTML movement, <b>, <i> and friends were all deemed to be insufficiently semantic and slated to be replaced by <strong> and <em>. Of course, this was a distinction without a difference and now we just have extra tags that are more verbose and less literal. But that begs the question: if the authors of a rich text standard can't agree on what counts as semantic, how would Unicode decide? What about , , or as I previously suggested <blink>? <blink> was added to HTML because it was the only styling that could be displayed in plaintext console environments.
So if <blink> doesn't make your cutoff, then I guess the bar is personal taste? The line between semantics and styling is inherently fuzzy, but every attempt at encoding similarly fuzzy semantics within Unicode is something humanity must deal with for the rest of all time. Take the newline vs paragraph separators, a noble attempt at trying to encode what essentially amounts to the plaintext/typewriter hack of using \n\n to insert whitespace after a paragraph. No-one uses either of them, not even Markdown (which does use and ) because most plain text doesn't make the distinction, users can't input it via a keyboard, and no one else supports it. Yet myself and a colleague had to spend waaaay too much of our short lives figuring out what to support as breaking separators in WASI text streams. What puzzles me is why this discussion wasn't moderated to the null bin. This *exact* question is answered in the FAQ and is regularly shot down. -Zach Lym From prosfilaes at gmail.com Tue Dec 15 22:19:46 2020 From: prosfilaes at gmail.com (David Starner) Date: Tue, 15 Dec 2020 20:19:46 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org> Message-ID: On Tue, Dec 15, 2020 at 4:47 PM Sławomir Osipiuk via Unicode wrote: > "Implementations of Unicode that already make use of out-of-band > mechanisms for language [or format] tagging or "heavy-weight" in-band > mechanisms such as XML or HTML will continue to do exactly what they > are doing and will ignore the tag characters completely. They may even > prohibit their use to prevent conflicts with the equivalent markup."
So every single thing that interfaces with HTML now has to handle Unicode italics on any plain text input, or silently dump them into the stream, and the web browser may have to handle them or not. > It's meant for preservation, not decoration. I've done preservation, and don't see how this helps at all. You can go with various preservation file formats, like TEI Lite, or various more directly readable file formats like HTML or PDF. None of those has any problem handling italics. Plain text willfully drops many details, so probably isn't a realistic choice for preservation. -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991) From asmusf at ix.netcom.com Tue Dec 15 23:49:48 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 15 Dec 2020 21:49:48 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org> Message-ID: <416d509b-b97c-5153-ec4c-aae451570919@ix.netcom.com> An HTML attachment was scrubbed... URL: From john.w.kennedy at gmail.com Wed Dec 16 07:13:07 2020 From: john.w.kennedy at gmail.com (John W Kennedy) Date: Wed, 16 Dec 2020 08:13:07 -0500 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: Message-ID: -- John W. Kennedy "Compact is becoming contract, Man only earns and pays." -- Charles Williams. "Bors to Elayne: On the King's Coins" > On Dec 15, 2020, at 10:25 PM, Zach Lym via Unicode wrote: > > ? >> >> Finally, what I'm envisioning ? and I'm not sure how closely this >> matches Christian Kleineidam's intention (where did he go, anyway?) ? 
>> is not Yet Another Presentation Layer or a Shiny New Toy for people to >> use in their tweets, but more of a sombre hint that "in the original >> source document, this text had an alternative presentation; indicate >> this to the user in an appropriate way, if applicable". It's meant for >> preservation, not decoration. That's why I hear the "spirit of >> Unicode". > > For those of us that can recall the exuberance of the XHTML movement, > <b>, <i> and friends were all deemed to be insufficiently semantic and > slated to be replaced by <strong> and <em>. Of course, this was a > distinction without a difference and now we just have extra tags that > are more verbose and less literal. <strong> and <em> go back to HTML+ in 1993, where they replaced and from the original HTML, which had inherited them from IBM's original GML (no S) of the 1970s. From costello at mitre.org Wed Dec 16 07:47:58 2020 From: costello at mitre.org (Roger L Costello) Date: Wed, 16 Dec 2020 13:47:58 +0000 Subject: Unicode is universal, so how come that universality doesn't apply to digits? Message-ID: Hi Folks, Unicode makes it possible to write things in different languages. For example, rather than this XML: 42 a Bengali-speaking person can write this: 42 Or, in a programming language, rather than this assignment statement: Number_Students = 42 a Bengali-speaking person can write this: ??????_????? = 42 That's awesome. But, but, but, ... how come that universality doesn't extend to digits? How come we can only use these digits: 0 (hex 30), 1 (hex 31), ..., 9 (hex 39)? Why, for example, can't a Bengali-speaking person use the Bengali digits: Bengali digit 0 (U+09E6), Bengali digit 1 (U+09E7), ..., Bengali digit 9 (U+09EF)? Why, for example, can't a Bengali-speaking person create XML such as this: ?? or write a program assignment statement like this: ??????_????? = ?? Let me explain why I assert that the Bengali-speaking person "cannot" do that.
Numbers in an XML document or in a program are just strings and, to perform arithmetic operations on them, those string numbers must be converted to actual numbers. I looked at the source code for the C function (strtol) that converts strings to numbers and here is the key to how it converts a character digit to a number digit: digit_number = digit_character - '0' Yikes! That generates a number digit by treating the character digit as a number and subtracting the number corresponding to the character '0'. For example, if the character digit is '4' (hex 34) then when we subtract '0' (hex 30) we get the number 4. Perfect! But ... only if we allow European digits (0, 1, ..., 9). Clearly, if we were to subtract '0' (hex 30) from the Bengali digit 4 we do not get the number 4. Thus I conclude: * When expressing numbers, the only digits that can be used are the European digits * Unicode is universal, but that universality does not apply to digits or numbers Obviously I am not understanding something correctly. Please help me to understand. /Roger -------------- next part -------------- An HTML attachment was scrubbed... URL: From marius.spix at web.de Wed Dec 16 07:56:58 2020 From: marius.spix at web.de (Marius Spix) Date: Wed, 16 Dec 2020 14:56:58 +0100 Subject: Aw: Re: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ In-Reply-To: <2171140.htiGsxgcq4@laptop> References: <9137826.KFeHLySHN7@laptop> <2171140.htiGsxgcq4@laptop> Message-ID: This is similar to the mathematical italic letters, where ? and ? have very special appearances depending on the font used; ?(?) (function of x) is a very common character sequence in mathematical contexts. For some reason 𝓵 (U+1D4F5) and ℓ (U+2113) are different characters, because the latter is also used for the unit litre. Gesendet: Mittwoch, 16.
Dezember 2020 um 03:04 Uhr Von: "Fredrick Brennan via Unicode" An: "Unicode Discussion" Betreff: Re: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ Oh dear, my email-client was erroneously configured to use SHIFT_JIS, which mangled my message. Corrections... On Tuesday, December 15, 2020 8:58:57 PM EST I wrote: > 1?? of January, 2?? of February, 3?? of March, 4?? of April, and so on. 1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so on. > to contain ? than the other letters due to its use in Pe?h-?e-j? and IPA. > So, ? often appears in a different style in the word 2?? for example. to contain ⁿ than the other letters due to its use in Pe̍h-ōe-jī and IPA. So, ⁿ often appears in a different style in the word 2ⁿᵈ for example. > the masculine ? and the feminine ?. the masculine º and the feminine ª. > if indeed 1?? is an abuse as I think? if indeed 1ˢᵗ is an abuse as I think? > necessity of encoding ????????? ???????? ?????? s???? ? necessity of encoding COMBINING CYRILLIC LETTER SHORT I Very ironic :) Best, Fred Brennan From harjitmoe at outlook.com Wed Dec 16 08:50:56 2020 From: harjitmoe at outlook.com (Harriet Riddle) Date: Wed, 16 Dec 2020 14:50:56 +0000 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> Message-ID: > For those of us that can recall the exuberance of the XHTML movement, > <b>, <i> and friends were all deemed to be insufficiently semantic and > slated to be replaced by <strong> and <em>. Of course, this was a > distinction without a difference and now we just have extra tags that > are more verbose and less literal. Not strictly speaking -- although <b> and <i> are back in vogue, <i> is now only supposed to be used for italics which set text apart in some other fashion as opposed to emphasising it (which should still be done with <em>).
The distinction may appear "without a difference" for graphically displaying text in visual clients, but they can represent considerably different tone changes when reading it out (a relevant consideration if you are writing, say, an aural client for the visually impaired), hence using these properly is /theoretically/ more accessible, though I do not know to what extent that is true in practice since there's bound to be a lot of deployed legacy, WYSIWYG or generated-from-Markdown-etc HTML which doesn't make this distinction, which might preclude relying on it. --Har -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Dec 16 09:40:15 2020 From: doug at ewellic.org (Doug Ewell) Date: Wed, 16 Dec 2020 08:40:15 -0700 Subject: RE: Unicode is universal, so how come that universality doesn't apply to digits? In-Reply-To: References: Message-ID: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> What I don't understand here is why this is being framed implicitly as a Unicode problem, or an XML problem, or a general law of nature ("why can't a Bengali-speaking person use the Bengali digits"), instead of an inherent limitation of that particular library function used for that particular language. One could easily extend strtol() to accept a string of characters with a General_Category of "Nd", and use the Numeric_Value property of each character to get its numeric value instead of subtracting 48 (ASCII '0'). Of course, in order to do that, the Unicode properties General_Category and Numeric_Value must be available to the conversion function. The C language and its standard libraries are optimized for speed and size, and are still chosen to this day when speed and size are at a premium. Operating only on ASCII '0' through '9' and subtracting ASCII '0' to get the numeric value is much faster and lighter-weight than table lookup. ICU probably provides a method to do this in C.
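The property-based conversion described above (accept only General_Category "Nd", take each character's numeric value from the Unicode data) can be sketched in Python, whose standard `unicodedata` module exposes both properties. The function name `parse_decimal` is hypothetical, and the sketch makes no attempt at signs, separators, or rejecting mixed scripts.

```python
import unicodedata


def parse_decimal(text):
    """Convert a string of Unicode decimal digits (gc=Nd) to an int.

    Hypothetical helper: digits are assumed to be stored most
    significant first, and any non-Nd character is rejected.
    """
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        if unicodedata.category(ch) != "Nd":
            raise ValueError(f"not a decimal digit: U+{ord(ch):04X}")
        # unicodedata.decimal() returns the digit's value, 0 through 9
        value = value * 10 + unicodedata.decimal(ch)
    return value


print(parse_decimal("42"))              # ASCII digits
print(parse_decimal("\u09ea\u09e8"))    # BENGALI DIGIT FOUR, BENGALI DIGIT TWO
```

Python's built-in `int()` already accepts such strings, so the explicit loop is only there to show where the two properties enter the algorithm.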
A good follow-up question for me is why the heavier-weight C# and .NET Framework (Core, Standard) also don't support non-ASCII digits in the Convert.ToInt32() method, even when the string of digits is all from the same script (unlike your mixed Bengali/Oriya example), and even when the appropriate locale is specified as a parameter. C# compiles to intermediate code and runs in an interpreter, and has huge libraries available to it, including all of the Unicode properties, so the "speed and size" constraints don't apply as much. But this is still a characteristic of the code libraries, not a Unicode problem. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From wjgo_10009 at btinternet.com Wed Dec 16 10:02:00 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 16 Dec 2020 16:02:00 +0000 (GMT) Subject: =?UTF-8?Q?Re:_Unicode_is_universal,_so_how_come_th?= =?UTF-8?Q?at_universality_doesn=E2=80=99t_apply_to_digits=3F?= In-Reply-To: References: Message-ID: Hi Well, is the way to make progress that Unicode Inc. could make available a pseudo-code algorithm that can be converted to various programming languages that is such that the way that a digit is derived from the text characters is an algorithm with a structure of the form if (digit_character >= 'A') AND (digit_character <= 'B') then digit_number := digit_character - 'C' elsif (digit_character >= 'D') AND (digit_character <= 'E') then digit_number := digit_character - 'F' elsif ... . . . elsif ... end; where A, B, C, D etc in the above are here each a placeholder for a Unicode character for the start and end of a range of digit characters as appropriate? Would that do it? Assuming that compiler manufacturers used the algorithm, converted as appropriate! :-) The algorithm written once and then updated as needed by Unicode Inc., then applicable throughout many programming languages. 
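The pseudo-code above could be rendered roughly as follows. This is only a sketch: the table DIGIT_RANGES and the function digit_value are hypothetical names, and the three ranges are an illustrative subset, not a complete list of Unicode digit ranges.

```python
# (first, last) code points of each run of ten digits.
# Illustrative and deliberately incomplete; a real table would need
# every decimal-digit range and would have to track new Unicode versions.
DIGIT_RANGES = [
    (0x0030, 0x0039),  # DIGIT ZERO .. DIGIT NINE
    (0x0660, 0x0669),  # ARABIC-INDIC DIGIT ZERO .. NINE
    (0x09E6, 0x09EF),  # BENGALI DIGIT ZERO .. NINE
]

def digit_value(ch: str) -> int:
    """Return the numeric value of a digit by subtracting the zero of its range."""
    cp = ord(ch)
    for first, last in DIGIT_RANGES:
        if first <= cp <= last:
            return cp - first
    raise ValueError(f"not in a known digit range: U+{cp:04X}")

print(digit_value("4"))       # ASCII four
print(digit_value("\u09EA"))  # BENGALI DIGIT FOUR
```

Both calls give 4. In this rendering the placeholder 'C' in the pseudo-code is simply the first code point of the range, since each range starts at its zero.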
Best regards, William Overington Wednesday 16 December 2020 ------ Original Message ------ From: "Roger L Costello via Unicode" To: "unicode at unicode.org" Sent: Wednesday, 2020 Dec 16 At 13:47 Subject: Unicode is universal, so how come that universality doesn’t apply to digits? Hi Folks, Unicode makes it possible to write things in different languages. For example, rather than this XML: 42 a Bengali-speaking person can write this: 42 Or, in a programming language, rather than this assignment statement: Number_Students = 42 a Bengali-speaking person can write this: ??????_????? = 42 That’s awesome. But, but, but, … how come that universality doesn’t extend to digits? How come we can only use these digits: 0 (hex 30), 1 (hex 31), …, 9 (hex 39)? Why, for example, can’t a Bengali-speaking person use the Bengali digits: Bengali digit 0 (U+09E6), Bengali digit 1 (U+09E7), …, Bengali digit 9 (U+09EF)? Why, for example, can’t a Bengali-speaking person create XML such as this: ৪২ or write a program assignment statement like this: ??????_????? = ৪২ Let me explain why I assert that the Bengali-speaking person “cannot” do that. Numbers in an XML document or in a program are just strings and, to perform arithmetic operations on them, those string numbers must be converted to actual numbers. I looked at the source code for the C function (strtol) that converts strings to numbers and here is the key to how it converts a character digit to a number digit: digit_number = digit_character - '0' Yikes! That generates a number digit by treating the character digit as a number and subtracting the number corresponding to the character ‘0’. For example, if the character digit is ‘4’ (hex 34) then when we subtract ‘0’ (hex 30) we get the number 4. Perfect! But ??? only if we allow European digits (0, 1, …, 9). Clearly, if we were to subtract ‘0’ (hex 30) from the Bengali digit 4 we do not get the number 4.
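The arithmetic in the last two paragraphs can be checked directly. The following is a small sketch in Python (not the C source referred to above), using code points from the Unicode code charts; the variable names are made up for illustration.

```python
ascii_four = "4"         # U+0034 DIGIT FOUR
bengali_four = "\u09EA"  # U+09EA BENGALI DIGIT FOUR
bengali_zero = "\u09E6"  # U+09E6 BENGALI DIGIT ZERO

# Subtracting ASCII '0' (hex 30) works for the ASCII digit ...
print(ord(ascii_four) - ord("0"))             # 4
# ... but not for the Bengali digit: 0x09EA - 0x30 = 0x09BA
print(ord(bengali_four) - ord("0"))           # 2490
# Subtracting the zero of the same script does give the digit value.
print(ord(bengali_four) - ord(bengali_zero))  # 4
```

So the subtraction itself is sound; what is specific to ASCII is the choice of which zero to subtract.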
Thus I conclude: * When expressing numbers, the only digits that can be used are the European digits * Unicode is universal, but that universality does not apply to digits or numbers Obviously I am not understanding something correctly. Please help me to understand. /Roger -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Wed Dec 16 11:34:55 2020 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Wed, 16 Dec 2020 18:34:55 +0100 Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?= =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?= In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Dec 16 12:05:52 2020 From: doug at ewellic.org (Doug Ewell) Date: Wed, 16 Dec 2020 11:05:52 -0700 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> Message-ID: <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> abrahamgross wrote: >> Children learn to write with upper case and lower case letters in >> school, and most people continue to use both as adults. (There are >> exceptions of course, some people write only with lower case, and >> some only with upper case.) > > Unicode refused to encode arabic letter variants (not counting > compatibility chars), which are taught in school and adults use it, > and its how arabic is written, so ur argument here doesn't hold water. I'm not sure what to make of that sentence. That's like saying "Unicode refused to encode the capital letter A (not counting U+0041)." 
The compatibility characters are exactly how one is supposed to represent Arabic letter forms outside of their normal context, as described here. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From abrahamgross at disroot.org Wed Dec 16 12:11:52 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Wed, 16 Dec 2020 18:11:52 +0000 (UTC) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: Message-ID: Can't unicode just make an edit and say that the mathematical italic letters can be used for regular english too? (the character names can stay as MATHEMATICAL ITALIC etc, or aliases can be added) From frederic.grosshans at gmail.com Wed Dec 16 12:50:40 2020 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Wed, 16 Dec 2020 19:50:40 +0100 Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?= =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?= In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Wed Dec 16 12:55:37 2020 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Wed, 16 Dec 2020 19:55:37 +0100 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: Message-ID: <01b67baa-d0b4-b321-4c4a-4d6c6d16fbff@gmail.com> An HTML attachment was scrubbed... 
URL: From abrahamgross at disroot.org Wed Dec 16 12:58:51 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Wed, 16 Dec 2020 18:58:51 +0000 (UTC) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <01b67baa-d0b4-b321-4c4a-4d6c6d16fbff@gmail.com> References: <01b67baa-d0b4-b321-4c4a-4d6c6d16fbff@gmail.com> Message-ID: Whoop, That makes sense Dec 16, 2020 1:56:29 PM Frédéric Grosshans via Unicode : > And then, speakers of German languages will ask the encoding of italic ?, Icelandic speakers, ? and ?, French speakers, ? and ?, etc. Because special casing English is quite the opposite of the purpose of Unicode... > > Frédéric > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Wed Dec 16 13:23:09 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 16 Dec 2020 19:23:09 +0000 Subject: Unicode is universal, so how come that universality =?UTF-8?B?ZG9lc27igJl0?= apply to digits? In-Reply-To: References: Message-ID: <20201216192309.6e9fea34@JRWUBU2> On Wed, 16 Dec 2020 16:02:00 +0000 (GMT) William_J_G Overington via Unicode wrote: > Hi > > Well, is the way to make progress that Unicode Inc. could make > available a pseudo-code algorithm that can be converted to various > programming languages that is such that the way that a digit is > derived from the text characters is an algorithm with a structure of > the form > > if (digit_character >= 'A') AND (digit_character <= 'B') then > digit_number := digit_character - 'C' > > elsif (digit_character >= 'D') AND (digit_character <= 'E') then > digit_number := digit_character - 'F' > > elsif ... It looks to me as though some versions of wcstol() already accept a sequence of decimal digits. C11 allows such behaviour. The simple algorithm sketched here won't work for 8-bit char - ISCII Indian digits and TIS-620 Thai digits overlap but do not coincide.
Thus for strtol(), you would need to include the locale. As Frédéric Grosshans has noticed, there is also the issue of digit-sequence spoofing, besides variations of the letter 'O' being harmful. Not every call of strtol() parsing a digit string actually checks that the offered string is in the form of a number. Richard. From richard.wordingham at ntlworld.com Wed Dec 16 13:57:46 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 16 Dec 2020 19:57:46 +0000 Subject: Unicode is universal, so how come that universality =?UTF-8?B?ZG9lc27igJl0?= apply to digits? In-Reply-To: References: Message-ID: <20201216195746.37c2237b@JRWUBU2> On Wed, 16 Dec 2020 18:34:55 +0100 Frédéric Grosshans via Unicode wrote: > It’s quite easy to make a library which parses UnicodeData.txt > (version 13.0 here) and extracts the digit ranges of the various > scripts and converts the various strings into numbers for the 50 > scripts listed in table 22-3 of the standard plus the western digits > (Unicode 13.0 pdf here), it should be reasonably futureproof, in the > sense that parsing future unicode datafile should add scripts as they > are encoded. However, do not forget to check the exceptions in the > text around this table in the relevant script pages: in Unicode > 13.0, it concerns Arabic, which has two sets of digits, Myanmar (3 > sets), and Tai Tham (2 sets). Or just scan UnicodeData.txt for decimal digits with the value 0. Richard.
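The "scan for zeros" suggestion can be sketched as follows. As an assumption, this uses the character database bundled with Python's unicodedata module instead of parsing UnicodeData.txt itself, so the result reflects whichever Unicode version the Python build ships; the function name zero_digits is hypothetical.

```python
import sys
import unicodedata

def zero_digits() -> list[int]:
    """Return the code points of all decimal digits (Nd) whose value is 0."""
    zeros = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch) == 0:
            zeros.append(cp)
    return zeros

zeros = zero_digits()
print(len(zeros), "sets of decimal digits found")
for cp in zeros[:5]:
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
```

Each zero found this way starts a run of ten consecutive code points for the digits 0 to 9, so the list is enough to build a range table. Scripts with more than one set of digits simply show up as more than one zero.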
From marius.spix at web.de Wed Dec 16 14:09:23 2020 From: marius.spix at web.de (Marius Spix) Date: Wed, 16 Dec 2020 21:09:23 +0100 Subject: Aw: RE: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> Message-ID: An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Wed Dec 16 15:32:10 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 16 Dec 2020 13:32:10 -0800 Subject: Aw: RE: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> Message-ID: An HTML attachment was scrubbed... URL: From billposer2 at gmail.com Wed Dec 16 15:32:48 2020 From: billposer2 at gmail.com (Bill Poser) Date: Wed, 16 Dec 2020 13:32:48 -0800 Subject: =?UTF-8?Q?Re=3A_Unicode_is_universal=2C_so_how_come_that_universal?= =?UTF-8?Q?ity_doesn=E2=80=99t_apply_to_digits=3F?= In-Reply-To: <20201216195746.37c2237b@JRWUBU2> References: <20201216195746.37c2237b@JRWUBU2> Message-ID: It seems to me that, in spite of the superficial similarity of the way numbers are written in many languages, this is NOT, in general, a matter of encoding conversion or even transliteration but rather one of translation and therefore not part of Unicode for the same reason that Unicode does not handle the translation of text from, say, Japanese to English. 
There is, actually, a library, which I have written, that handles conversions between Unicode strings and integers for most systems of writing numbers. (I have yet to update it to handle some of the more recently encoded systems.) It is a C library which also has a Tcl binding: http://billposer.org/Software/libuninum.html It handles a number of systems that require algorithms rather different from that of atoi/strtol. Bill On Wed, Dec 16, 2020 at 12:04 PM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Wed, 16 Dec 2020 18:34:55 +0100 > Frédéric Grosshans via Unicode wrote: > > > It’s quite easy to make a library which parses UnicodeData.txt > > (version 13.0 here) and extracts the digit ranges of the various > > scripts and converts the various strings into numbers for the 50 > > scripts listed in table 22-3 of the standard plus the western digits > > (Unicode 13.0 pdf here), it should be reasonably futureproof, in the > > sense that parsing future unicode datafile should add scripts as they > > are encoded. However, do not forget to check the exceptions in the > > text around this table in the relevant script pages: in Unicode > > 13.0, it concerns Arabic, which has two sets of digits, Myanmar (3 > > sets), and Tai Tham (2 sets). > > Or just scan UnicodeData.txt for decimal digits with the value 0. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kent.b.karlsson at bahnhof.se Wed Dec 16 18:46:33 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Thu, 17 Dec 2020 01:46:33 +0100 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> Message-ID: > 16 dec. 2020 kl.
02:14 skrev Sławomir Osipiuk via Unicode : > > On Tue, Dec 15, 2020 at 6:07 PM Kent Karlsson > wrote: >> Now, where did I see something very much like this??? >> Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the “invisible by default” (“default ignorable”) IF parsed correctly). > > ECMA-48 aka ISO 6429 was on my mind the moment I read the OP. I didn't > mention it because it's a bit outdated (even if I do have a fondness It is certainly not outdated. It’s a long time since the last update, but it is not outdated. It is used in EVERY terminal emulator (worthy of the name), granted to varying degrees and varying quality of implementation (but that is another matter). Italics, bold, underline and colouring are popular uses of the formatting part of ECMA-48 in terminals. One could imagine completely reinventing how terminals (i.e. terminal emulators nowadays) work. But that would face massive compatibility issues. My projection is that 1) terminal emulators will continue to be used indefinitely, and 2) they will continue to use ECMA-48 or an extension thereof (there are already some extensions that have been implemented). (That is opposed to Teletext, which still is very much used in practice, but I think that may change in five or ten years’ time.) > for it) and if you're using such a thing, why not a more modern HTML > subset, or BBCode, or any number of other options in use or from the > list the OP gave? There are, after all, so many to choose from. And if Because: 1) They would be incompatible with how terminals work. 2) They cannot work for terminals since there is no clear distinction between what is “markup” and what is not; the distinction today much relies on file type (via name suffix or other mechanism, like document setting or view mode, or “guessing” from reading the beginning of the document). Those mechanisms do not exist in terminals. > none of those satisfy, you can always make your own!
Again, if one were to invent something entirely new (not based on ECMA-48) in this area that still has the potential to be used in terminals, that would face massive compatibility issues with how terminals work today and are expected to work “from the other side of the terminal” (i.e. what programs send to the terminal side). (Yes I know about termcap.) > But that "if parsed correctly" is quite the nit, isn't it? If every terminal (emulator) can handle it (granted, to varying degrees of quality), it does not seem too hard? > >> It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the “default ignorable” part of this a bit easier. > > And this is just the BabelPad solution but applied to a different > protocol. Replacing regular markup by corresponding characters from > the tag block to gain ignorable-ness may seem like a cool idea at > first, but it's just spinning yet another markup. (With no offense In a sense, yes. But the idea to use TAG characters for this has popped up on this list multiple times. So if mapping ECMA-48-ish control sequences to use TAG characters makes ECMA-48-ish formatting control sequences more palatable, then ok. /Kent Karlsson From kent.b.karlsson at bahnhof.se Wed Dec 16 18:46:47 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Thu, 17 Dec 2020 01:46:47 +0100 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> Message-ID: <6F848713-997C-4D81-A326-F37848F3FEF8@bahnhof.se> > 16 dec. 2020 kl. 04:18 skrev Zach Lym : > But that begs the question: if the authors of a rich text standard > can't agree on what counts as semantic, how would Unicode decide?
Eeh, file.html would be an HTML file, intending to interpret HTML tags as markup (unless in a view tags mode of display) for programs/apps that can interpret HTML markup, and regarding RTF or other non-HTML markup as plain text. file.rtf would be an RTF file, intending to interpret RTF markup (unless in a view markup mode of display) for programs/apps that can interpret RTF markup, and regarding HTML or other non-RTF markup as plain text. and so on. There are several other ways of indicating the file “type”, but filename suffix is the most obvious method that is used. So what was the problem, did you say? > What about , , or as I previously suggested > ? was added to HTML because it was the only > styling that could be displayed in plaintext console environments. I’m not sure that history is correct. Anyhow, for terminals (emulators nowadays) underline, bold, italic, and coloring (also in combination) is commonly available. (Even when terminals were monochrome, underline and bold could still be done, even if done in a non-standard way.) Blink is often suppressed in modern terminal emulators (but then can be enabled by a preference setting). /Kent Karlsson -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kent.b.karlsson at bahnhof.se Wed Dec 16 18:47:02 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Thu, 17 Dec 2020 01:47:02 +0100 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <416d509b-b97c-5153-ec4c-aae451570919@ix.netcom.com> References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org> <416d509b-b97c-5153-ec4c-aae451570919@ix.netcom.com> Message-ID: <94D07690-1B8D-4617-9D1A-6ABD164BA07F@bahnhof.se> > 16 dec. 2020 kl.
06:49 skrev Asmus Freytag via Unicode : > > On 12/15/2020 8:19 PM, David Starner via Unicode wrote: >> On Tue, Dec 15, 2020 at 4:47 PM Sławomir Osipiuk via Unicode >> wrote: >>> "Implementations of Unicode that already make use of out-of-band >>> mechanisms for language [or format] tagging or “heavy-weight” in-band >>> mechanisms such as XML or HTML will continue to do exactly what they >>> are doing and will ignore the tag characters completely. They may even >>> prohibit their use to prevent conflicts with the equivalent markup." >> So every single thing that interfaces with HTML now has to handle >> Unicode italics on any plain text input, or silently dump them into >> the stream, and the web browser may have to handle them or not. > ^^^That. Let me paraphrase: “So every single thing that interfaces with HTML now has to handle RTF italics on any plain text input, or silently dump them into the stream, and the web browser may have to handle them or not.” You would not use that as an argument to say that RTF (which I picked just because it is well-known) should be wiped from the face of the Earth? I would think not? (You may want to wipe RTF from the face of the Earth, I don’t know, but you would not use that argument even if you do want that.) Even if, in these threads, the term “plain text formatting” is used (or worse “Unicode formatting”), that is a bit misleading (of course). I don’t think these proposals should be applied to text data of the “type” “text/plain” (or as a filename suffix, “.txt”), nor such things as filenames themselves, and of course not to “text/html”/“.html”, nor to “application/pdf”/“.pdf”, nor to “application/rtf”/“.rtf”, etc. 
One should be using (a) new file type(s), POSSIBLY (if one can agree on a single one) even apply it to “text/plain”/“.txt” (but not to HTML, RTF, etc., and not (I would say) to filenames or similar, such markup should not even be permitted in filenames and similar; note: “should...”, not “are...”).
The point being that the markup would be default-ignorable, and thus normally “invisible” when not interpreted, even in a “plain” text file. Granted, the ECMA-48 approach (if not mapping to TAG characters) would need a bit of “extending” the default-ignorability property to certain follow-on characters (that normally are printable) after ESC and CSI (terminal emulators do that all the time, and have done so for decades, so it is nothing revolutionary). That is, that the markup does not “hijack” normal printable characters for its markup syntax; if ECMA-48 had been done today I think it would use default-ignorable characters throughout the ESC- and CSI-sequences, not just for the lead character. (Plus, I think that no use of out-of-band stylesheets is also a point. Plus that some argue for excessive “bare-boned-ness”; but I don’t agree with that.) That is my take on this issue at least. ---- > hardcoding > > visual appearance is really the least helpful, because that totally > undercuts the ability for style sheets to address presentation. Yes, but… Re. ECMA-48 (which we touched on in this thread), there the styling is really “hardcoded”, and there are no style sheets. For ECMA-48 (which is still very much in use, and extensions are being implemented), I don’t think it would be a good idea to introduce any (separate) style sheets of any kind. It is not at all geared for that, and re-gearing it for that would not be a good idea to do (IMHO). Similarly for any “plain text” (“low level”, really) formatting proposal other than ECMA-48. But for HTML and similar, fine; stylesheets are great! /Kent Karlsson -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Wed Dec 16 19:16:48 2020 From: mark at kli.org (Mark E.
Shoulson) Date: Wed, 16 Dec 2020 20:16:48 -0500 Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?= =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?= In-Reply-To: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> References: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> Message-ID: <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Wed Dec 16 23:36:53 2020 From: prosfilaes at gmail.com (David Starner) Date: Wed, 16 Dec 2020 21:36:53 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <94D07690-1B8D-4617-9D1A-6ABD164BA07F@bahnhof.se> References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com> <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com> <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org> <416d509b-b97c-5153-ec4c-aae451570919@ix.netcom.com> <94D07690-1B8D-4617-9D1A-6ABD164BA07F@bahnhof.se> Message-ID: On Wed, Dec 16, 2020 at 4:54 PM Kent Karlsson via Unicode wrote: > On 12/15/2020 8:19 PM, David Starner via Unicode wrote: > >> On Tue, Dec 15, 2020 at 4:47 PM Sławomir Osipiuk via Unicode >> wrote: > >>> "Implementations of Unicode that already make use of out-of-band >>> mechanisms for language [or format] tagging or “heavy-weight” in-band >>> mechanisms such as XML or HTML will continue to do exactly what they >>> are doing and will ignore the tag characters completely. They may even >>> prohibit their use to prevent conflicts with the equivalent markup." > >> So every single thing that interfaces with HTML now has to handle >> Unicode italics on any plain text input, or silently dump them into >> the stream, and the web browser may have to handle them or not.
> > Let me paraphrase: > > “So every single thing that interfaces with HTML now has to handle RTF italics on any plain text input, > or silently dump them into the stream, and the web browser may have to handle them or not.” > > You would not use that as an argument to say that RTF (which I picked just because it is well-known) > should be wiped from the face of Earth? I would think not? (You may want to wipe RTF from the face > of the Earth, I don’t know, but you would not use that argument even if you do want that.) I wouldn't use that argument because it makes no sense. RTF and HTML are at the same level. Plain text (and Unicode specifically, for HTML) are at a lower, underlying level. If you want to make another rich text format, it's no skin off my nose. It is completely off-topic on this list, though. This list is about Unicode and changes thereto. > Similarly for any “plain text” (“low level”, really) > formatting proposal other than ECMA-48. Exactly. They're not "plain text". So why are low-level formatting proposals relevant to this list at all? ECMA-48 is not plain text. -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble.
-- _Pascal_, ISO 7185 (1991) From duerst at it.aoyama.ac.jp Thu Dec 17 02:22:14 2020 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=) Date: Thu, 17 Dec 2020 17:22:14 +0900 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> Message-ID: <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> On 17/12/2020 03:05, Doug Ewell via Unicode wrote: > abrahamgross wrote: > >>> Children learn to write with upper case and lower case letters in >>> school, and most people continue to use both as adults. (There are >>> exceptions of course, some people write only with lower case, and >>> some only with upper case.) >> >> Unicode refused to encode arabic letter variants (not counting >> compatibility chars), which are taught in school and adults use it, >> and its how arabic is written, so ur argument here doesn't hold water. > > I'm not sure what to make of that sentence. That's like saying "Unicode refused to encode the capital letter A (not counting U+0041)." > > The compatibility characters are exactly how one is supposed to represent Arabic letter forms outside of their normal context, as described here. Not necessarily. The 'official' way of representing specific contextual Arabic letter forms outside of their usual context is to prefix or postfix them with the appropriate JOINER or NON-JOINER characters. So there is indeed a non-compatibility encoding for these letter variants in Unicode, even if they appear out of context. 
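As a rough illustration of the two encodings being contrasted, the sketch below builds the four contextual forms of one Arabic letter with joiners and compares them with the compatibility presentation forms. The choice of ARABIC LETTER HEH and the dictionary names are arbitrary assumptions made for this example, and whether the joiner sequences actually display as the intended forms depends on the font and the rendering engine.

```python
import unicodedata

HEH = "\u0647"  # ARABIC LETTER HEH
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Requesting a contextual form out of context by surrounding the letter with joiners.
joiner_forms = {
    "isolated": HEH,
    "final": ZWJ + HEH,
    "initial": HEH + ZWJ,
    "medial": ZWJ + HEH + ZWJ,
}

# The corresponding compatibility characters (Arabic Presentation Forms-B).
compat_forms = {
    "isolated": "\uFEE9",
    "final": "\uFEEA",
    "initial": "\uFEEB",
    "medial": "\uFEEC",
}

for form, text in joiner_forms.items():
    code_points = " ".join(f"U+{ord(c):04X}" for c in text)
    compat = compat_forms[form]
    folded = unicodedata.normalize("NFKC", compat)
    print(f"{form:9} joiners: {code_points:22} "
          f"compat: U+{ord(compat):04X} -> NFKC U+{ord(folded):04X}")
```

Compatibility normalization folds each presentation form back to the plain letter, which is one reason the joiner sequences are the more robust way to keep the distinction in running text.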
What's of course more important is that in their usual context (and that's the way they are usually taught and used), these contextual variants don't need to be encoded because both humans and computers can do the shaping 'automatically'. Neither something like JOINERS, nor context work as well for the upper case / lower case distinction, and that's why it's fair to say that one reason for encoding this distinction (in Unicode as well as in many predecessor encodings) is that the distinction is learned in school and made in handwriting. Regards, Martin. From abrahamgross at disroot.org Thu Dec 17 08:41:28 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Thu, 17 Dec 2020 14:41:28 +0000 (UTC) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> Message-ID: <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> Microsoft Word does a very good job auto capitalizing, so the same internal dictionary that Word uses can also be used by OpenType to shape lowercase into uppercase. for the edge cases where you want uppercase when it doesn't automatically do it, you can use opentype alternate variants, or some other char (like the joiners or something) Dec 17, 2020 3:22:29 AM Martin J. Dürst : > Neither something like JOINERS, nor context work as well for the upper case / lower case distinction, and that's why it's fair to say that one reason for encoding this distinction (in Unicode as well as in many predecessor encodings) is that the distinction is learned in school and made in handwriting.
> From asmusf at ix.netcom.com Thu Dec 17 13:28:22 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 17 Dec 2020 11:28:22 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> Message-ID: <4c63b13d-59bd-fd87-4a56-bb92a242691c@ix.netcom.com> An HTML attachment was scrubbed... URL: From chaw at eip10.org Fri Dec 18 10:49:26 2020 From: chaw at eip10.org (Sudarshan S Chawathe) Date: Fri, 18 Dec 2020 11:49:26 -0500 Subject: Interpretation of emoji-ordering-rules.txt Message-ID: <13627.1608310166@localhost> I would be grateful if someone could point me to a good reference for the syntax and semantics of the rules used to describe the emoji ordering at the following: https://www.unicode.org/emoji/charts-13.1/emoji-ordering-rules.txt Regards, -chaw From markus.icu at gmail.com Fri Dec 18 11:42:35 2020 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 18 Dec 2020 09:42:35 -0800 Subject: Interpretation of emoji-ordering-rules.txt In-Reply-To: <13627.1608310166@localhost> References: <13627.1608310166@localhost> Message-ID: On Fri, Dec 18, 2020 at 9:06 AM Sudarshan S Chawathe via Unicode < unicode at unicode.org> wrote: > I would be grateful if someone could point me to a good reference for > the syntax and semantics of the rules used to describe the emoji > ordering at the following: > > https://www.unicode.org/emoji/charts-13.1/emoji-ordering-rules.txt Overview: https://www.unicode.org/reports/tr51/#Sorting Collation tailoring syntax: 
https://www.unicode.org/reports/tr35/tr35-collation.html#Rules The emoji ordering is also provided as part of CLDR: https://github.com/unicode-org/cldr/blob/master/common/collation/root.xml#L950 And, like much of CLDR, it is then available via the ICU C/C++/Java libraries, through a Collator for language tag "und-u-co-emoji". http://site.icu-project.org/ https://unicode-org.github.io/icu/userguide/collation/ https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1Collator.html https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/Collator.html Best regards, markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From copypaste at kittens.ph Fri Dec 18 21:37:00 2020 From: copypaste at kittens.ph (Fredrick Brennan) Date: Fri, 18 Dec 2020 19:37:00 -0800 Subject: Adlam Message-ID: <176791270b7.10fb4e49546040.1392071428347243955@kittens.ph> Often when other scripts are discussed, Adlam is used comparatively, as in, "well we want to avoid what happened with Adlam", or "we have had painful experience with this with Adlam". I know some of the issues in Adlam, but if anyone has the time, I (and hopefully others!) would benefit from a retelling of the "Adlam in Unicode" story. I know in the end it's a very happy story, but I'm especially curious about the bumps along the road. Best, Fred Brennan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From otto.stolz at uni-konstanz.de Sat Dec 19 06:42:33 2020 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Sat, 19 Dec 2020 13:42:33 +0100 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> Message-ID: <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Hello, on 2020-12-17 at 15:41, abrahamgross--- via Unicode wrote: > Microsoft Word does a very good job auto capitalizing, so the same internal dictionary that Word uses can also be used by OpenType to shape lowercase into uppercase. for the edge cases where you want uppercase when it doesn't automatically do it, you can use opentype alternate variants, or some other char (like the joiners or something) Whatever MS Word does, it cannot decide the correct spelling in many cases, as the casing may well make a semantic difference. For example, you may well serve a turkey for dinner, but never a Turkey. A notorious German example: Er hat in Moskau liebe Genossen. (= He’s got dear comrades in Moscow) Er hat in Moskau Liebe genossen. (= He has enjoyed love in Moscow) (And I assure you, the prosody varies accordingly, hence the difference is quite clear in speech, and must be preserved in writing.) As only the author (and no other stage, be it human or automatic) can know the intended meaning, Unicode is quite right when encoding the case distinction. 
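As a minimal sketch of the point above (Python is chosen here only for illustration, and the variable names are assumptions, not anything from the thread): the two German sentences differ only in casing, so they are distinct character sequences in plain text, while a case-insensitive comparison treats them as equal and so discards exactly the distinction the author intended.

```python
# Illustrative sketch only; the variable names are hypothetical.
comrades = "Er hat in Moskau liebe Genossen."   # "dear comrades"
love = "Er hat in Moskau Liebe genossen."       # "enjoyed love"

# Case is carried by the characters themselves, so plain text keeps the two apart.
distinct_as_written = comrades != love

# Case folding (str.casefold) removes the distinction, so a case-insensitive
# match can no longer tell the two meanings apart.
equal_when_folded = comrades.casefold() == love.casefold()

print(distinct_as_written, equal_when_folded)
```

Nothing here depends on markup or on a dictionary: the information survives because it is in the encoded characters.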
Best wishes, Otto From prosfilaes at gmail.com Sun Dec 20 01:23:31 2020 From: prosfilaes at gmail.com (David Starner) Date: Sat, 19 Dec 2020 23:23:31 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: On Sat, Dec 19, 2020 at 4:49 AM Otto Stolz via Unicode wrote: > A notorious German example: > Er hat in Moskau liebe Genossen. (= He?s got dear comrades at Moskow) > Er hat in Moskau Liebe genossen. (= He has enjoyed love at Moskow) > (And I assure you, the prosody varies accordingly, hence the > difference is quite clear in speech, and must be preserved > in writing.) She _loves_ him !?! (= I can't believe her emotion towards him is love.) She loves _him_ !?! (= I can't believe that he is the one she loves, and not someone else.) And the prosody varies accordingly, and any accurate preservation in writing would need to record the difference. > As only the author (and no other stage, be it human or automatic) can > know the intended meaning, Unicode is quite right when encoding the case > distinction. Meh. I could come up with similar examples, though probably a bit more contrived, for just about every bit of markup. Italics/emphasis has a bunch of pretty clear meaning changes, like the example above, possibly more than casing in English. Fraktur/Antiqua mixing allows for any number of examples; "Er was clever." 
is different from "Er was clever".* Casing certainly had more of an argument to be encoded in the character set than italics, historically, but I can imagine an alternate history, maybe one the leaders in computing history used a non-casing script, where casing was relegated to markup, and a lot of issues would be easier--no more problems with case-insensitive matching, and the Turkish i would be a font difference under markup. * Italics marking in English could serve the same role in making a bunch of examples; e.g. "The French man said to stop at the coin" and "The French man said to stop at the coin." mean different things. -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991) From indolering at gmail.com Sun Dec 20 13:55:53 2020 From: indolering at gmail.com (Zach Lym) Date: Sun, 20 Dec 2020 11:55:53 -0800 Subject: =?UTF-8?Q?Re=3A_Unicode_is_universal=2C_so_how_come_that_universal?= =?UTF-8?Q?ity_doesn=E2=80=99t_apply_to_digits=3F?= In-Reply-To: <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> References: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> Message-ID: I don't think it's fair to dismiss this as "not a unicode problem." As the OP pointed out, support for non-latin variable names is largely due to Unicode's identity standard and extensive implementation advice. The section on numbering (5.5) is only a page long and essentially recommends handling decimal based numbering systems. There isn't nearly as much care given to this topic. There is a standard annex on mathematics, but that is in PDF form and is largely concerned with parsing and display of mathematical formulas. However, as is the answer to most questions, it is a matter of time and money. 
If someone is willing to spend the time expanding 5.5 or writing a new annex, I am sure the Unicode committee would be happy to review it. Would you be interested in doing that legwork? I'm actually pretty new here; what's the best way Roger could contribute to make Unicode better in this regard? Thanks, -Zach Lym On Wed, Dec 16, 2020 at 5:23 PM Mark E. Shoulson via Unicode < unicode at unicode.org> wrote: > On 12/16/20 10:40 AM, Doug Ewell via Unicode wrote: > > What I don't understand here is why this is being framed implicitly as a Unicode problem, or an XML problem, or a general law of nature ("why can’t a Bengali-speaking person use the Bengali digits"), instead of an inherent limitation of that particular library function used for that particular language. > > Yes, exactly. This is "a characteristic of the code libraries, not a > Unicode problem." > > > There are probably reasonable reasons not to update the actual atol/strtol > calls, but one could certainly write a library to do what you're talking > about... and apparently someone has, by Bill Poser's report of his > libuninum. There ya go. > > > ~mark > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sun Dec 20 15:40:14 2020 From: doug at ewellic.org (Doug Ewell) Date: Sun, 20 Dec 2020 14:40:14 -0700 Subject: Unicode is universal, so how come that universality doesn’t apply to digits? Message-ID: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com> Zach Lym wrote: > I don't think it's fair to dismiss this as "not a unicode problem." > As the OP pointed out, support for non-latin variable names is largely > due to Unicode's identity standard and extensive implementation > advice. I don't recall Roger saying anything about non-Latin variable names. He wrote: > Why, for example, can’t a Bengali-speaking person create XML such as > this: > ?? 
> or write a program assignment statement like this: > ??????_????? = ?? This doesn't claim that the Bengali variable name ??????_????? is not supported, but rather the mixed Bengali/Oriya constant ??. In fact, a few lines earlier Roger wrote: > a Bengali-speaking person can write this: > ??????_????? = 42 so variable names aren't the issue. > The section on numbering (5.5) is only a page long and essentially > recommends handling decimal based numbering systems. There isn't > nearly as much care given to this topic. Bengali and Oriya are decimal-based. (Whether they should be used together in a single number is another matter.) The first paragraph of Section 5.5 specifically discusses interpreting Devanagari digits as one would interpret Basic Latin digits. I don't know what needs to be added here. > There is a standard annex on mathematics, but that is in PDF form and > is largely concerned with parsing and display of mathematical > formulas. UTR #25 (a Technical Report, not a Standard Annex) does focus on Basic Latin digits, at one point (2.2) claiming that Basic Latin digits are essentially the only digits used in math, but it's true that the UTR is about math notation and that isn't really in scope here. The fact that the UTR is a PDF document doesn't seem pertinent. > However, as is the answer to most questions, it is a matter of time > and money. If someone is willing to spend the time expanding 5.5 > writing a new annex, I am sure the Unicode committee would be happy to > review it. Would you be interested in doing that legwork? Again, I don't see what is lacking in Section 5.5, especially considering its Devanagari example. The legwork that needs to be done is to make implementations more internationalized and more Unicode-aware. 
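To make the point about more Unicode-aware implementations concrete, here is a minimal sketch in Python (not from the thread; the function name parse_decimal and its policy of rejecting mixed digit sets are assumptions made for illustration). It reads a decimal number written with any one set of Unicode decimal digits and refuses a number that mixes digit sets, such as the Bengali/Oriya constant discussed above. It relies on decimal digits being encoded in contiguous runs from 0 to 9, so the code point of a set's zero identifies the set.

```python
import unicodedata


def parse_decimal(text: str) -> int:
    """Parse a non-negative integer written with one set of Unicode decimal digits.

    Hypothetical helper for illustration; raises ValueError for non-digits
    and for digits drawn from more than one digit set.
    """
    if not text:
        raise ValueError("empty string")
    value = 0
    zero = None  # code point of DIGIT ZERO in the digit set seen so far
    for ch in text:
        digit = unicodedata.decimal(ch, None)  # None if ch is not a decimal digit
        if digit is None:
            raise ValueError(f"not a decimal digit: {ch!r}")
        # Decimal digits sit in contiguous 0..9 runs, so subtracting the digit
        # value from the code point yields that set's zero.
        this_zero = ord(ch) - digit
        if zero is None:
            zero = this_zero
        elif this_zero != zero:
            raise ValueError("digits from more than one digit set")
        value = value * 10 + digit
    return value


print(parse_decimal("42"))            # Basic Latin digits
print(parse_decimal("\u09EA\u09E8"))  # BENGALI DIGIT FOUR, BENGALI DIGIT TWO
print(parse_decimal("\u096A\u0968"))  # DEVANAGARI DIGIT FOUR, DEVANAGARI DIGIT TWO
```

Python's built-in int() already accepts Unicode decimal digits, so the explicit loop is only there to show where a single-digit-set check could go; whether mixed sets should be rejected at all is a policy choice, not something the character properties decide.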
-- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From asmusf at ix.netcom.com Sun Dec 20 17:13:01 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 20 Dec 2020 15:13:01 -0800 Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?= =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?= In-Reply-To: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com> References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com> Message-ID: <417fce02-95b0-dd8f-49aa-8e056c033e93@ix.netcom.com> An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Mon Dec 21 03:08:08 2020 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=) Date: Mon, 21 Dec 2020 18:08:08 +0900 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: Hello David, others, On 20/12/2020 16:23, David Starner via Unicode wrote: > On Sat, Dec 19, 2020 at 4:49 AM Otto Stolz via Unicode > wrote: >> A notorious German example: >> Er hat in Moskau liebe Genossen. (= He?s got dear comrades at Moskow) >> Er hat in Moskau Liebe genossen. (= He has enjoyed love at Moskow) >> (And I assure you, the prosody varies accordingly, hence the >> difference is quite clear in speech, and must be preserved >> in writing.) > > She _loves_ him !?! (= I can't believe her emotion towards him is love.) > She loves _him_ !?! (= I can't believe that he is the one she loves, > and not someone else.) 
> > And the prosody varies accordingly, and any accurate preservation in > writing would need to record the difference. I think the above "and must be preserved in writing" is easy to misunderstand, as it is a bit too strong. It wouldn't have been preserved on very early computers (or earlier, in telegrams) that only used upper case. But there was a very strong expectation that it would be preserved on things as simple as a typewriter, and definitely also in handwriting. On the other hand, there is no such expectation for your example. If prosody has to be reconstructed, that might happen e.g. from context (e.g. in a playscript), or the sentences might have been rewritten for clarity in the first place. I don't think there is a single writing system that is able to denote every aspect of spoken language. When compared with spoken language, most writing systems leave something out. (Some may also add something, e.g. a distinction between some homonyms.) >> As only the author (and no other stage, be it human or automatic) can >> know the intended meaning, Unicode is quite right when encoding the case >> distinction. > > Meh. I could come up with similar examples, though probably a bit more > contrived, for just about every bit of markup. Italics/emphasis has a > bunch of pretty clear meaning changes, like the example above, > possibly more than casing in English. Fraktur/Antiqua mixing allows > for any number of examples; "Er was clever." is > different from "Er was clever".* Casing certainly > had more of an argument to be encoded in the character set than > italics, historically, Exactly. > but I can imagine an alternate history, maybe > one the leaders in computing history used a non-casing script, where > casing was relegated to markup, and a lot of issues would be > easier--no more problems with case-insensitive matching, and the > Turkish i would be a font difference under markup. An alternate history indeed. 
The history we followed gave us italics relegated to markup, and avoided the problems with italic-insensitive matching. And please note that your alternate history does NOT lead to technology that encodes italics separately. [And that I was perfectly able to put stress on a word in the previous sentence without italics, even if the main purpose of that was just to make a point.] Also, it's not clear that encoders starting with a non-casing script would have decided to relegate casing to markup. It's pretty annoying to mark up single letters, and to change the markup when a word moves to the start of a sentence, and these are the main uses for upper case. > * Italics marking in English could serve the same role in making a > bunch of examples; e.g. "The French man said to stop at the coin" and > "The French man said to stop at the coin." mean different > things. The important thing here is "could". Unicode doesn't invent writing systems. And I have to admit that I don't understand the difference between these two sentences even with your italic markup. But that may be only me. Regards, Martin. From prosfilaes at gmail.com Mon Dec 21 03:48:02 2020 From: prosfilaes at gmail.com (David Starner) Date: Mon, 21 Dec 2020 01:48:02 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: On Mon, Dec 21, 2020 at 1:10 AM Martin J. Dürst via Unicode wrote: > > She _loves_ him !?! (= I can't believe her emotion towards him is love.) > > She loves _him_ !?! 
(= I can't believe that he is the one she loves, > > and not someone else.) > > > > And the prosody varies accordingly, and any accurate preservation in > > writing would need to record the difference. > > I think the above "and must be preserved in writing" is easy to > misunderstand, as it is a bit too strong. It wouldn't have been > preserved on very early computers (or earlier, in telegrams) that only > used upper case. But there was a very strong expectation that it would > be preserved on things as simple as a typewriter, and definitely also in > handwriting. Er, but that's a different argument altogether. An expectation that it be preserved is entirely different from "any accurate preservation in writing would need to record the difference." > On the other hand, there is no such expectation for your example. If > prosody has to be reconstructed, that might happen e.g. from context > (e.g. in a playscript), or the sentences might have been rewritten for > clarity in the first place. I'd say there's certainly an expectation that emphasis be preserved in those statements in some way. If those were real statements, one cannot simply rewrite them, and if they were used in fiction, rewriting would change the colloquial effect. > And please note that your alternate history does NOT lead to > technology that encodes italics separately. Sure. The response was about the silliness of the argument, not an argument for italics being encoded. > [And that I was perfectly > able to put stress on a word in the previous sentence without italics, > even if the main purpose of that was just to make a point.] You also could have written the sentence in all caps. > > * Italics marking in English could serve the same role in making a > > bunch of examples; e.g. "The French man said to stop at the coin" and > > "The French man said to stop at the coin." mean different > > things. > > The important thing here is "could". Unicode doesn't invent writing > systems. 
And I have to admit that I don't understand the difference > between these two sentences even with your italic markup. But that may > be only me. I could create many examples where the italics distinguish the meaning, because, like the Fraktur/Antiqua example, one use of italics in English is to denote foreign words. English "coin" and French "coin" are false friends; the first sentence says to stop at the coin, and the second says to stop at the corner. -- The standard is written in English . If you have trouble understanding a particular section, read it again and again and again . . . Sit up straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185 (1991) From frederic.grosshans at gmail.com Mon Dec 21 04:10:09 2020 From: frederic.grosshans at gmail.com (Frédéric Grosshans) Date: Mon, 21 Dec 2020 11:10:09 +0100 Subject: Re: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: References: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> Message-ID: An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Mon Dec 21 04:40:44 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Dec 2020 02:40:44 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: <7349b420-3a2a-2c80-9f78-bba839d9ec63@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From lyratelle at gmx.de Mon Dec 21 05:21:27 2020 From: lyratelle at gmx.de (Dominikus Dittes Scherkl) Date: Mon, 21 Dec 2020 12:21:27 +0100 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: On 20.12.20 at 08:23, David Starner via Unicode wrote: > On Sat, Dec 19, 2020 at 4:49 AM Otto Stolz via Unicode > wrote: >> A notorious German example: >> Er hat in Moskau liebe Genossen. (= He’s got dear comrades at Moskow) >> Er hat in Moskau Liebe genossen. (= He has enjoyed love at Moskow) >> (And I assure you, the prosody varies accordingly, hence the >> difference is quite clear in speech, and must be preserved >> in writing.) > > She _loves_ him !?! (= I can't believe her emotion towards him is love.) > She loves _him_ !?! (= I can't believe that he is the one she loves, > and not someone else.) > > And the prosody varies accordingly, and any accurate preservation in > writing would need to record the difference. Prosody is a wholly different thing, as others already mentioned. But in fact, you DID preserve it - in plain text - by adding an underscore before and after the word with emphasis. You could also have used ' or " or even * for the same effect, but nevertheless it is already possible to preserve the special intent of the author _without_ any further additions. Also even with italics allowed (and maybe bold or other style features) this does not indicate _what_ was special about the highlighted words. Was it emphasis? Or did it indicate a thought? 
Or a special meaning of an ambiguous word? Or whatever else? - all this would need further agreement or conventions, which are not standardized so far. -- Dominikus Dittes Scherkl From richard.wordingham at ntlworld.com Mon Dec 21 05:27:44 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 21 Dec 2020 11:27:44 +0000 Subject: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: <417fce02-95b0-dd8f-49aa-8e056c033e93@ix.netcom.com> References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com> <417fce02-95b0-dd8f-49aa-8e056c033e93@ix.netcom.com> Message-ID: <20201221112744.2c88dace@JRWUBU2> On Sun, 20 Dec 2020 15:13:01 -0800 Asmus Freytag via Unicode wrote: > Those data may not support parsing or formatting arbitrary > mixed-script digit combinations. That is also OK, because the data is > geared towards getting the ordinary use of numbers correct for as > many locales and languages, not to deal with fanciful stuff that > doesn't have a real-life user community using it in daily life. I can imagine a few situations where mixed sequences may occur. Firstly, the early non-Indian Unicode usage of Tamil script place notation would have required that the 'digit zero' come from another script, as Unicode initially only supported Indian Tamil script usage, which lacks a zero. Secondly, but not strictly an example, it seems that the Lao style of the Tai Tham script will mix the use of the two digit sets. I wouldn't be surprised at the use of eclectic mixes of Arabic digits at the eastern end of the Arabic script domain. The glyph shapes of the EXTENDED ARABIC-INDIC digits are language-dependent, and language-dependence has only recently hit mainstream rendering for the masses. I wouldn't be surprised to find mixed selections in use in the Union of Burma. 
That could be a big nuisance, because the three series of digits provide some opportunity for digits to spoof digits! Richard. From wjgo_10009 at btinternet.com Mon Dec 21 08:17:23 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 21 Dec 2020 14:17:23 +0000 (GMT) Subject: Expressing thoughts in plain text (from Re: Italics get used to express important semantic meaning, so unicode should support them) In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: <6960bdc0.5ad.17685a97278.Webtop.220@btinternet.com> Dominikus Dittes Scherkl wrote as follows. > Also even with italics allowed (and maybe bold or othere style > features) this does not indicate _what_ was special about the > highlighted words. Was it emphasis? Or indicated a thought? Or a > special meaning of an ambiguous word? Or whatever else? - all this > would need further agreement or conventions, which are not > standardized so far. In my novels I express a character's thoughts, as contrasted with his or her spoken words, by using single quotes for thoughts and double quotes for spoken words. In fact, the desktop publishing software that I use automatically substitutes smart quotes, provided that the font has them: The font that I usually use for text has smart quotes. As far as I am aware this is just my own way of writing, though it is possible that I saw it somewhere years ago and it was in my memory somewhere and that memory influenced me. It may perhaps be non-standard, but it seems to work fine. I publish my novels myself in pure electronic format. 
William Overington Monday 21 December 2020 -------------- next part -------------- An HTML attachment was scrubbed... URL: From abrahamgross at disroot.org Mon Dec 21 13:05:55 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Mon, 21 Dec 2020 19:05:55 +0000 (UTC) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: It's also possible to come up with a system to use double letters or, let's say, exclamation point+letter instead of any uppercase letters. I can write !belgium or bbelgium instead of Belgium and get ppl to agree to do it and then I wouldn't need italics. The only reason why things like _italics_ or *italics* are around is because of the lack of real italics. I would go as far as to say that the very existence of *italics* in plain text shows that there's a real need for italics when writing plain text. This is a workaround around a real problem of the lack of italics if I've ever seen one. Dec 21, 2020 6:22:26 AM Dominikus Dittes Scherkl via Unicode : > Prosody is a wholly different thing, as others already mentioned. > But in fact, you DID preserve it - in plain text - by adding an > underscore before and after the word with emphasis. You could also have > used ' or " or even * for the same effect, but nevertheless it is > already possible to preserve the special intent of the author _without_ > any further additions. 
> From sosipiuk at gmail.com Mon Dec 21 13:20:25 2020 From: sosipiuk at gmail.com (Sławomir Osipiuk) Date: Mon, 21 Dec 2020 14:20:25 -0500 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: On Mon, Dec 21, 2020 at 6:23 AM Dominikus Dittes Scherkl via Unicode wrote: > > But in fact, you DID preserve it - in plain text - by adding an > underscore before and after the word with emphasis. You could also have > used ' or " or even * for the same effect, but nevertheless it is > already possible to preserve the special intent of the author _without_ > any further additions. This doesn't hold water. People can cobble together methods of conveying meaning. It doesn't mean they're ideal, good, or even acceptable. The use of underscores, asterisks, and whatnot to indicate emphasis is a hack to fit with the limitations imposed by technology. By the same logic, one could argue that ¿ and ¡ didn't need to be encoded for the benefit of Spanish users, because they COULD just use ordinary ? and ! and they would still be understood. I can use an axe to bang nails into a wall, but it's silly to say I don't REALLY need a hammer. As a mildly interesting aside, technical limitations of print have driven changes to language before. It's partly the reason why þ (thorn) is no longer part of the English alphabet. It's still not an excuse for doing similar things today. 
In a few more decades underscores and asterisks may become fully accepted punctuation, resulting from the limits we currently have in plain text. Technology should adapt to us, not the other way around. Indeed, I would argue that the use of such "human-readable markup" is evidence FOR the inclusion of basic formatting in plain text. There is such demand for it that people are willing to settle for inelegant hacks to get their meaning across. > Also even with italics allowed (and maybe bold or other style features) > this does not indicate _what_ was special about the highlighted words. > Was it emphasis? Or indicated a thought? Or a special meaning of an > ambiguous word? Or whatever else? - all this would need further > agreement or conventions, which are not standardized so far. Newspapers often italicize words and they're clearly following some (possibly internal) standard. The example of novels has already been given. There are conventions for such things, often varying by medium and language. The precise meaning of formatting does NOT need to be standardised by Unicode to make it available as a tool. Sławomir Osipiuk From kenwhistler at sonic.net Mon Dec 21 13:42:54 2020 From: kenwhistler at sonic.net (Ken Whistler) Date: Mon, 21 Dec 2020 11:42:54 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: On 12/21/2020 11:05 AM, abrahamgross--- via Unicode wrote: > The only reason why things like _italics_ or *italics* are around is because of the lack of real italics. 
I would go as far as to say that the very existence of*italics* in plain text shows that theres a real need for italics when writing plain text. > This is a workaround around a real problem of the lack of italics if I've ever seen one? Actually, simple markup conventions like that mostly date from early days of email, when plain text (and usually just ASCII at that) were all you got. (By the way, the most usual interpretation of those is _underscore_, /italic/, and *bold*, but whatever.) Nowadays, presto chango, most email clients support rich text (in HTML, usually), and you get to _underscore_, /italicize/, and *bold* your text correctly whenever you want to, and even change the font size to SHOUT, if you want. Some folks here seem to be viewing the "problem" here the wrong way round. The issue isn't that plain text cannot preserve all the "meaning" conveyed in writing systems. When dealing with meaning conveyed with conventions that involve styling, font change, color and such, you simply depend on properly tiered text architecture and build support for that in rich text and markup. It is ass-backwards to try to continue to clot up plain text as the backbone of text interchange by trying to import all the complications of styling directly into it as if that representation were a plain text issue -- it isn't. Instead the *real* problem here is that in some communication contexts that should be supporting rich text, implementations are still restricting people to plain text when what they really want is easily accessible and dependable rich text to convey more nuances accurately (or just to be more expressive). If Twitter is half-assed about supporting text styling, then direct your concerns in the proper direction. You don't fix Twitter's or texting apps' use of text by trying to force styling into the Unicode encoding of plain text. --Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmusf at ix.netcom.com Mon Dec 21 14:58:04 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 21 Dec 2020 12:58:04 -0800 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: <2408ce9d-aa27-af56-09cf-3a0a5fc80e24@ix.netcom.com> An HTML attachment was scrubbed... URL: From jameskass at code2001.com Mon Dec 21 16:08:50 2020 From: jameskass at code2001.com (James Kass) Date: Mon, 21 Dec 2020 22:08:50 +0000 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <2408ce9d-aa27-af56-09cf-3a0a5fc80e24@ix.netcom.com> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> <2408ce9d-aa27-af56-09cf-3a0a5fc80e24@ix.netcom.com> Message-ID: <3a3f8743-efdd-6d66-d4ce-788d53017bad@code2001.com> On 2020-12-21 8:58 PM, Asmus Freytag via Unicode wrote: > On 12/21/2020 11:20 AM, S?awomir Osipiuk via Unicode wrote: > > I can use an axe to bang nails into a wall, but it's silly to say I > > don't REALLY need a hammer. > > To paraphrase Ken: if you need rich text, you really need rich text, so go out > and tackle those that force you to use plain text instead. 
> > A./ > Some of us choose to use plain-text rather than to be forced into using rich-text. Written communication is of, by, and for human beings. Regardless of the media used to exchange that communication or the tools used to produce it. As human beings (the inventors and owners of the graphic symbols used in written communication), it is our birthright to insert any graphic character whatsoever in our written communication for any purpose we deem fit. Earnestly or whimsically. It’s fair use - never abuse. We don’t need anyone’s permission to exercise that birthright. People have been using and repurposing each other’s graphic symbols since day one. That’s how writing evolves. Twitter users are already using the Latin italic letters (which had been repurposed as math symbols) encoded in Unicode to convey the notion that their authorial intention was to deploy Latin italic letters. Their (the Twitter users) needs are being well served by the existing Unicode repertoire. If Twitter thought that this was some kind of problem, or if Twitter users were /really/ clamoring for rich-text, then Twitter would have acted long ago. From indolering at gmail.com Mon Dec 21 19:00:12 2020 From: indolering at gmail.com (Zach Lym) Date: Mon, 21 Dec 2020 17:00:12 -0800 Subject: Re: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com> References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com> Message-ID: > I don't recall Roger saying anything about non-Latin variable names. We agree that non-latin variable names are not the issue, I just worded my response clumsily ¯\_(ツ)_/¯ So ... why isn't the treatment of parsing numbers as good as variable names?
Well, to cite Conway's Law, "Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." The identifier standard annex is ~30 pages of polished hand-holding for a language implementor: it provides examples, gets into parsing, gives advice on customization, and explains tricky issues such as handling zero-width joiners. I assume UAX 31 has received a disproportionate level of attention thanks to hammering out DNS and URL standards, but maybe that's just because I have a background in DNS. > > > The section on numbering (5.5) is only a page long and essentially > > recommends handling decimal based numbering systems. There isn't > > nearly as much care given to this topic. > > Bengali and Oriya are decimal-based. (Whether they should be used > together in a single number is another matter.) The first paragraph of > Section 5.5 specifically discusses interpreting Devanagari digits as one > would interpret Basic Latin digits. I don't know what needs to be added > here. As Frédéric points out in his reply, section 22.3 has a lengthier treatment (which I totally missed). At a minimum, 5.5 should cross-reference 22.3. > > There is a standard annex on mathematics, but that is in PDF form and > > is largely concerned with parsing and display of mathematical > > formulas. > > UTR #25 (a Technical Report, not a Standard Annex) does focus on Basic > Latin digits, at one point (2.2) claiming that Basic Latin digits are > essentially the only digits used in math, but it's true that the UTR is > about math notation and that isn't really in scope here. I think it's significant to answering Roger's question. How much demand is there for using native numeric literals when most control-flow logic is going to be in English? > The fact that the UTR is a PDF document doesn't seem pertinent.
PDFs do not rank well on Google, you can't deeply link to specific sections, and they are generally a PITA to work with. The Unicode standard publishes PDFs *not* because it is a good idea, but because it's inconvenient to change a 30-year-old publishing workflow. > > However, as is the answer to most questions, it is a matter of time > > and money. If someone is willing to spend the time expanding 5.5 > > writing a new annex, I am sure the Unicode committee would be happy to > > review it. Would you be interested in doing that legwork? > > Again, I don't see what is lacking in Section 5.5, especially > considering its Devanagari example. The legwork that needs to be done is > to make implementations more internationalized and more Unicode-aware. Yes: it's ultimately on implementers and Unicode != i18n. And: couldn't we do a better job at transitioning people to resources on how to handle i18n in a more comprehensive fashion? But also: Unicode is hella confusing, even to world-class programmers. Shouldn't we try to recruit suckers like Roger and I into making it better? ? -Zach Lym From kenwhistler at sonic.net Mon Dec 21 20:08:32 2020 From: kenwhistler at sonic.net (Ken Whistler) Date: Mon, 21 Dec 2020 18:08:32 -0800 Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?= =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?= In-Reply-To: References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com> Message-ID: <9e70ebf1-7f81-5730-1964-ad86d425a82e@sonic.net> On 12/21/2020 5:00 PM, Zach Lym via Unicode wrote: >> The fact that the UTR is a PDF document doesn't seem pertinent. > PDFs do not rank well on Google, you can't deeply link to specific > sections, Actually, you can, if you set them up correctly: https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf#G12146 That links right to Table 22-3, Script-Specific Decimal Digits on p. 829, in Section 22.3 of the latest version of the core specification. 
> and they are generally a PITA to work with. Well, your mileage may vary. HTML has its own PITA aspects. > The Unicode > standard publishes PDFs*not* because it is a good idea, but because > it's inconvenient to change a 30-year-old publishing workflow. 20, not 30, actually. Prior to Unicode 3.0, the Unicode Standard was done with a different family of editorial tooling. But yeah, it is inconvenient to change, especially since the document is riddled with hand-tweaked figures and hacked up fonts. And it's a thousand pages long, and it has internal indexing and the sections, figures, and tables are all cross-referenced in the document. And oh, did I mention? It's a thousand pages long. Various folks have wanted to reformat it to something more web-friendly and searchable over the years, but they have tended to discover other things that they needed to do when faced with the actual amount of work involved. ;-) --Ken > -------------- next part -------------- An HTML attachment was scrubbed... URL: From junicode at jcbradfield.org Tue Dec 22 04:15:09 2020 From: junicode at jcbradfield.org (Julian Bradfield) Date: Tue, 22 Dec 2020 10:15:09 +0000 (GMT) Subject: Italics get used to express important semantic meaning, so unicode should support them References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: Veering further off-topic...but why not :-? On 2020-12-21, Ken Whistler via Unicode wrote: > Actually, simple markup conventions like that mostly date from early > days of email, when plain text (and usually just ASCII at that) were all > you got. 
(By the way, the most usual interpretation of those is > _underscore_, /italic/, and *bold*, but whatever.) But underscore is just the manuscript equivalent of italic print...both naively (when you underline a word in a letter, you would now italicize it in a typeset letter) and as formalized in copy-editing markup. So for many of us, _italic_ has always been natural, and /italic/ always looks a bit weird. It's curious that nobody (as far as I know) adapts copy-editing markup and writes ~bold~ . From wjgo_10009 at btinternet.com Tue Dec 22 03:52:25 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 22 Dec 2020 09:52:25 +0000 (GMT) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <3a3f8743-efdd-6d66-d4ce-788d53017bad@code2001.com> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> <2408ce9d-aa27-af56-09cf-3a0a5fc80e24@ix.netcom.com> <3a3f8743-efdd-6d66-d4ce-788d53017bad@code2001.com> Message-ID: <386343f0.bc8.17689dd3b10.Webtop.51@btinternet.com> Hi James Kass wrote as follows. > Written communication is of, by, and for human beings. Wow, nominative, genitive, ablative and dative all in one short sentence. Can you express that sentence with emoji? Maybe if these abstract emoji were encoded in regular Unicode one could encode that sentence and lots of other sentences too. http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm Yet are abstract emoji acceptable to Unicode Inc. for encoding into The Unicode Standard?
Best regards, William Overington Tuesday 22 December 2020 From wjgo_10009 at btinternet.com Tue Dec 22 04:31:31 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 22 Dec 2020 10:31:31 +0000 (GMT) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: <2ab4a4c9.c23.1768a010515.Webtop.51@btinternet.com> Hi Julian Bradfield wrote as follows. > It's curious that nobody (as far as I know) adapts copy-editing markup and writes ~bold~ . Well, that seems like a very good idea of a stylish way to express bold type. That can be put into practice now if people choose to use it. Best regards, William Overington Tuesday 22 December 2020
From wjgo_10009 at btinternet.com Tue Dec 22 04:50:25 2020 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 22 Dec 2020 10:50:25 +0000 (GMT) Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: <31808f37.c6b.1768a125280.Webtop.51@btinternet.com> Hi Ken Whistler wrote as follows. > Actually, simple markup conventions like that mostly date from early > days of email, when plain text (and usually just ASCII at that) were > all you got. Back in the early 1990s when only ASCII was available, the circumflex accented characters needed for Esperanto were often expressed by using a lowercase letter x after the ASCII base of the character. Namely as follows, Cx cx Gx gx Hx hx Jx jx Sx sx I do not remember whether the U breve and the u breve were expressed as Ux or ux at all. This seemed to me at first to be quite strange, but I got used to reading it. The method was suitable for unambiguous use because the Esperanto language does not use the letter x in its alphabet. Best regards, William Overington Tuesday 22 December 2020
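For readers who want to see the x-convention described in the message above in action, here is a minimal sketch in Python of turning such ASCII text back into the accented letters. The function name `from_x_system` and the table name are made up for illustration. Including Ux/ux for the breve letters is an assumption on my part, since the message itself is unsure whether those were written that way.

```python
import re

# ASCII base letter followed by x -> accented Esperanto letter.
# The U/u entries (breve) are an assumption; the message above does not
# confirm that Ux/ux were used.
X_SYSTEM = {
    "C": "\u0108", "c": "\u0109",  # C, c with circumflex
    "G": "\u011C", "g": "\u011D",  # G, g with circumflex
    "H": "\u0124", "h": "\u0125",  # H, h with circumflex
    "J": "\u0134", "j": "\u0135",  # J, j with circumflex
    "S": "\u015C", "s": "\u015D",  # S, s with circumflex
    "U": "\u016C", "u": "\u016D",  # U, u with breve
}

# One base letter immediately followed by x (either case).
_X_PAIR = re.compile(r"([CcGgHhJjSsUu])[Xx]")


def from_x_system(text: str) -> str:
    """Replace each base-letter-plus-x pair with the accented letter."""
    return _X_PAIR.sub(lambda m: X_SYSTEM[m.group(1)], text)


print(from_x_system("ehxosxangxo cxiujxauxde"))
```

The conversion is only unambiguous for the reason given in the message: x is not a letter of the Esperanto alphabet. Text that mixes in foreign words containing such pairs (for example "luxe") would be converted incorrectly by this sketch.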
From indolering at gmail.com Tue Dec 22 16:52:32 2020 From: indolering at gmail.com (Zach Lym) Date: Tue, 22 Dec 2020 14:52:32 -0800 Subject: Re: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: <9e70ebf1-7f81-5730-1964-ad86d425a82e@sonic.net> References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com> <9e70ebf1-7f81-5730-1964-ad86d425a82e@sonic.net> Message-ID: On Mon, Dec 21, 2020 at 6:08 PM Ken Whistler wrote: > On 12/21/2020 5:00 PM, Zach Lym via Unicode wrote: > PDFs do not rank well on Google, you can't deeply link to specific > sections, > > Actually, you can, if you set them up correctly: > > https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf#G12146 > > That links right to Table 22-3, Script-Specific Decimal Digits on p. 829, in Section 22.3 of the latest version of the core specification. I don't want to be rude ... but protips just enable user abuse. Being an expert in something insulates you from the harsh realities faced by your users. During my review of filename normalization decisions ... experts were confused and made poor choices at virtually every step. Usability engineering views confused end-users as a signal that important information isn't being surfaced in an appropriate manner. When you are failing to meet your target demographic of smart-people-in-a-hurry, what hope is there for us idiots? Not much, **because no one reads manuals.** That is one of the most reliable findings of technical documentation research stretching back to 1987 [1]. > and they are generally a PITA to work with. > > Well, your mileage may vary. HTML has its own PITA aspects. Everything involves trade-offs, but Unicode's PDFs are even worse than the IETF's typewriter emulator. > The Unicode > standard publishes PDFs *not* because it is a good idea, but because > it's inconvenient to change a 30-year-old publishing workflow.
> > 20, not 30, actually. Prior to Unicode 3.0, the Unicode Standard was done with a different family of editorial tooling. So what are you using, DocBook? > But yeah, it is inconvenient to change, especially since the document is riddled with hand-tweaked figures and hacked up fonts. And it's a thousand pages long, and it has internal indexing and the sections, figures, and tables are all cross-referenced in the document. And oh, did I mention? It's a thousand pages long. Oh, no. That sounds terrible ... for someone who isn't a print ?! My father taught graphic design, so I grew up messing around with PageMaker and doing table based HTML layouts. My post high-school job involved Vietmanese and Amharic print work. The Ethiopian history books weren't a thousand pages long, but they had reference indexes, figures, and codepage mixing ... the whole nine ?. > Various folks have wanted to reformat it to something more web-friendly and searchable over the years, I only suggested additional cross-references within the standard and possibly a new technical report, which would be more *user* friendly. The UX rabbit hole started based on my assertion that the disproportionate amount of effort put into the identifier documentation and widespread support for i18n variable names is more than *just* a correlation. > but they have tended to discover other things that they needed to do when faced with the actual amount of work involved. ;-) That is not something an outsider could do, as the primary audience for any product are the people who make it. And if the "insiders" don't see a problem.... Hence my invocation of Conway's law: the standard reflects the particular bureaucratic mould in which it is formed. ? 
- Zach Lym [1]: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=minimal+manual&btnG= From duerst at it.aoyama.ac.jp Tue Dec 22 18:14:59 2020 From: duerst at it.aoyama.ac.jp (Martin J. Dürst) Date: Wed, 23 Dec 2020 09:14:59 +0900 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> Message-ID: <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp> Hello everybody, Just for what it's worth, here are a few details on how at least some email clients handle ASCII email styling conventions (/ for italics, _ for underscore, and * for boldface). On 22/12/2020 04:42, Ken Whistler via Unicode wrote: > > On 12/21/2020 11:05 AM, abrahamgross--- via Unicode wrote: >> The only reason why things like_italics_ or*italics* are around is >> because of the lack of real italics. I would go as far as to say that >> the very existence of*italics* in plain text shows that theres a real >> need for italics when writing plain text. >> This is a workaround around a real problem of the lack of italics if >> I've ever seen one? The two places above are displayed without styling in plaintext, probably because they are quoted. They show up styled in HTML because that contains additional tags (,...). > Actually, simple markup conventions like that mostly date from early > days of email, when plain text (and usually just ASCII at that) were all > you got. (By the way, the most usual interpretation of those is > _underscore_, /italic/, and *bold*, but whatever.)
These show up styled in plaintext display, but not in HTML, presumably because Ken entered the styling characters by hand (in the HTML version, there is no markup). > Nowadays, presto chango, most email clients support rich text (in HTML, > usually), and you get to _underscore_, /italicize/, and *bold* your text > correctly whenever you want to, and even change the font size to SHOUT, > if you want. These show up styled in HTML, most probably because Ken used the text editor to style them that way. The plaintext version contains ASCII email styling characters (but the HTML version doesn't), and my guess is that they were added when the mailer produced the plaintext version. Your mailer and your mileage may vary. Regards, Martin. > Some folks here seem to be viewing the "problem" here the wrong way > round. The issue isn't that plain text cannot preserve all the "meaning" > conveyed in writing systems. When dealing with meaning conveyed with > conventions that involve styling, font change, color and such, you > simply depend on properly tiered text architecture and build support for > that in rich text and markup. It is ass-backwards to try to continue to > clot up plain text as the backbone of text interchange by trying to > import all the complications of styling directly into it as if that > representation were a plain text issue -- it isn't. > > Instead the *real* problem here is that in some communication contexts > that should be supporting rich text, implementations are still > restricting people to plain text when what they really want is easily > accessible and dependable rich text to convey more nuances accurately > (or just to be more expressive). If Twitter is half-assed about > supporting text styling, then direct your concerns in the proper > direction. You don't fix Twitter's or texting apps' use of text by > trying to force styling into the Unicode encoding of plain text. 
> > --Ken From kent.b.karlsson at bahnhof.se Tue Dec 22 18:55:15 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Wed, 23 Dec 2020 01:55:15 +0100 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp> Message-ID: > 23 dec. 2020 kl. 01:14 skrev Martin J. Dürst via Unicode : > > ... > >> Nowadays, presto chango, most email clients support rich text (in HTML, usually), and you get to _underscore_, /italicize/, and *bold* your text correctly whenever you want to, and even change the font size to SHOUT, if you want. > > These show up styled in HTML, most probably because Ken used the text editor to style them that way. The plaintext version contains ASCII email styling characters (but the HTML version doesn't), and my guess is that they were added when the mailer produced the plaintext version. > > Your mailer and your mileage may vary. That kind of markup now goes by the name “markdown” (apparently, I don’t like that name, the pun only(?) works in English), and each system has their own variant. Wikipedia has one, various chat platforms have theirs (all likely slightly different), Trac has its variant, Jira has its variant of this, etc. etc. Some have bullet lists, some not, some have headings perhaps allowing different levels, some allow for strike-over. At which point the “markdown” is converted to (e.g) HTML (or other more robust markup) may vary. It’s the Wild Wild West.
And it is not at all robust, mishaps easily happen and may be hard to get out of. But I agree it is handy (easy to type on the keyboard), most often it works as intended but not always… /Kent K > Regards, Martin. From sosipiuk at gmail.com Tue Dec 22 19:01:59 2020 From: sosipiuk at gmail.com (Sławomir Osipiuk) Date: Tue, 22 Dec 2020 20:01:59 -0500 Subject: Italics get used to express important semantic meaning, so unicode should support them In-Reply-To: <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp> References: <000301d6d00e$4d244330$e76cc990$@ewellic.org> <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com> <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp> <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org> <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org> <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp> <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org> <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de> <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp> Message-ID: <000701d6d8c7$37f71700$a7e54500$@gmail.com> I think you forgot the most important part: Which email clients? None of the markup has the intended effect for me, and in pure plaintext none of it (currently) can. Whatever client you're using is interpreting the markup and applying the formatting. From doug at ewellic.org Wed Dec 23 16:40:43 2020 From: doug at ewellic.org (Doug Ewell) Date: Wed, 23 Dec 2020 15:40:43 -0700 Subject: Italics get used to express important semantic meaning, so unicode should support them Message-ID: <20201223154043.665a7a7059d7ee80bb4d670165c8327d.9832ff6ba9.wbe@email15.godaddy.com> Replying to a bunch of messages at once; the impending holidays and that have limited my available time for extended posts. Some of these topics may be “resolved” by now, so enjoy the nostalgia. Sławomir Osipiuk wrote: >> All TAG symbols placed between a U+E003D TAG LESS-THAN SIGN and a >> U+E003E TAG GREATER-THAN SIGN, inclusive, are to be treated as if >> they were they corresponding ASCII characters, and run that through >> an HTML renderer. I guess if you wanted you could stipulate some >> reduced or restricted subset of HTML > > I've been informed off-list that BabelPad uses this as a formatting > option. So, it's been done. I do use this feature in BabelPad at times -- in fact, just today while copying the Unicode subsection on “plain text” from Section 2.2 of the PDF, and not feeling inclined at the moment to open Word.
But it’s a bit like using PUA characters, or even SCSU: I know this usage is not part of the standard and unlikely to be supported by anything else, so absent an explicit agreement, I’d better keep it to myself. > My guiding example is, "record fully the story text of a paperback > novel". So here is the salient part I gathered from the TUS definition, with BabelPad formatting (hee hee) removed. Apologies if this passage is too lengthy to qualify as fair use: Plain text represents character content only, not its appearance. It can be displayed in a variety of ways and requires a rendering process to make it visible with a particular appearance. If the same plain text sequence is given to disparate rendering processes, there is no expectation that rendered text in each instance should have the same appearance. Instead, the disparate rendering processes are simply required to make the text legible according to the intended reading. This legibility criterion constrains the range of possible appearances. The relationship between appearance and content of plain text may be summarized as follows: Plain text must contain enough information to permit the text to be rendered legibly, and nothing more. The emphasis on “legibility” seems important here. Despite the focus on “semantic meaning” in this thread, neither of those words appear anywhere in the TUS definition of plain text. Kent Karlsson wrote: > Now, where did I see something very much like [Sławomir’s original > suggestion with U+E0002 FORMAT TAG]??? > > Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very > close (especially the “invisible by default” (“default ignorable”) IF > parsed correctly). And… ECMA-48 is already a standard.
Perhaps surprisingly, or perhaps not, ECMA-48 is actually my favorite mechanism for low-level styling of plain text, mostly for the reasons Kent cites here and elsewhere: it’s lightweight, it’s been a standard for a long time, and it’s already in extensive use by at least one sector of text processing. Kent might be one of the surprised ones, because I haven’t been a fan of some of the “updates” to ECMA-48 that he has recommended, in particular those that I feel extend, restrict, or invent too much. But I like the standard in general, and some modest amount of updating is probably inevitable to keep it current. Sławomir: > I didn't mention it because it's a bit outdated “Outdated” is just generally a big red flag for me. If a standard doesn’t meet modern needs, and can’t reasonably be made to do so, that’s one thing, but the fact that it was developed some arbitrary number of years ago is not something I care about. Unicode itself is about 30 years old and I hope nobody sees that as evidence it needs imminent replacing. > But that "if parsed correctly" is quite the nit, isn't it? This is true for any such mechanism. I remember early HTML authors being upset when browsers stopped accepting text like this. Some of the emoji mechanisms involving combinations of ZWSP, variation selectors, Fitzpatrick swatches, and toupees might boggle some implementers’ minds, but to play the game, you’ve got to learn the rules. David Starner wrote: > ECMA-48 is not plain text. Exactly so, but it’s a VERY thin layer above plain text, which is part of what I like about it.
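As a rough illustration of how thin that layer is, here is a minimal Python sketch that marks the italic spans of the paper title discussed earlier in this thread with ECMA-48 SGR control sequences (parameter 3 for italicized, 23 for not italicized) and then recovers the plain text by stripping them. The helper names `italic` and `strip_sgr` are made up for this example, and the regular expression is an assumption that only covers simple semicolon-separated SGR sequences, not the whole of ECMA-48.

```python
import re

ESC = "\x1b"
ITALIC_ON = ESC + "[3m"    # SGR 3: italicized
ITALIC_OFF = ESC + "[23m"  # SGR 23: not italicized

# Assumption: only simple SGR sequences (CSI, digits/semicolons, final "m").
SGR_PATTERN = re.compile(r"\x1b\[[0-9;]*m")


def italic(text):
    """Wrap text in SGR italic-on / italic-off sequences (hypothetical helper)."""
    return ITALIC_ON + text + ITALIC_OFF


def strip_sgr(text):
    """Remove SGR sequences, leaving only the plain-text content."""
    return SGR_PATTERN.sub("", text)


title = (
    "Evidence suggesting that "
    + italic("Homo neanderthalensis")
    + " contributed the H2 "
    + italic("MAPT")
    + " haplotype to "
    + italic("Homo sapiens")
)

plain = strip_sgr(title)
print(plain)
```

A receiver that does not understand the control sequences has to parse and skip them to get back to the plain title, which is the "if parsed correctly" caveat raised above.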
-- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From doug at ewellic.org Wed Dec 23 17:59:59 2020 From: doug at ewellic.org (Doug Ewell) Date: Wed, 23 Dec 2020 16:59:59 -0700 Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer? Message-ID: <20201223165959.665a7a7059d7ee80bb4d670165c8327d.c68e32ad5b.wbe@email15.godaddy.com> Richard Wordingham wrote: >> I suggest you double-check about the RTL digits (N'Ko & Adlam); >> please take a look at the relevant Unicode book chapters. > > It looks as though the N'ko section documents the significance by > accident! I thought a policy was going to be documented (2012 or > slightly later) that decimal digits are stored most significant > digit first, but that doesn't seem to have happened. It happened for N’Ko anyway: “N’Ko uses decimal digits specific to the script. These digits have strong right-to-left directionality. Numbers are stored in text in logical order with most significant digit first; when displayed, numerals are then laid out in right-to-left order, with the most significant digit at the rightmost side, as illustrated for the numeral 144 in Figure 19-3. This situation differs from how numerals are handled in Hebrew and Arabic, where numerals are laid out in left-to-right order, even though the overall text direction is right to left.”
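The storage rule in the quoted passage (logical order, most significant digit first) means the same accumulation loop works for N'Ko digits as for ASCII digits. The following Python sketch illustrates this under stated assumptions: `parse_decimal` is a hypothetical helper, it relies on the standard `unicodedata.decimal` lookup, and it makes no attempt to reject strings that mix digits from different scripts.

```python
import unicodedata


def parse_decimal(digits):
    """Parse a string of Unicode decimal digits stored most significant first.

    Hypothetical helper for illustration; raises ValueError on an empty
    string or on any character without a decimal digit value.
    """
    if not digits:
        raise ValueError("empty digit string")
    value = 0
    for ch in digits:
        # unicodedata.decimal() raises ValueError for non-decimal characters.
        value = value * 10 + unicodedata.decimal(ch)
    return value


# N'Ko digits ONE, FOUR, FOUR in logical (stored) order.
nko_144 = "\u07C1\u07C4\u07C4"

print(parse_decimal(nko_144))
print(parse_decimal("144"))
```

Python's built-in `int()` also accepts such strings, so `int(nko_144)` gives the same value; the explicit loop is shown only to make the digit order visible.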
-- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From doug at ewellic.org Wed Dec 23 18:16:44 2020 From: doug at ewellic.org (Doug Ewell) Date: Wed, 23 Dec 2020 17:16:44 -0700 Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer? Message-ID: <20201223171644.665a7a7059d7ee80bb4d670165c8327d.ee5c25e810.wbe@email15.godaddy.com> >> I thought a policy was going to be documented (2012 or >> slightly later) that decimal digits are stored most significant >> digit first, but that doesn't seem to have happened. > > It happened for N’Ko anyway: Ohh, you mean a formal policy of the kind found on https://www.unicode.org/policies/policies.html . No, there doesn’t appear to be such a policy, although there also don’t appear to be any sets of decimal digits that deviate from it. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From doug at ewellic.org Wed Dec 23 18:42:10 2020 From: doug at ewellic.org (Doug Ewell) Date: Wed, 23 Dec 2020 17:42:10 -0700 Subject: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ Message-ID: <20201223174210.665a7a7059d7ee80bb4d670165c8327d.b4f303433f.wbe@email15.godaddy.com> Fredrick Brennan wrote: > With Unicode superscript lowercase letters, dates with superscript > ordinal indicators in English can be written in plaintext, e.g.: > > 1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so > on. > > [...] > > However, I have a feeling that this use is an abuse of the standard, > but that brings up an interesting comparison with the ordinal > indicators for Spanish, Portuguese (& other languages?), the masculine > º and the feminine ª. > > If anyone has time to answer, why is one an abuse and the other not, > if indeed 1ˢᵗ is an abuse as I think? I suppose it is, and the best answer to “why”
is definitional: because º and ª were encoded (in legacy standards, and consequently brought into Unicode) for the purpose of being ordinal indicators, whereas ˢ and ᵗ and ⁿ and ᵈ and ʳ and ʰ were encoded for the purpose of being phonetic modifiers. (Even ⁿ, encoded alongside the superscript digits, “functions as a modifier letter” according to the note in the code chart.) I know that 1st and 2nd and 3rd and 4th (no superscripts) are generally considered legible in English (back to the “plain text is for legibility” definition). I don’t know if 1o and 2a are considered equally legible in Spanish and Portuguese; if they are not, that might help explain why dedicated characters for º and ª were prioritized in earlier character sets. There are two types of people: those who are bothered by “Unicode abuse” and those who are not. -- Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org From kent.b.karlsson at bahnhof.se Wed Dec 23 19:20:42 2020 From: kent.b.karlsson at bahnhof.se (Kent Karlsson) Date: Thu, 24 Dec 2020 02:20:42 +0100 Subject: Re: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ In-Reply-To: <20201223174210.665a7a7059d7ee80bb4d670165c8327d.b4f303433f.wbe@email15.godaddy.com> References: <20201223174210.665a7a7059d7ee80bb4d670165c8327d.b4f303433f.wbe@email15.godaddy.com> Message-ID: > On 24 Dec 2020, at 01:42, Doug Ewell via Unicode wrote: > > Fredrick Brennan wrote: > >> With Unicode superscript lowercase letters, dates with superscript >> ordinal indicators in English can be written in plaintext, e.g.: >> >> 1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so >> on. >> >> [...] >> >> However, I have a feeling that this use is an abuse of the standard, >> but that brings up an interesting comparison with the ordinal >> indicators for Spanish, Portuguese (& other languages?), the masculine >> º and the feminine ª. >> >> If anyone has time to answer, why is one an abuse and the other not, >> if indeed 1ˢᵗ
is an abuse as I think? > > I suppose it is, and the best answer to “why” is definitional: > because º and ª were encoded (in legacy standards, and consequently > brought into Unicode) for the purpose of being ordinal indicators, > whereas ˢ and ᵗ and ⁿ and ᵈ and ʳ and ʰ were encoded for the > purpose of being phonetic modifiers. (Even ⁿ, encoded alongside the > superscript digits, “functions as a modifier letter” according to > the note in the code chart.) > > I know that 1st and 2nd and 3rd and 4th (no superscripts) are generally > considered legible in English (back to the “plain text is for > legibility” definition). I don’t know if 1o and 2a are considered > equally legible in Spanish and Portuguese; I think they are. At least it is not uncommon to write them without superscripting them, and I don’t think that causes any confusion. > if they are not, that might > help explain why dedicated characters for º and ª were prioritized in > earlier character sets. There may be some stronger preference to superscript them, but not more than that. (And that France did not insist on ?/? in Latin-1?) Note that superscript o and superscript a are doubly encoded. AFAICT, I think the explanation for that is the following: The ordinal indicators are optionally underlined (varies by font) at the superscript level, whereas the modifier letters are not underlined. (And I know of no current styling mechanism, or font feature, to underline them at the superscript level; underlining would underline them at the normal letter baseline level.) > There are two types of people: those who are bothered by “Unicode > abuse” and those who are not. Nit: I submitted to CLDR RBNF rules for numeric ordinals in several languages using the superscript letters several years ago. After a year or two the CLDR committee replaced the superscript letters by ordinary letters, citing lack of (consistent) font support for the superscript letters.
Even now, looking at this email, I see superscript letters of inconsistent sizes and positions, and some superscript letters (even if only looking for a-z) might not be supported in “all” fonts. /Kent K > -- > Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org > > > From richard.wordingham at ntlworld.com Thu Dec 24 09:50:29 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 24 Dec 2020 15:50:29 +0000 Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer? In-Reply-To: <20201223165959.665a7a7059d7ee80bb4d670165c8327d.c68e32ad5b.wbe@email15.godaddy.com> References: <20201223165959.665a7a7059d7ee80bb4d670165c8327d.c68e32ad5b.wbe@email15.godaddy.com> Message-ID: <20201224155029.39e3f212@JRWUBU2> On Wed, 23 Dec 2020 16:59:59 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > >> I suggest you double-check about the RTL digits (N'Ko & Adlam); > >> please take a look at the relevant Unicode book chapters. > > > > It looks as though the N'ko section documents the significance by > > accident! I thought a policy was going to be documented (2012 or > > slightly later) that decimal digits are stored most significant > > digit first, but that doesn't seem to have happened. > > It happened for N’Ko anyway: > > “N’Ko uses decimal digits specific to the script. These digits have > strong right-to-left directionality. Numbers are stored in text in > logical order with most significant digit first; when displayed, > numerals are then laid out in right-to-left order, with the most > significant digit at the rightmost side, as illustrated for the > numeral 144 in Figure 19-3. This situation differs from how numerals > are handled in Hebrew and Arabic, where numerals are laid out in > left-to-right order, even though the overall text direction is right > to left.” As you later noted, the third expresses not a policy, but a rule for N'ko 'decimal digits'.
The last sentence is simply appalling: 1. Hebrew numerals are written with the most significant element on the right. For Unicode, what is significant is that as the elements are letters, they follow the normal presentation rule for sequences of Hebrew letters. 2. I would expect the components of Arabic letter numerals to follow the same rules as when the elements are being used as letters. I can find examples of both biggest first and smallest first. 3. The 'decimal digits' for Arabic 'five and twenty' are laid out in the order sounded, i.e. the digit 5 is on the right and the digit 2 is on the left. As with N'ko, the most significant digit is stored first. Richard. From mark at kli.org Thu Dec 24 10:41:36 2020 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 24 Dec 2020 11:41:36 -0500 Subject: Is there a difference between converting a string of ASCII digits to an integer versus a string of non-ASCII digits to an integer? In-Reply-To: <20201224155029.39e3f212@JRWUBU2> References: <20201223165959.665a7a7059d7ee80bb4d670165c8327d.c68e32ad5b.wbe@email15.godaddy.com> <20201224155029.39e3f212@JRWUBU2> Message-ID: <3552ad0b-60d1-11a9-0293-12dd07e50eab@kli.org> An HTML attachment was scrubbed... URL: From indolering at gmail.com Tue Dec 29 12:37:41 2020 From: indolering at gmail.com (Zach Lym) Date: Tue, 29 Dec 2020 10:37:41 -0800 Subject: Re: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: References: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> Message-ID: Trying to reboot this conversation, what *demand* is there for supporting non-Latin digits? AFAICT [1], most literate adults use Latin digits (0-9) for basic math in North America, South America, Europe, Australia, and countries using a CJK script. Online stores and license plates for the UAE and India list prices using Latin digits.
By process of elimination [2], language groups that don't use latin numerals drops to <100 million. If those numbers are accurate, then there isn't enough of a critical mass to justify the implementation effort. Not if you aren't also going to translate keywords.... Thank you, -Zach Lym [1]: https://linguistics.stackexchange.com/questions/37899/how-prevalent-are-western-hindu-arabic-numerals-digits-0-9-in-cultures-with-si?noredirect=1#comment87069_37899 [2]: https://en.wikipedia.org/wiki/List_of_writing_systems#List_of_writing_scripts_by_adoption On Wed, Dec 23, 2020 at 7:30 AM Steven R. Loomis wrote: > For much more on localized numbers, see CLDR, > https://www.unicode.org/reports/tr35/tr35-numbers.html#Contents > > -s > > On Sun, Dec 20, 2020 at 1:57 PM Zach Lym via Unicode > wrote: > >> I don't think it's fair to dismiss this as "not a unicode problem." As >> the OP pointed out, support for non-latin variable names is largely due to >> Unicode's identity standard and extensive implementation advice. >> >> The section on numbering (5.5) is only a page long and >> essentially recommends handling decimal based numbering systems. There >> isn't nearly as much care given to this topic. There is a standard annex >> on mathematics, but that is in PDF form and is largely concerned with >> parsing and display of mathematical formulas. >> >> However, as is the answer to most questions, it is a matter of time and >> money. If someone is willing to spend the time expanding 5.5 writing a new >> annex, I am sure the Unicode committee would be happy to review it. Would >> you be interested in doing that legwork? >> >> I'm actually pretty new here, what's the best way Roger could contribute >> to make Unicode better in this regard? >> >> Thanks, >> -Zach Lym >> >> On Wed, Dec 16, 2020 at 5:23 PM Mark E. 
Shoulson via Unicode < >> unicode at unicode.org> wrote: >> >>> On 12/16/20 10:40 AM, Doug Ewell via Unicode wrote: >>> >>> What I don't understand here is why this is being framed implicitly as a Unicode problem, or an XML problem, or a general law of nature ("why can’t a Bengali-speaking person use the Bengali digits"), instead of an inherent limitation of that particular library function used for that particular language. >>> >>> Yes, exactly. This is "a characteristic of the code libraries, not a >>> Unicode problem." >>> >>> >>> There are probably reasonable reasons not to update the actual >>> atol/strtol calls, but one could certainly write a library to do what >>> you're talking about... and apparently someone has, by Bill Poser's report >>> of his libuninum. There ya go. >>> >>> >>> ~mark >>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Tue Dec 29 13:28:23 2020 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 29 Dec 2020 11:28:23 -0800 Subject: Re: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: References: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> Message-ID: On Tue, Dec 29, 2020 at 10:41 AM Zach Lym via Unicode wrote: > Trying to reboot this conversation, what *demand* is there for supporting > non-latin digits? > There are hundreds of millions of people in the Arabic-speaking world, in & near Iran, in parts of India, ... that routinely use and prefer their native digits. If those numbers are accurate, then there isn't enough of a critical mass > to justify the implementation effort. > What effort? Given basic Unicode support in many programming languages and libraries, it takes minutes to go from parsing ASCII digits to parsing any & all decimal digits. markus -------------- next part -------------- An HTML attachment was scrubbed...
URL: From jameskass at code2001.com Tue Dec 29 13:30:00 2020 From: jameskass at code2001.com (James Kass) Date: Tue, 29 Dec 2020 19:30:00 +0000 Subject: Re: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: References: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> Message-ID: On 2020-12-29 6:37 PM, Zach Lym via Unicode wrote: > If those numbers are accurate, then there isn't enough of a critical mass > to justify the implementation effort. Not if you aren't also going to > translate keywords.... Figure 4 from N5076 shows an Adlam calculator app. https://unicode.org/wg2/docs/n5076-19119r-adlam-font-repl.pdf Third party developers exist and members of less used writing systems learn to code. Even in the absence of critical mass, gaps get filled and needs get met. Unicode’s rôle is to provide a standard means of exchanging and storing non-Western digits as well as assigning properties and so forth to them. Libraries such as CLDR exist to help implementers, and members of the actual user communities seem happy to help with keyword translation. From richard.wordingham at ntlworld.com Tue Dec 29 13:58:05 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 29 Dec 2020 19:58:05 +0000 Subject: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: References: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> Message-ID: <20201229195805.4c45425c@JRWUBU2> On Tue, 29 Dec 2020 11:28:23 -0800 Markus Scherer via Unicode wrote: > What effort? Given basic Unicode support in many programming > languages and libraries, it takes minutes to go from parsing ASCII > digits to parsing any & all decimal digits. I think you've overlooked the paperwork.
There's probably code that relies on non-ASCII digits not being treated the same way as ASCII digits. Richard. From jameskass at code2001.com Tue Dec 29 14:01:25 2020 From: jameskass at code2001.com (James Kass) Date: Tue, 29 Dec 2020 20:01:25 +0000 Subject: Adlam In-Reply-To: <176791270b7.10fb4e49546040.1392071428347243955@kittens.ph> References: <176791270b7.10fb4e49546040.1392071428347243955@kittens.ph> Message-ID: <3a4f6836-f18f-d495-f410-25750a608712@code2001.com> On 2020-12-19 3:37 AM, Fredrick Brennan via Unicode wrote: > Often when other scripts are discussed, Adlam is used comparatively, as in, "well we want to avoid what happened with Adlam", or "we have had painful experience with this with Adlam". I know some of the issues in Adlam, but if anyone has the time, I (and hopefully others!) would benefit from a retelling of the "Adlam in Unicode" story. I know in the end it's a very happy story, but I'm especially curious about the bumps along the road. Best, Fred Brennan (I would also welcome some insider insight on this history.) Perusing this document might give a rough sketch: https://unicode.org/wg2/docs/n5076-19119r-adlam-font-repl.pdf Briefly, Adlam is a dynamic and developing writing system. After publication, shapes of many of the letter forms were changed. Which means that revisions were needed not only in on-line charts, but also the fonts used to produce those charts, and software bundles containing those fonts. As any programmer knows, revision can be costly. This should be regarded as “par for the course” when striving to support a developing writing system.
From asmusf at ix.netcom.com Tue Dec 29 15:08:33 2020 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 29 Dec 2020 13:08:33 -0800 Subject: Re: Unicode is universal, so how come that universality doesn’t apply to digits? In-Reply-To: <20201229195805.4c45425c@JRWUBU2> References: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org> <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org> <20201229195805.4c45425c@JRWUBU2> Message-ID: <47370a52-bdc7-fab2-7458-8334f8fd8bee@ix.netcom.com> An HTML attachment was scrubbed... URL: From marcelpauluk at ufpr.br Wed Dec 30 14:37:29 2020 From: marcelpauluk at ufpr.br (Prof. Pauluk) Date: Wed, 30 Dec 2020 17:37:29 -0300 Subject: Origins of ⌚ U+231A WATCH and ⌛ U+231B HOURGLASS Message-ID: Hello everyone, I am trying to do some kind of "provenance history" of a series of proleptic emoji, and I am stuck with these two here: ⌚ U+231A WATCH and ⌛ U+231B HOURGLASS. While, for example, ⌫ U+232B ERASE TO THE LEFT and ⌨ U+2328 KEYBOARD could be easily seen as motivated by the symbols 2023 BACKWARD ERASE and 5991 KEYBOARD from ISO7000/ IEC60417 Graphical Symbols for Use on Equipment, it is difficult to see ⌚ U+231A WATCH and ⌛ U+231B HOURGLASS being originated from, let's say, 5184 CLOCK and 1366 ELAPSED OPERATING HOURS from the same standard. I am well acquainted with ISO/TC145 Graphical Symbols collection of signs, and I am almost sure that those two symbols didn't come from any technical standard from ISO or IEC. Does anyone remember why these two Miscellaneous Technical Symbols were added, back then in the 1990s? Could it be because of Xerox Star/ Apple Lisa's HOURGLASS and Susan Kare's WRISTWATCH icon for the 1984 Macintosh? Regards, Marcel Pauluk -------------- next part -------------- An HTML attachment was scrubbed...
URL: From abrahamgross at disroot.org Wed Dec 30 18:18:51 2020 From: abrahamgross at disroot.org (abrahamgross at disroot.org) Date: Thu, 31 Dec 2020 00:18:51 +0000 (UTC) Subject: Origins of ⌚ U+231A WATCH and ⌛ U+231B HOURGLASS In-Reply-To: References: Message-ID: I'd assume these emoji are from the original Japanese set -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at sonic.net Wed Dec 30 19:50:23 2020 From: kenwhistler at sonic.net (Ken Whistler) Date: Wed, 30 Dec 2020 17:50:23 -0800 Subject: Re: Origins of ⌚ U+231A WATCH and ⌛ U+231B HOURGLASS In-Reply-To: References: Message-ID: Nope. Check their Age (see DerivedAge.txt in the UCD). Their Age is 1.1. And in fact, they go back even further -- they were published in Unicode 1.0 in 1991. They predate the Japanese telcom vendor sets that were incorporated in Unicode 6.0 in 2010. They were later mapped to KDDI and DoCoMo emoji in 2007 (see L2/07-257), so WATCH and HOURGLASS did exist in those sets, but that wasn't their original source for encoding in Unicode. I don't think they were in XCCS (the Xerox character set) or in IBM sets. They might have been picked up as well-known computer interface symbols from the 80's. --Ken On 12/30/2020 4:18 PM, abrahamgross--- via Unicode wrote: > Id assume these emoji are from the original japanese set -------------- next part -------------- An HTML attachment was scrubbed... URL: From marcelpauluk at ufpr.br Wed Dec 30 20:45:29 2020 From: marcelpauluk at ufpr.br (M. Pauluk) Date: Wed, 30 Dec 2020 23:45:29 -0300 Subject: Re: Origins of ⌚ U+231A WATCH and ⌛ U+231B HOURGLASS In-Reply-To: References: Message-ID: Thanks Ken! I had already checked XCCS and IBM code pages too, ⌚ U+231A WATCH and ⌛ U+231B HOURGLASS really couldn't have originated there. Is there any documentation of this selection process?
I would also very much like to know why some symbols like U+262E PEACE SYMBOL or U+2668 HOT SPRINGS were added right from the beginning, before there were even any kind of pressure to encode pictographs! Those initial blocks of symbols remain the most obscure for me... On Wed, Dec 30, 2020 at 10:56 PM Ken Whistler via Unicode < unicode at unicode.org> wrote: > Nope. Check their Age (see DerivedAge.txt in the UCD). Their Age is 1.1. > And in fact, they go back even further -- they were published in Unicode > 1.0 in 1991. They predate the Japanese telcom vendor sets that were > incorporated in Unicode 6.0 in 2010. > > They were later mapped to KDDI and DoCoMo emoji in 2007 (see L2/07-257), > so WATCH and HOURGLASS did exist in those sets, but that wasn't their > original source for encoding in Unicode. > > I don't think they were in XCCS (the Xerox character set) or in IBM sets. > They might have been picked up as well-known computer interface symbols > from the 80's. > > --Ken > On 12/30/2020 4:18 PM, abrahamgross--- via Unicode wrote: > > Id assume these emoji are from the original japanese set > > -------------- next part -------------- An HTML attachment was scrubbed... URL: