From wjgo_10009 at btinternet.com Fri Dec 4 06:30:32 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 4 Dec 2020 12:30:32 +0000 (GMT)
Subject: A workaround for using colour fonts in some application programs
that do not support colour fonts
Message-ID: <36a93647.41c.1762dbb82d0.Webtop.218@btinternet.com>
Hi
Some readers might like to know of a workaround that I have devised for
using colour fonts in some application programs that do not support
colour fonts.
The technique works because the Unicode code point for each character is
exactly the same whether the character is displayed in a colour font or
in a monochrome font.
The technique is to compose the design using the application program,
the characters appearing in plain monochrome form, then export as an svg
file without selecting the option to convert the text to curves.
The svg file is then displayed using an application program that does
support colour fonts.
This works simply because the application program writes the Unicode
character code points into the svg file as text, and the colour font
supporting application program then renders those same code points
using the colour font.
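As a rough illustration of why this works, here is a minimal Python
sketch that writes an svg document in which the text stays as
characters rather than curves. The function name, the canvas size and
the font family string "Playbox" are assumptions made for the example;
the family name that an installed font actually reports may differ.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def svg_with_live_text(text, font_family="Playbox", font_size=72):
    """Build an SVG document whose text is stored as characters.

    font_family is a hypothetical name used for illustration only.
    """
    svg = ET.Element("svg", {"xmlns": SVG_NS, "width": "600", "height": "120"})
    node = ET.SubElement(svg, "text", {
        "x": "10",
        "y": "90",
        "font-family": font_family,
        "font-size": str(font_size),
    })
    # The code points are stored as-is; the viewer picks the glyphs
    # (monochrome or colour) from whatever font it resolves.
    node.text = text
    return ET.tostring(svg, encoding="unicode")
```

Because the file carries a text element and no outline paths, any
viewer that supports colour fonts is free to draw the same code points
with colour glyphs.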
For example, I started with Serif Affinity Publisher, which at present
does not support colour fonts, produced an svg file without converting
the text to curves, displayed the svg file using Microsoft Edge, made a
'print screen' image, then trimmed out the browser window parts using
Microsoft Paint and saved the result as a png file.
The technique has been found to work with Affinity Publisher, Affinity
Designer and two legacy Serif products, PagePlus and CraftArtist2.
Please find attached a graphic made by me using Affinity Publisher,
Microsoft Edge, Microsoft Paint and the Playbox colour font designed and
kindly supplied free with a licence by Matt Lyon.
https://forum.affinity.serif.com/index.php?/topic/128285-colour-fonts-and-affinity-products/
William Overington
Friday 4 December 2020
-------------- next part --------------
A non-text attachment was scrubbed...
Name: playbox_in_publisher.png
Type: image/png
Size: 40979 bytes
Desc: not available
URL:
From christian.kleineidam at gmail.com Fri Dec 11 06:57:23 2020
From: christian.kleineidam at gmail.com (Christian Kleineidam)
Date: Fri, 11 Dec 2020 13:57:23 +0100
Subject: Italics get used to express important semantic meaning, so unicode
should support them
Message-ID:
In the FAQ on Ligatures it's written "The mathematical letters and digits
are meant to be used only in mathematics, where the distinction between a
plain and a bold letter is fundamentally semantic rather than stylistic."
This suggests that the spirit of Unicode includes the intention to be able
to represent semantic meaning.
On Wikidata, we have the open problem of what to do with academic articles
that have italics in their official title. We store for example the paper
https://www.wikidata.org/wiki/Q33988883 which according to what the
publisher writes on http://www.biochemsoctrans.org/content/33/4/582 has
italics as part of its proper name. In Wikidata we want to be able to
store the semantic meaning.
This gives us the choice in Wikidata to list the paper either as
"Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
haplotype to Homo sapiens" or as "Evidence suggesting that
𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2 𝑀𝐴𝑃𝑇
haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠", which uses the mathematical
characters against recommendations, while the website lists it as
"Evidence suggesting that _Homo neanderthalensis_ contributed the H2
_MAPT_ haplotype to _Homo sapiens_".
In scientific articles like that the ability to represent italics is needed
to express all of the semantic meaning that's contained in the title. In
contrast to properties like font-size, italics have come to be used in the
real world to express semantic meaning. For a project like Wikidata that
cares about storing the semantic meaning of the title of an academic paper,
that makes unicode problematic, as it leads us to lose information.
You might say that if unicode doesn't serve the needs of Wikidata to store
the semantic content of the texts we care about, we should add additional
formatting on-top of unicode. Between RTF, Markdown, SGML, HTML, XML and
Wikitext there are multiple different formats we could use on Wikidata to
potentially represent italics. If we would however choose any one of them
that would make it harder for data-reusers who use another format to
interact with our data as they would need to run a parser over the data
which increases their code complexity and makes it harder to interact with
our data.
Official style guidelines like the Chicago Manual of Style (17th edition)
specify that certain italics should be used to express certain semantic
meaning:
22.1.3 Other Types of Names
Other types of names also follow specific patterns for capitalization, and
some require italics.
22.2.1 Foreign-Language Terms
Italicize isolated words and phrases in foreign languages likely to be
unfamiliar to readers of English, and capitalize them as in their language.
22.3.2.1 ITALICS. Italicize the titles of most longer works, including the
types listed here. An initial the should be roman and lowercase before
titles of periodicals, or when it is not considered part of the title. For
parts of these works and shorter works of the same type, see 22.3.2.2.
The inability to follow the recommendations of the Chicago Manual of Style
to express semantic meaning in italics means that unicode fails in its
mission to be able to express all semantic distinctions. This means that
it's technically impossible to follow the Chicago Manual of Style in code
comments of programming code that are in unicode.
Outside of specialized needs like those of Wikidata and of programmers who
might want to follow the Chicago Manual of Style in context, the inability
of unicode to represent italic and bold text makes life harder for
average users as well. Web browsers can't offer their users the ability to
format a part of the text as italics or bold. As a result many users don't
know how to italicize or bold text when they write online, as different
websites use different standards. Many online systems break WYSIWYG for
italics and bold, which makes it harder for non-technical users to use them
to express themselves.
If Unicode would support italics and bold, the browser could make it easy
for users to apply italics or bold. Even smartphones could offer users the
option to italicize or bold text in the menu that currently allows copying
and pasting.
Websites like https://yaytext.com/bold-italic/ get used by users to express
themselves in italics and bold on platforms like Facebook and Twitter that
use Unicode without additional formatting.
Having to use the unofficial workaround of mathematical letters is
undesirable because it means that software like screen readers is less
likely to interact well with the resulting text.
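To make the workaround concrete, here is a minimal Python sketch of
what such converter sites appear to do, together with the NFKC
normalization that software would need in order to get the plain
letters back. The function names are hypothetical; the code point
values are those of the Mathematical Alphanumeric Symbols block, where
the italic small h is the one gap and is filled by U+210E PLANCK
CONSTANT.

```python
import unicodedata

ITALIC_CAPITAL_A = 0x1D434   # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E     # MATHEMATICAL ITALIC SMALL A
ITALIC_SMALL_H = "\u210E"    # PLANCK CONSTANT; U+1D455 is reserved

def to_math_italic(text):
    """Map ASCII letters to mathematical italic letters (hypothetical helper)."""
    out = []
    for ch in text:
        if ch == "h":
            out.append(ITALIC_SMALL_H)
        elif "A" <= ch <= "Z":
            out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, spaces and punctuation are left alone
    return "".join(out)

def to_plain(text):
    """Undo the styling via compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)
```

A screen reader or search index that does not apply a step like
to_plain sees twelve unrelated symbols instead of "Homo sapiens", which
is the interoperability problem described above.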
Proposal of a solution:
In today's usage italics often have semantic meaning. There are many cases
where it's desirable that a user can express such meaning but where there's
no intention to give the user control over features such as font size that
the user gets when HTML or RTF is used as format. With the symbol for
Right-to-Left text there's a precedent in unicode for having signs that
manipulate multiple following characters. At the time of the design italics
weren't used for expressing fundamentally semantic meaning such as "Homo
neanderthalensis" referring to a species as it's used in the title of the
above paper.
Create a new unicode character for begin/end italic formatting and
begin/end bold formatting that works like the unicode character for the
Right-to-Left switch.
From kenwhistler at sonic.net Fri Dec 11 12:42:52 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Fri, 11 Dec 2020 10:42:52 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
Message-ID:
On 12/11/2020 4:57 AM, Christian Kleineidam via Unicode wrote:
> Create a new unicode character for begin/end italic formatting and
> begin/end bold formatting that works like the unicode character for
> the Right-to-Left switch.
<i>...</i> and <b>...</b>
Yeah, they are sequences of 3 (or 4) existing characters, and not single
code points, but they accomplish what you are asking for and they work
everywhere on the web already.
Nobody would thank you for introducing yet *another* form of scoped
markup for the same effects that would take years to be picked up
(inconsistently) in thousands of implementations, and which would
introduce yet more possibilities for conflicts in dueling schemes for
markup in text.
--Ken
From kilobyte at angband.pl Fri Dec 11 13:42:01 2020
From: kilobyte at angband.pl (Adam Borowski)
Date: Fri, 11 Dec 2020 20:42:01 +0100
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
Message-ID: <20201211194201.GA5630@angband.pl>
On Fri, Dec 11, 2020 at 10:42:52AM -0800, Ken Whistler via Unicode wrote:
> On 12/11/2020 4:57 AM, Christian Kleineidam via Unicode wrote:
> > Create a new unicode character for begin/end italic formatting and
> > begin/end bold formatting that works like the unicode character for the
> > Right-to-Left switch.
>
> <i>...</i> and <b>...</b>
>
> Yeah, they are sequences of 3 (or 4) existing characters, and not single
> code points, but they accomplish what you are asking for and they work
> everywhere on the web already.
>
> Nobody would thank you for introducing yet *another* form of scoped markup
> for the same effects that would take years to be picked up (inconsistently)
> in thousands of implementations, and which would introduce yet more
> possibilities for conflicts in dueling schemes for markup in text.
And, despite the original recommendation, enough people use math characters
for that, so even Google considers them equivalent to basic ASCII.
So just:
echo 'Homo sapiens'|tran italic
and 'ere you go.
?!
--
??????? Latin: meow 4 characters, 4 columns, 4 bytes
??????? Greek: ???? 4 characters, 4 columns, 8 bytes
??????? Runes: ???? 4 characters, 4 columns, 12 bytes
??????? Chinese: ? 1 character, 2 columns, 3 bytes <-- best!
From doug at ewellic.org Fri Dec 11 16:38:07 2020
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 11 Dec 2020 15:38:07 -0700
Subject: Italics get used to express important semantic meaning,
so unicode should support them
In-Reply-To:
References:
Message-ID: <000301d6d00e$4d244330$e76cc990$@ewellic.org>
Christian Kleineidam wrote:
> "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
> 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"
"Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT haplotype to Homo sapiens"
This title is completely meaningful in plain text. The convention to style the names of species and haplotypes in italics is just that, a styling convention.
> Between RTF, Markdown, SGML, HTML, XML and Wikitext there are multiple
> different formats we could use on Wikidata to potentially represent
> italics. If we would however choose any one of them that would make it
> harder for data-reusers who use another format to interact with our
> data as they would need to run a parser over the data which increases
> their code complexity and makes it harder to interact with our data.
https://xkcd.com/927/
> The inability to follow the recommendations of the Chicago Manual of
> Style to express semantic meaning in italics means that unicode fails
> in it's mission to be able to express all semantic distinctions. This
> means that it's technically impossible to follow the Chicago Manual of
> Style in code comments of programming code that are in unicode.
Style guides such as Chicago and AP and MLA cover many stylistic realms beyond this. They tell the writer how to indent certain passages and what sort of contrastive font faces and sizes should be used for quotations and how tables should be laid out. None of this is within the scope of a plain-text encoding standard either.
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
From richard.wordingham at ntlworld.com Fri Dec 11 17:19:08 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 11 Dec 2020 23:19:08 +0000
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
Message-ID: <20201211231908.29035298@JRWUBU2>
On Fri, 11 Dec 2020 13:57:23 +0100
Christian Kleineidam via Unicode wrote:
> At the time of the design italics weren't used for
> expressing fundamentally semantic meaning such as "Homo
> neanderthalensis" referring to a a species as it's used in the title
> of the above paper.
I just looked in a 1969 reprint of a school biology textbook published
in 1966. It consistently italicises generic names such as _Drosophila_
within sentences, so I find your claim hard to credit.
Of course, typewritten materials had to resort to underlining to
indicate italicisation in such cases. I think I've seen such usage,
but my memory may not be reliable.
Richard.
From jameskass at code2001.com Fri Dec 11 19:41:31 2020
From: jameskass at code2001.com (James Kass)
Date: Sat, 12 Dec 2020 01:41:31 +0000
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
Message-ID:
On 2020-12-11 12:57 PM, Christian Kleineidam via Unicode wrote:
> This suggests that the spirit of Unicode includes the intention to be able
> to represent semantic meaning.
That's the spirit!
The topic of italics in Unicode was last discussed extensively on this
list in January of 2019, bleeding into February.
https://unicode.org/mail-arch/unicode-ml/y2019-m01/
As Adam Borowski points out, enough people are using the math
alphanumerics that we have a 𝑑𝑒 𝑓𝑎𝑐𝑡𝑜 method.
From indolering at gmail.com Fri Dec 11 22:14:08 2020
From: indolering at gmail.com (Zach Lym)
Date: Fri, 11 Dec 2020 20:14:08 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
Message-ID:
I have been tracking down the rationale behind the normalization
choices in filesystems. One trouble spot for implementers is
interpreting strict logician terminology paired with imprecise pseudo
code. Take the definition of Unicode's caseless matching algorithm
[D145]:
> A string X is a canonical caseless match for a string Y if and only if:
> NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
The W3C Canonical Case Fold Normalization algorithm claims to be
compatible with [D145], but uses NFC in the last step
[w3c-charmod-norm], leading to an apparent contradiction. Even though
Unicode explains that "case folding is closed under canonical
normalization" it took me a long time to find that passage and
convince myself that the W3C and Unicode matching algorithms are
equivalent. I am not alone: *Linux kernel hackers couldn't figure it
out either* [linux-norm]!
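For readers who prefer code to definitions, a minimal Python sketch of
the D145 formula follows. It assumes that Python's str.casefold() is an
acceptable stand-in for toCasefold and that unicodedata.normalize
tracks the Unicode version of the Python build in use; the function
names are made up for the example.

```python
import unicodedata

def nfd(s):
    """Canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s)

def canonical_caseless_key(s):
    # NFD(toCasefold(NFD(X))), with str.casefold() assumed to
    # play the role of toCasefold.
    return nfd(nfd(s).casefold())

def canonical_caseless_match(x, y):
    """True if x and y are a canonical caseless match in the D145 sense."""
    return canonical_caseless_key(x) == canonical_caseless_key(y)
```

With this, U+212B ANGSTROM SIGN matches "a" followed by U+030A
COMBINING RING ABOVE, and "Straße" matches "STRASSE".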
I was originally going to propose additions to D145 textual
description, cross-references to the implementation section, and
adding discussion of W3C charmod-norm. However, I don't think this
would help as the text is already quite dense and most people will
just ignore everything outside the example anyway [minimalist-manual].
I would instead like to propose normalization form generics for use in
pseudo code definitions:
NFx = NFD|NFC
NFKx = NFKD|NFKC
NFxy = NFD|NFC|NFKD|NFKC
Freestanding `X`/`Y` variables should probably be replaced to
disambiguate them from the `NFx` nomenclature. `s1`/`s2` would work
but `foo`/`bar` is less dense:
NFx(caseFold(NFD(foo))) = NFx(caseFold(NFD(bar)))
`NFx` does not currently appear within the Unicode standard itself,
but is used in the normalization annex [UAX15]. However,
**UAX15 defines `NFx` twice**, first as NFD|NFC|NFKD|NFKC and later on
as NFD|NFC. I think the proposed convention gets the most mileage out
of the nomenclature and is how I have seen `NFx` used in the real
world [linus].
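As a sketch of how the `NFx` generic reads in running code, the Python
below takes the outer normalization form as a parameter restricted to
NFD|NFC. The names NFX, caseless_key and caseless_match are
hypothetical, and str.casefold() is again assumed to stand in for
caseFold.

```python
import unicodedata

NFX = ("NFD", "NFC")  # the proposed generic: NFx = NFD|NFC

def caseless_key(s, nfx="NFD"):
    """NFx(caseFold(NFD(s))) for a caller-chosen canonical form."""
    if nfx not in NFX:
        raise ValueError("nfx must be one of " + ", ".join(NFX))
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize(nfx, folded)

def caseless_match(foo, bar, nfx="NFD"):
    """Compare foo and bar using the same outer form on both sides."""
    return caseless_key(foo, nfx) == caseless_key(bar, nfx)
```

The keys differ between the two forms, but the yes/no answer from
caseless_match is expected to be the same for either choice, which is
the point the generic is meant to convey.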
Thank you!
-Zach Lym
[w3c-charmod-norm]:
https://w3c.github.io/charmod-norm/#CanonicalFoldNormalizationStep
[linux-norm]: https://lwn.net/ml/linux-fsdevel/20190318202745.5200-10-krisman%40collabora.com
[minimalist-manual]: https://dl.acm.org/doi/10.1207/s15327051hci0302_2
[UAX15]: https://unicode.org/reports/tr15/
[linus]: https://lore.kernel.org/linux-fsdevel/CAHk-=wiFtZL5rK3T-HQPm0oG4vekDJEKS47P8BbzHSXt_6SHuA@mail.gmail.com/
From sosipiuk at gmail.com Fri Dec 11 23:58:41 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Sat, 12 Dec 2020 00:58:41 -0500
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To:
References:
Message-ID:
On Fri, Dec 11, 2020 at 11:49 PM Zach Lym via Unicode
wrote:
>
> > A string X is a canonical caseless match for a string Y if and only if:
> > NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
>
> The W3C Canonical Case Fold Normalization algorithm claims to be
> compatible with [D145], but uses NFC in the last step
> [w3c-charmod-norm], leading to an apparent contradiction. Even though
> Unicode explains that "case folding is closed under canonical
> normalization" it took me a long time to find that passage and
> convince myself that the W3C and Unicode matching algorithms are
> equivalent.
The more general rule is that:
NFC(X) = NFC(Y) if and only if NFD(X) = NFD(Y).
I.e. you can always replace one canonical form with the other in
equivalence comparisons. (As long as you apply the same one to both
sides, of course, but which one is up to you.)
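A quick spot check of that rule in Python, for anyone who wants to see
it hold on concrete strings. This is only a test over a handful of
hand-picked samples, not a proof, and the helper names are invented for
the example.

```python
import itertools
import unicodedata

def forms_agree(x, y):
    """True if the NFC comparison and the NFD comparison give the same answer."""
    nfc_equal = unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)
    nfd_equal = unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)
    return nfc_equal == nfd_equal

def disagreements(samples):
    """Return every ordered pair on which the two comparisons differ."""
    return [(x, y) for x, y in itertools.product(samples, repeat=2)
            if not forms_agree(x, y)]

SAMPLES = [
    "\u00C5", "A\u030A", "\u212B",       # three spellings of A with ring
    "q\u0307\u0323", "q\u0323\u0307",    # combining marks in either order
    "\uAC00", "\u1100\u1161",            # Hangul syllable and its jamo
    "A",
]
```

Calling disagreements(SAMPLES) should come back empty, since both forms
pick one representative per canonical equivalence class.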
> I would instead like to propose normalization form generics for use in
> pseudo code definitions:
>
> NFx = NFD|NFC
> NFKx = NFKD|NFKC
> NFxy = NFD|NFC|NFKD|NFKC
I would prefer the last one to be:
NF(K)x = NFD|NFC|NFKD|NFKC; or perhaps
NF[K]x = NFD|NFC|NFKD|NFKC; to look a bit more like ABNF.
Sławomir Osipiuk
From wjgo_10009 at btinternet.com Sat Dec 12 09:39:28 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 12 Dec 2020 15:39:28 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <79fb5335.d72.17657954ac1.Webtop.223@btinternet.com>
References:
<79fb5335.d72.17657954ac1.Webtop.223@btinternet.com>
Message-ID: <35cce9c0.d7f.176579b5f0a.Webtop.223@btinternet.com>
Hi
You might find the following links of interest. The proposal was not
successful and was dismissed strongly, indeed using italics for
emphasis.
For the avoidance of doubt I did not advocate regarding the encoding of
the mathematical italic characters as a precedent for what I proposed.
It is somewhat ironic that the refusal uses italics for emphasis and
could, in my opinion, be reasonably regarded as supporting evidence for
the case for encoding what you want, as that emphasis cannot at
present be expressed in plain text. If it is not a semantic difference
then it seems to me that there is no reason whatsoever to use italics at
all in that refusal notice. So has Unicode Inc. in fact shown in its
refusal the very need that it is refusing to encode?
https://www.unicode.org/L2/L2019/19063-italic-vs.pdf
https://www.unicode.org/L2/L2019/19195-italic-cmt.pdf
https://forum.high-logic.com/viewtopic.php?f=10&t=7831
https://www.unicode.org/alloc/nonapprovals.html
However, such dismissals are not absolute because sometimes there is a
U-turn later, for example with the encoding of emoji. Look at where
emoji encoding is now, no longer about just backwards compatibility yet
pushing forward with new designs. For the avoidance of doubt I am
pleased that emoji are being encoded. I wish that they would not insist
that my proposals for encoding a futuristic idea of mine are out of
scope and refuse to allow them to be discussed in this mailing list or
put to The Unicode Technical Committee.
I note that you mention a QID item.
There is an ongoing public review about encoding what are being called
QID emoji.
https://www.unicode.org/review/pri408/
Although the page currently shows a closing date that has passed, the
public review has, in fact, been reopened as listed on the following
page.
https://www.unicode.org/review/
Best regards,
William Overington
Saturday 12 December 2020
http://www.users.globalnet.co.uk/~ngo/
My website is safe to use, it is not hosted on my own computer, but is
hosted on a server run by Plusnet PLC, a United Kingdom company.
From christian.kleineidam at gmail.com Sat Dec 12 13:01:05 2020
From: christian.kleineidam at gmail.com (Christian Kleineidam)
Date: Sat, 12 Dec 2020 20:01:05 +0100
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <000301d6d00e$4d244330$e76cc990$@ewellic.org>
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
Message-ID:
On Fri, Dec 11, 2020 at 11:38 PM Doug Ewell wrote:
> Christian Kleineidam wrote:
>
> > "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
> > 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"
>
> "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
> haplotype to Homo sapiens"
>
> This title is completely meaningful in plain text. The convention to style
> the names of species and haplotypes in italics is just that, a styling
> convention.
>
Would you also say there's no semantic difference between "Evidence
suggesting that Homo neanderthalensis contributed the H2 MAPT haplotype to
Homo sapiens" and "EVIDENCE SUGGESTING THAT HOMO NEANDERTHALENSIS
CONTRIBUTED THE H2 MAPT HAPLOTYPE TO HOMO SAPIENS"? If so, why does unicode
allow those to be formatted differently?
I think that capitalization generally gets used to express semantic
meaning. Capitalizing the first character of a sentence is a way to
semantically mark the start of the sentence. Capitalizing Homo is a way to
express semantics. Homo gets capitalized here for the same reasons as it
gets italicized. In both cases it's because the semantics of a species name
dictate it if you follow official recommendations.
From asmusf at ix.netcom.com Sat Dec 12 16:32:33 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 12 Dec 2020 14:32:33 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
Message-ID: <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
An HTML attachment was scrubbed...
URL:
From duerst at it.aoyama.ac.jp Sat Dec 12 19:25:06 2020
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 13 Dec 2020 10:25:06 +0900
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
<8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
Message-ID: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Asmus gives a lot of good reasons below. Here are some more:
Children learn to write with upper case and lower case letters in
school, and most people continue to use both as adults. (There are
exceptions of course, some people write only with lower case, and some
only with upper case.) On the other hand, people who distinguish upright
and italic in handwriting are extremely rare (maybe limited to editors
of certain journals?).
Also, case is important in names. It's Ludwig van Beethoven, not Ludwig
Van Beethoven, and LeBron James, not Lebron James. Italics don't come
into consideration here at all.
For all these reasons, the upper/lower case distinction was and is also
available on typewriters and keyboards. Again not so for italic.
Regards, Martin.
On 13/12/2020 07:32, Asmus Freytag via Unicode wrote:
> On 12/12/2020 11:01 AM, Christian Kleineidam via Unicode wrote:
>> On Fri, Dec 11, 2020 at 11:38 PM Doug Ewell wrote:
>>
>> Christian Kleineidam wrote:
>>
>> > "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
>> > 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"
>>
>> "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
>> haplotype to Homo sapiens"
>>
>> This title is completely meaningful in plain text. The convention to style
>> the names of species and haplotypes in italics is just that, a styling
>> convention.
>>
>> Would you also say there's no semantic difference between "Evidence suggesting
>> that Homo neanderthalensis contributed the H2 MAPT haplotype to Homo sapiens"
>> and EVIDENCE SUGGESTING THAT HOMO NEANDERTHALENSIS CONTRIBUTED THE H2 MAPT
>> HAPLOTYPE TO HOMO SAPIENS"? If so, why does unicode allow those to be
>> formatted differently?
>>
>> I think that capitalization generally gets used to express semantic meaning.
>> Capitalizing the first character of a sentence is a way to semantically mark
>> the start of the sentence. Capitalizing Homo is a way to express semantics.
>> Homo gets capitalized here for the same reasons as it gets italicized. In both
>> cases it's because the semantics of a species name dictate it if you follow
>> official recommendations.
>
> There are significant differences in usage as well as implication.
>
> A style, like "italics" can be applied to nearly the entire set of Unicode
> characters, while case is limited to a comparatively tiny subset. If Unicode
> wanted to encode styles like it does for case, it would mean multiplying the
> number of characters.
>
> But Mathalphabetics, you say. Well, in mathematical notation, certain styles are
> applied to very limited subsets. In effect, you could argue that in those
> contexts, certain stylistic variants work like case in ordinary orthographies.
> (Mathematical use of letter shapes is special, as it is almost exclusively
> using letter shapes as individual symbols, not part of words).
>
> Styles, commonly, are applied in runs, not to isolated code points. For case,
> the default is the other way around. In both cases, the exceptions prove the
> underlying rule.
>
> ALL UPPER CASE, as well as SMALL CAPS, are more like a style than normal
> casing, as shown by the way they are supported like styles in feature-rich
> word processing apps. (The latter are not encoded: extending the arguments
> for encoding italics would force adding support for small caps as well.)
>
> Styles, unlike case when applied to selected letters, tend not to have
> orthographic use, even if they carry meaning that goes beyond being
> "decorative". There are exceptions even here that prove the rule.
>
> Finally, the guiding design principle for "plain text" is that it is stateless
> (again, exceptions like bidi, are there to prove the rule). Styles, being
> applied in runs, are inherently not stateless, so are best expressed in stateful
> ways (that is, in one or the other rich-text protocols).
>
> The use case comes from lack of support of stateful text protocols (even limited
> ones) in places such as social media. There is no inherent reason why Twitter,
> Facebook and the like could not support "markdown" or similar protocols.
>
> On balance, all proposals for supporting some sort of "italics in Unicode"
> ignore not only the interrelationship shown in these facts, but also the well
> established historical division of "plain text" and "rich text" -- which Unicode
> has no business upsetting.
>
> A./
>
From prosfilaes at gmail.com Sat Dec 12 19:59:53 2020
From: prosfilaes at gmail.com (David Starner)
Date: Sat, 12 Dec 2020 17:59:53 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
<8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
<7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID:
There's a lot of good answers, but I'd like to circle back to what I
think is the core reason: we've had character sets for seven decades,
virtually all of which supported English, and if any have supported
italics, I've never heard of it. Unicode supports italics the most of
any character set I've heard of. Whether in some sense italics should
be encoded in plain text is not an open problem; it's been assigned to
a level above plain text, and is well supported there.
--
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)
From indolering at gmail.com Sat Dec 12 20:23:23 2020
From: indolering at gmail.com (Zach Lym)
Date: Sat, 12 Dec 2020 18:23:23 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To:
References:
Message-ID:
> The more general rule is that:
> NFC(X) = NFC(Y) if and only if NFD(X) = NFD(Y).
> I.e. you can always replace one canonical form with the other in
> equivalence comparisons. (As long as you apply the same one to both
> sides, of course, but which one is up to you.)
Yes, and a careful reading of the standard will show that this is the
case. But we don't live in a world where people have time to read the
standard. Oh dear, I included the wrong link in my citation! It
should have been:
https://lwn.net/ml/linux-fsdevel/20190206084752.nwjkeiixjks34vao@pali/
At any rate, someone suggested using NFC, but this objection came up:
>> Is there any case where
>> NFC(x) == NFC(y) && NFD(x) != NFD(y) , or
>> NFC(x) != NFC(y) && NFD(x) == NFD(y)
>
>This is good question. And I think we should get definite answer for it
>prior inclusion of normalization into kernel.
Which was simply never followed up on. This is a feature that was
included after years of debate and developed in an open process. If
even Linux can't get this one right, then we need to do a better job
at explaining Unicode.
> > I would instead like to propose normalization form generics for use in
> > pseudo code definitions:
> >
> > NFx = NFD|NFC
> > NFKx = NFKD|NFKC
> > NFxy = NFD|NFC|NFKD|NFKC
>
> I would prefer the last one to be:
> NF(K)x = NFD|NFC|NFKD|NFKC; or perhaps
> NF[K]x = NFD|NFC|NFKD|NFKC; to look a bit more like ABNF.
I don't care for NFxy either, but I strongly prefer sticking to C
programming conventions.
From mark at kli.org Sat Dec 12 20:48:54 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Sat, 12 Dec 2020 21:48:54 -0500
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
<8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
<7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID: <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
An HTML attachment was scrubbed...
URL:
From doug at ewellic.org Sat Dec 12 21:20:01 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 12 Dec 2020 20:20:01 -0700
Subject: Italics get used to express important semantic meaning,
so unicode should support them
In-Reply-To: <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
<8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
<7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
<8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
Message-ID: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
Others have covered pretty much everything I was going to respond to Christian with.
David Starner wrote:
> I'd like to circle back to what I think is the core reason: we've had
> character sets for seven decades, virtually all of which supported
> English, and if any have supported italics, I've never heard of it.
The only conceivable exception might be ISO-IR-68, which represented APL and its distinctive italic uppercase letters. The registration [1] for ISO-IR-68 named the letters (e.g.) "CAPITAL APL LETTER A" and noted that they were "[u]sually printed in italics," revealing that this was merely a font preference specific to APL, as the corresponding roman (non-italic) letters were not also included. All mapping tables from ISO-IR-68, including Unicode's, map the italic APL letters to normal ASCII letters.
[1] https://www.itscj.ipsj.or.jp/iso-ir/068.pdf
Christian wrote:
> If so, why does unicode allow those [uppercase and lowercase letters]
> to be formatted differently?
For "formatted differently" I read "encoded separately"; Unicode doesn't dictate whether characters are displayed in an upright (roman) or italic style. If one uses George Douros's Akkadian font, for example, everything comes out in italics.
I wonder if the spelling "unicode" was meant here as a statement about the semantics of initial capitals. Standard English orthography requires that trade names like "Unicode" be spelled with an initial capital, whereas no orthographic requirement exists to spell anything with italics.
We do understand that not every possible nuance of human communication, such as shades of emphasis, can be expressed in plain text. It seems that the nearly 30-year-old Unicode definition of "plain text" still has not caught on universally, since requests continue to emerge for UTC to encode things that are not plain text by that definition.
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
From pandey at umich.edu Sat Dec 12 21:33:08 2020
From: pandey at umich.edu (Anshuman Pandey)
Date: Sat, 12 Dec 2020 21:33:08 -0600
Subject: Italics get used to express important semantic meaning,
so unicode should support them
In-Reply-To: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
References: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
Message-ID: <64B08E21-8D4B-4AC1-85A9-C40D9E468178@umich.edu>
Doug basically covered everything I had to say.
> On Dec 12, 2020, at 9:20 PM, Doug Ewell via Unicode wrote:
>
> Others have covered pretty much everything I was going to respond to Christian with.
>
> David Starner wrote:
>
>> I'd like to circle back to what I think is the core reason: we've had
>> character sets for seven decades, virtually all of which supported
>> English, and if any have supported italics, I've never heard of it.
>
> The only conceivable exception might be ISO-IR-68, which represented APL and its distinctive italic uppercase letters. The registration [1] for ISO-IR-68 named the letters (e.g.) "CAPITAL APL LETTER A" and noted that they were "[u]sually printed in italics," revealing that this was merely a font preference specific to APL, as the corresponding roman (non-italic) letters were not also included. All mapping tables from ISO-IR-68, including Unicode's, map the italic APL letters to normal ASCII letters.
>
> [1] https://www.itscj.ipsj.or.jp/iso-ir/068.pdf
>
> Christian wrote:
>
>> If so, why does unicode allow those [uppercase and lowercase letters]
>> to be formatted differently?
>
> For "formatted differently" I read "encoded separately"; Unicode doesn't dictate whether characters are displayed in an upright (roman) or italic style. If one uses George Douros's Akkadian font, for example, everything comes out in italics.
>
> I wonder if the spelling "unicode" was meant here as a statement about the semantics of initial capitals. Standard English orthography requires that trade names like "Unicode" be spelled with an initial capital, whereas no orthographic requirement exists to spell anything with italics.
>
> We do understand that not every possible nuance of human communication, such as shades of emphasis, can be expressed in plain text. It seems that the nearly 30-year-old Unicode definition of "plain text" still has not caught on universally, since requests continue to emerge for UTC to encode things that are not plain text by that definition.
>
> --
> Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
>
>
>
From asmusf at ix.netcom.com Sat Dec 12 22:03:58 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 12 Dec 2020 20:03:58 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
<8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
<7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
<8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
<000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
Message-ID: <8168fd17-31c3-5c23-d94d-7864bbd455a9@ix.netcom.com>
On 12/12/2020 7:20 PM, Doug Ewell via Unicode wrote:
> We do understand that not every possible nuance of human communication, such as shades of emphasis, can be expressed in plain text. It seems that the nearly 30-year-old Unicode definition of "plain text" still has not caught on universally, since requests continue to emerge for UTC to encode things that are not plain text by that definition.
If you go against an established method/truth/system/consensus/anything
and win, you'll be famous. That's the lure that keeps people up at night
trying to create a /perpetuum mobile/.
The problem is that the chances of winning are usually more than elusive.
Doesn't prevent people from trying.
If conservation of energy, posited by Julius von Mayer in 1842 and
well-tested in the over 150 years since then, does not prevent people
trying the impossible, then why should 30 years of Unicode be sufficient :)
A./
From sosipiuk at gmail.com Sat Dec 12 23:28:56 2020
From: sosipiuk at gmail.com (=?utf-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Sun, 13 Dec 2020 00:28:56 -0500
Subject: Italics get used to express important semantic meaning,
so unicode should support them
In-Reply-To:
References:
Message-ID: <002401d6d110$db4ac280$91e04780$@gmail.com>
I mostly agree with the general consensus, though probably not as firmly. However, I had a showerthought that, specifically in the case of Latin terms, marking them as such would be a legitimate use of the Unicode language tags. Indeed, an indication of "this is Latin text" would be more correct and future-proof than "this is italicized", since the proper styling to indicate Latin text may change with the times, and because tags are default-ignorable, this approach would still be compatible with "plain text" programs. The wiki (or whatever software) could be made to italicize Latin-within-English text that is tagged as such.
I know the tags are officially deprecated, but I personally think they got a bad rap. If - and that is a big if - a system for basic formatting (italic/bold/underlined/nonspecifically-emphasized) is ever implemented in Unicode, it should be via the default-ignorable tags.
Sławomir Osipiuk
From asmusf at ix.netcom.com Sat Dec 12 23:45:54 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 12 Dec 2020 21:45:54 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <002401d6d110$db4ac280$91e04780$@gmail.com>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
Message-ID: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
An HTML attachment was scrubbed...
URL:
From richard.wordingham at ntlworld.com Sun Dec 13 18:51:56 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 14 Dec 2020 00:51:56 +0000
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To:
References:
Message-ID: <20201214005156.6125d895@JRWUBU2>
On Fri, 11 Dec 2020 20:14:08 -0800
Zach Lym via Unicode wrote:
> > A string X is a canonical caseless match for a string Y if and only
> > if: NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
> Even though
> Unicode explains that "case folding is closed under canonical
> normalization" it took me a long time to find that passage and
> convince myself that the W3C and Unicode matching algorithms are
> equivalent.
What does that quoted statement mean? I'm having a hard job working
out what the meaning of full case folding is. I'm not having any
doubts about the meaning of toCasefold(NFD(X)), so there is no issue
for 'canonical caseless matching'.
Richard.
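For readers following along, the quoted definition of a canonical caseless match can be sketched in Python. This is only an illustration under stated assumptions: Python's str.casefold() (which applies full case folding) stands in for toCasefold, unicodedata.normalize() stands in for NFD, and the function name canonical_caseless_match is made up for this note rather than taken from the standard or from the W3C text.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition (NFD) via the standard library."""
    return unicodedata.normalize("NFD", s)


def canonical_caseless_match(x: str, y: str) -> bool:
    """Sketch of NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y))).

    Assumption: str.casefold() is used as a stand-in for toCasefold.
    """
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())


if __name__ == "__main__":
    # U+00C5 (precomposed) versus "a" followed by U+030A COMBINING RING ABOVE.
    print(canonical_caseless_match("\u00c5", "a\u030a"))
    # Full case folding maps U+00DF to "ss".
    print(canonical_caseless_match("Stra\u00dfe", "STRASSE"))
    # Strings that differ in a base letter do not match.
    print(canonical_caseless_match("abc", "abd"))
```

The inner NFD is applied before folding so that canonically equivalent inputs reach the folding step in the same form; the outer NFD cleans up anything the folding step leaves non-normalized.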
From indolering at gmail.com Sun Dec 13 22:08:08 2020
From: indolering at gmail.com (Zach Lym)
Date: Sun, 13 Dec 2020 20:08:08 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <20201214005156.6125d895@JRWUBU2>
References:
<20201214005156.6125d895@JRWUBU2>
Message-ID:
> What does that quoted statement mean? I'm having a hard job working
> out what the meaning of full case folding is. I'm not having any
> doubts about the meaning of toCasefold(NFD(X)), so there is no issue
> for 'canonical caseless matching'.
The "case folding is closed under canonical normalization" or the other part?
Closed as in closure: https://en.wikipedia.org/wiki/Closure_(mathematics)
Refer to page 240 of the standard, Chapter 5 "Implementation
Guidelines" Section 18 "Case Mappings":
http://www.unicode.org/versions/latest/ch05.pdf
From marius.spix at web.de Mon Dec 14 05:26:47 2020
From: marius.spix at web.de (Marius Spix)
Date: Mon, 14 Dec 2020 12:26:47 +0100
Subject: Aw: Re: Normalization Generics (NFx, NFKx, NFxy)
Message-ID:
An HTML attachment was scrubbed...
URL:
From harjitmoe at outlook.com Mon Dec 14 08:22:59 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 14 Dec 2020 14:22:59 +0000
Subject: Aw: Re: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To:
References:
Message-ID:
Marius Spix via Unicode wrote:
> I understand that:
> [:toCaseFold=s:] = [sS?]
> [:toCaseFold=?:] = [???]
> But can someone explain me the following?
> [:toCaseFold=?:] = [?]
> [:toCaseFold=i:] = [iI]
> [:toCaseFold=?:] = []
> Why is it not:
> [:toCaseFold=?:] = [iI?]
> [:toCaseFold=i:] = [iI?]
> [:toCaseFold=?:] = [??]
> ?
>
ß is often changed to SS in uppercase; the ẞ is a relatively new
addition as an encoded character and is not consistently used. So
PREUSSEN and Preußen are casings of the same word, for example. I think
ẞ might have been added after ß's casefolding was already defined, but
I'm not sure so don't quote me on that.
"I" cannot casefold to *both* "i" and "ı", it has to casefold to one of
them. Not sure about "?" not casefolding the same as "I", but I don't
suppose there really exists any "good" locale-independent solution for
case insensitivity of "I".
--Har.
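The folding behaviour discussed above can be checked with a few lines of Python. This is a sketch, assuming that str.casefold() reflects the default (locale-independent) full case folding; it does not apply the Turkic-specific mappings, and the helper names below are invented for this note.

```python
def code_points(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# Characters mentioned in the thread: sharp s (both cases), long s, and the
# dotted/dotless i family.
SAMPLES = ["\u00df", "\u1e9e", "\u017f", "I", "\u0130", "\u0131"]


def fold_report() -> dict:
    """Map each sample character to its default case folding."""
    return {ch: ch.casefold() for ch in SAMPLES}


if __name__ == "__main__":
    for ch, folded in fold_report().items():
        print(f"{code_points(ch):>8} -> {code_points(folded)}")
```

Under the default folding, both sharp s characters fold to "ss", "I" folds to "i", U+0130 folds to "i" plus U+0307, and U+0131 is left unchanged, which is consistent with "I" having to pick exactly one target.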
From sosipiuk at gmail.com Mon Dec 14 11:02:26 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Mon, 14 Dec 2020 12:02:26 -0500
Subject: Italics get used to express important semantic meaning,
so unicode should support them
In-Reply-To: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
References: <002401d6d110$db4ac280$91e04780$@gmail.com> <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
Message-ID: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
On Sun, Dec 13, 2020 at 12:47 AM Asmus Freytag via Unicode wrote:
>
> Write a killer social media app that uses these in an integral fashion and requires them for interoperability and then sit back and watch how long they stay deprecated ...
That, or perhaps something like Wikidata could use it. ;)
I slept on it, and I'm leaning to the other side now. I think of the paper books I've read, and italics often appear within the text. Are the books "plain text"? Do the italics really fall into the category of typesetting and style, like the choice of overall font? Or are they a meaningful part of the text itself? Should it be possible to fit the content of a whole novel into a .txt file without losing any semantic meaning? The "spirit of Unicode" whispers that it should. Of course some books contain charts and graphics, and Unicode can't do everything, but if a solution can cover 95% of cases, it at least deserves consideration.
On Fri, Dec 11, 2020 at 1:13 PM Christian Kleineidam via Unicode wrote:
>
> Create a new unicode character for begin/end italic formatting and begin/end bold formatting that works like the unicode character for the Right-to-Left switch.
If you or someone else chooses to make a proposal, my own recommendation would be this:
- Assign a new character U+E0002 FORMAT TAG
- The syntax follows the specification for tagging (section 23.9 of the standard)
- U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting.
- U+E0002 U+E007F CANCEL TAG to cancel all formatting
- Any use of U+E0002 overrides previous formatting (i.e. a "bold" tag alone cancels a previous "italic" tag), so format nesting must be done by combining all desired formats into a single tag.
- This method should only be used in cases where formatting is required without a higher-level protocol
- This method should not be used in instances where loss of formatting would greatly alter the meaning of the text or render it incomprehensible.
- Strikethrough and super/subscript are deliberately omitted for the above reason.
Advantages:
- Only a single new character needs definition.
- Uses an existing framework (tags)
- Formatting is ignorable, implementation is optional
- A viable method to preserve 95%+ of typical semantic formatting in plain-text
- IMO a stronger case to have this than either language tags or annotations (argument is to accurately preserve the lot of existing documents that include rudimentary formatting, rather than just invent new features).
Disadvantages:
https://xkcd.com/927/
Sławomir Osipiuk
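The recommendation above might look roughly like this in practice. This is a purely hypothetical sketch: U+E0002 is not an assigned character, FORMAT TAG does not exist in Unicode, and the function names are invented for illustration. The style code points are taken from the message as written (note that U+E0079 is the tag counterpart of ASCII "y"; the counterpart of "u" would be U+E0075).

```python
# Hypothetical: U+E0002 is unassigned; the message proposes it as FORMAT TAG.
FORMAT_TAG = "\U000E0002"
# U+E007F CANCEL TAG is an existing character.
CANCEL_TAG = "\U000E007F"

# Style code points exactly as listed in the message.
STYLE_TAGS = {
    "bold": "\U000E0062",
    "emphatic": "\U000E0065",
    "italic": "\U000E0069",
    "underlined": "\U000E0079",
}


def open_format(*styles: str) -> str:
    """Build one tag carrying all desired styles (nesting is not allowed)."""
    return FORMAT_TAG + "".join(STYLE_TAGS[s] for s in styles)


def cancel_format() -> str:
    """Build the sequence that cancels all formatting."""
    return FORMAT_TAG + CANCEL_TAG


def strip_tags(text: str) -> str:
    """What a tag-unaware program effectively shows: tag block removed."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


if __name__ == "__main__":
    marked = "Call me " + open_format("italic") + "Ishmael" + cancel_format() + "."
    print(strip_tags(marked))
    print(len(marked) - len(strip_tags(marked)), "tag characters")
```

Because the whole U+E0000..U+E007F block is default-ignorable, a program that knows nothing about the scheme would still display the underlying text, which is the compatibility argument made in the list of advantages.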
From abrahamgross at disroot.org Mon Dec 14 11:15:34 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Mon, 14 Dec 2020 17:15:34 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:
Wait till u see signwriting. now u can draw full on pictures in unicode
Dec 14, 2020 12:04:14 PM Sławomir Osipiuk via Unicode :
> Of course some books contain charts and graphics, and Unicode can't do everything,
>
From wjgo_10009 at btinternet.com Mon Dec 14 10:29:30 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Dec 2020 16:29:30 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
Message-ID: <6e3feae1.712.1766215e59a.Webtop.210@btinternet.com>
Asmus Freytag wrote:
> But you need to be successful first :)
Indeed.
An invention of mine, a container for which needs encoding into Unicode
in order to achieve successful unambiguous interoperability, has been
banned from being discussed in this mailing list and blocked from going
before The Unicode Technical Committee because it has been deemed
without explanation to be "out of scope".
Yet scope can change according to need, yet discussion of scope also has
been blocked from being discussed in the mailing list.
My posts have been placed on permanent moderated posts status so as to
stop such discussion taking place.
So, at present, the bar is far too high for me to be able to achieve my
goal of successful unambiguous interoperability for the invention.
Unless discussion and fair consideration by The Unicode Technical
Committee is allowed then that success will be impossible and a
futuristic invention will never achieve its full potential.
Encoding into Unicode would also guarantee that the technique is applied
in a non-proprietary manner.
William Overington
Monday 14 December 2020
From wjgo_10009 at btinternet.com Mon Dec 14 11:19:20 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Dec 2020 17:19:20 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID: <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
Sławomir Osipiuk wrote:
> Of course some books contain charts and graphics, and Unicode can't do
> everything, ...
In my opinion, Unicode could include charts and graphics by encoding
them within a plain text stream if people wanted that to be encoded.
William Overington
Monday 14 December 2020
From kenwhistler at sonic.net Mon Dec 14 13:19:12 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 14 Dec 2020 11:19:12 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
<3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
Message-ID: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
On 12/14/2020 9:19 AM, William_J_G Overington via Unicode wrote:
> In my opinion, Unicode could include charts and graphics by encoding
> them within a plain text stream if people wanted that to be encoded.
You mean, as in the following sequence, shown here as a stream of plain
text characters in email?
And interpreted in the following document, in context, as HTML:
https://www.unicode.org/reports/tr51/#Major_Sources
--Ken
P.S. for the nitpickers... yeah, yeah, I realize that this email is
delivered as HTML, so the "plain text" is itself using quoting
conventions to embed in the HTML email. If you want this redelivered as
actual plain text, I could accommodate.
From markus.icu at gmail.com Mon Dec 14 13:41:13 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 14 Dec 2020 11:41:13 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
<3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
<21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
Message-ID:
On Mon, Dec 14, 2020 at 11:25 AM Ken Whistler via Unicode <
unicode at unicode.org> wrote:
> P.S. for the nitpickers... yeah, yeah, I realize that this email is
> delivered as HTML, so the "plain text" is itself using quoting conventions
> to embed in the HTML email. If you want this redelivered as actual plain
> text, I could accommodate.
>
No need. I can confirm that your email was sent as
*Content-Type: multipart/alternative*;
boundary="------------7278B446390F3BC66C4D83C4"
And that the first part is in
*Content-Type: text/plain*; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
So we are all good.
(For Gmail users: Three-dot "More" menu on the specific message, select
"Show original")
Thanks,
markus
From costello at mitre.org Mon Dec 14 14:17:48 2020
From: costello at mitre.org (Roger L Costello)
Date: Mon, 14 Dec 2020 20:17:48 +0000
Subject: Is there a difference between converting a string of ASCII digits to
an integer versus a string of non-ASCII digits to an integer?
Message-ID:
Hi Folks,
As I understand it, when the C programming language was created it just used ASCII. Programs written in C used ASCII digits.
Nowadays C supports Unicode and Unicode contains more digits than just the ASCII digits. (I think) modern C programs can express numbers using strings of non-ASCII digits.
Questions:
1. Is the algorithm for converting a string that contains non-ASCII digits different than the algorithm for converting a string containing ASCII digits?
2. The C function atoi() converts a string of digits to a number. I have seen the source code for atoi(). The source code that I saw was dated around the year 2000. Can you point me to the modern source code for atoi()?
/Roger
From wjgo_10009 at btinternet.com Mon Dec 14 15:48:07 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Dec 2020 21:48:07 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
<3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
<21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
Message-ID: <31d43ca1.a6e.176633998ad.Webtop.216@btinternet.com>
Ken Whistler wrote as follows.
> You mean, as in the following sequence, shown here as a stream of
> plain text characters in email?
> And interpreted in the following document, in context, as HTML:
> https://www.unicode.org/reports/tr51/#Major_Sources
Actually no, for bitmap images I was thinking of a tag character
sequence method that was proposed in a document in The Unicode Technical
Committee Document Register some time ago that would directly embed an
image in a plain text file, no external link. It was not authored by me,
I cannot find it at present.
For vector graphics I was thinking of a tag character version of the
eutographics system that I devised back in 2002. (Please note that that
is eutographics, not eurographics, as which it has sometimes been
incorrectly described.)
http://www.users.globalnet.co.uk/~ngo/ast03000.htm
It worked well locally using a Java applet in a web page.
So, if The Unicode Technical Committee were to include these ideas in
Unicode, then Unicode could enable much more information to be
communicated unambiguously and interoperably in a plain text file.
William Overington
Monday 14 December 2020
Please note that the email address used in the listings in the
eutographics web page is not in regular use these days.
From harjitmoe at outlook.com Mon Dec 14 17:03:36 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 14 Dec 2020 23:03:36 +0000
Subject: Is there a difference between converting a string of ASCII digits
to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To:
References:
Message-ID:
Roger L Costello via Unicode wrote:
> [...]
> 2. The C function atoi() converts a string of digits to a number. I have seen the source code for atoi(). The source code that I saw was dated around the year 2000. Can you point me to the modern source code for atoi()?
>
> /Roger
Here is the implementation from the FreeBSD libc:
https://github.com/freebsd/freebsd/blob/master/lib/libc/stdlib/strtol.c
(|strtol| and |strtol_l| are defined in that source file. |atoi| and
|atoi_l| just wrap them, passing |NULL| for |endptr| and |10| for |base|.)
--Har.
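On Roger's first question, the positional algorithm itself need not change; what changes is the step that maps a character to its digit value. The following is a Python illustration of that idea, not a rendering of the FreeBSD code linked above. It assumes that unicodedata.decimal() is an acceptable source of digit values, and the name parse_decimal is invented for this note.

```python
import unicodedata


def parse_decimal(text: str) -> int:
    """Accumulate a base-10 value from decimal digits of any script.

    Simplified on purpose: no sign, no whitespace skipping, no overflow
    handling, and an empty string yields 0.
    """
    value = 0
    for ch in text:
        # Returns the digit value for any character with a decimal value,
        # or None for everything else.
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            raise ValueError(f"not a decimal digit: U+{ord(ch):04X}")
        value = value * 10 + digit
    return value


if __name__ == "__main__":
    print(parse_decimal("2020"))                      # ASCII digits
    print(parse_decimal("\u0662\u0660\u0662\u0660"))  # Arabic-Indic digits
    print(parse_decimal("\u0967\u0968\u0969"))        # Devanagari digits
```

For ASCII-only input the lookup reduces to the familiar subtraction of the code for "0", so the loop is the same in both cases.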
From mark at kli.org Mon Dec 14 18:59:33 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 14 Dec 2020 19:59:33 -0500
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:
An HTML attachment was scrubbed...
URL:
From sosipiuk at gmail.com Mon Dec 14 22:36:03 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Mon, 14 Dec 2020 23:36:03 -0500
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:
On Mon, Dec 14, 2020 at 8:05 PM Mark E. Shoulson via Unicode
wrote:
>
> All TAG symbols placed between a U+E003D TAG LESS-THAN SIGN and a U+E003E TAG GREATER-THAN SIGN, inclusive, are to be treated as if they were their corresponding ASCII characters, and run that through an HTML renderer. I guess if you wanted you could stipulate some reduced or restricted subset of HTML
I've been informed off-list that BabelPad uses this as a formatting
option. So, it's been done.
This solution technically constitutes a higher-level protocol anyway.
It's a markup language, just using unusual characters, but it's not in
any fundamental way a Unicode feature, official or not.
> If this sounds disturbing and wrong to you,
Disturbing? No. Wrong? I'd say "not my first choice". There are plenty
of things already approved that actually disturb me, but I won't go on
that tangent now.
> then other pseudo-markup ideas probably should as well.
Pseudo-markup already exists in Unicode, in multiple, inconsistent
ways. It exists because it was, at some point, by some people, deemed
useful enough and compatible enough with the aims of Unicode to be
included. I'm boggled by how annotations got in.
I'm well aware of scope creep and I'm not at all in favour of making
Unicode a Turing-complete programming language. That's why I proposed
something that fits into an already-established method that Unicode
has already defined. It even includes a bit of syntactic salt in the
way format nesting must be done that drives implementers to other
protocols for anything beyond rudimentary effects.
My guiding example is, "record fully the story text of a paperback
novel". There are things that are irrelevant for this purpose, such as
choice of font, or drop caps ("fancy first letters"), or page numbers,
or sizing of chapter titles, etc., etc. Even something like
monospaced text is almost always used purely stylistically (to
indicate in-story things like signage, computer output, telegrams.)
and can be substituted with imagination by engaged readers. But
italics or underlines are often a meaningful part of text and
something is lost when that formatting is lost. Necessitating a
higher-level protocol for something so simple, when it can be easily
accommodated through an existing Unicode framework, is needlessly
conservative.
The thread-starter, Christian Kleineidam, gave a different use case
but I think it's a valid one as well. I think this would be an easy
win with not a whole lot of downside.
Reading the room here, not many agree. C'est la vie.
Cheers,
Sławomir Osipiuk
From beckiergb at gmail.com Tue Dec 15 00:47:33 2020
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Mon, 14 Dec 2020 22:47:33 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
<8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
<7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID:
On Sat, Dec 12, 2020 at 6:03 PM David Starner via Unicode <
unicode at unicode.org> wrote:
> we've had character sets for seven decades,
> virtually all of which supported English, and if any have supported
> italics, I've never heard of it.
ISCII 1991 had a mechanism called ATR Codes for applying styles and
switching character sets (see Annex E of
http://varamozhi.sourceforge.net/iscii91.pdf):
EF 30 - bold
EF 31 - italic
EF 32 - underline
EF 33 - expanded
EF 34 - highlight
EF 35 - outline
EF 36 - shadow
EF 37 - double height, top half
EF 38 - double height, bottom half
EF 39 - double height and width
Many character sets from 8-bit microcomputers had "inverse" or "reverse
video" characters that were treated as distinct from their "normal video"
counterparts. When we proposed encoding these, as atomic characters or
using variation sequences or by any other means, the UTC shot down the idea
completely.
The existence of prior character sets, even when one is a government
standard, is not enough to get stylistic differences like italics or
reverse video into Unicode.
-- Rebecca Bettencourt
From markus.icu at gmail.com Tue Dec 15 11:31:36 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 15 Dec 2020 09:31:36 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
<8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
<7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID:
On Mon, Dec 14, 2020 at 10:54 PM Rebecca Bettencourt via Unicode <
unicode at unicode.org> wrote:
> Many character sets from 8-bit microcomputers had "inverse" or "reverse
> video" characters that were treated as distinct from their "normal video"
> counterparts. When we proposed encoding these, as atomic characters or
> using variation sequences or by any other means, the UTC shot down the idea
> completely.
>
Early computing systems conflated layers of processing where modern ones
separate them. For example, a quarter of ASCII and of EBCDIC, respectively,
was used for control codes which we inherited but which are now mostly
unused because we use lower-level mechanisms instead that carry text purely
as payload.
I think the plain text / rich text distinction has been quite successful. I
don't actually personally like the math-styled characters because they seem
specific to a particular math tradition. When I was in high school, the
vector-math teacher gave us a choice between the old style of using
Fraktur/Sütterlin for vector variables vs. the new style of regular letters
with an arrow on top. "Vector" markup with different style choices seems
better for this kind of thing.
markus
From richard.wordingham at ntlworld.com Tue Dec 15 13:10:01 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 15 Dec 2020 19:10:01 +0000
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To:
References:
<20201214005156.6125d895@JRWUBU2>
Message-ID: <20201215191001.356f5795@JRWUBU2>
On Sun, 13 Dec 2020 20:08:08 -0800
Zach Lym via Unicode wrote:
> > What does that quoted statement mean? I'm having a hard job working
> > out what the meaning of full case folding is. I'm not having any
> > doubts about the meaning of toCasefold(NFD(X)), so there is no issue
> > for 'canonical caseless matching'.
>
> The "case folding is closed under canonical normalization" or the
> other part?
That part.
> Closed as in closure:
> https://en.wikipedia.org/wiki/Closure_(mathematics)
That only tells me what it means for a _set_ to be closed under an
operation. What does it mean for a _function_ (or similar) to be
closed under an operation?
If I must use the definition for a set, then I can only conclude that
for one operation to be closed under another operation, the result
should be independent of the order in which they are applied.
But for X = :
NFD(toCasefold(X)) =
toCasefold(NFD(X)) =
NFC(toCasefold(X)) =
toCasefold(NFC(X)) =
So either "case folding is closed under canonical normalization" means
something else, or it is simply not true.
> Refer to page 240 of the standard, Chapter 5 "Implementation
> Guidelines" Section 18 "Case Mappings":
>
> http://www.unicode.org/versions/latest/ch05.pdf
Why?
The trick is not to be deflected by the opening paragraph in TUS
Section 3.13, but to read on to find R4.
Richard.
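The code points in the example above did not survive archiving, so the following Python sketch uses an input chosen for this note; it is an assumption and not necessarily the string used in the message. It takes X = <U+1FB3, U+0342> and assumes str.casefold() as a stand-in for toCasefold, simply to show that folding and normalization can give different results depending on the order in which they are applied.

```python
import unicodedata


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def code_points(s: str) -> str:
    return "<" + ", ".join(f"U+{ord(ch):04X}" for ch in s) + ">"


# Assumed example: GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI followed by
# COMBINING GREEK PERISPOMENI.
X = "\u1fb3\u0342"

RESULTS = {
    "NFD(toCasefold(X))": nfd(X.casefold()),
    "toCasefold(NFD(X))": nfd(X).casefold(),
    "NFC(toCasefold(X))": nfc(X.casefold()),
    "toCasefold(NFC(X))": nfc(X).casefold(),
}

if __name__ == "__main__":
    for label, value in RESULTS.items():
        print(f"{label} = {code_points(value)}")
```

Folding first turns the iota subscript into a full iota before the perispomeni has been reordered, whereas normalizing first moves the perispomeni ahead of the subscript; applying NFD both before and after folding, as in the canonical caseless matching definition, removes the dependence on the input form.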
From indolering at gmail.com Tue Dec 15 14:04:00 2020
From: indolering at gmail.com (Zach Lym)
Date: Tue, 15 Dec 2020 12:04:00 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <20201215191001.356f5795@JRWUBU2>
References:
<20201214005156.6125d895@JRWUBU2>
<20201215191001.356f5795@JRWUBU2>
Message-ID:
Okay, so points for pedantry ... but do you have any input on adding
normalization generics to Unicode pseudocode?
Or would you like to split this discussion out into a new topic?
On Tue, Dec 15, 2020 at 11:21 AM Richard Wordingham via Unicode
wrote:
>
> On Sun, 13 Dec 2020 20:08:08 -0800
> Zach Lym via Unicode wrote:
>
> > > What does that quoted statement mean? I'm having a hard job working
> > > out what the meaning of full case folding is. I'm not having any
> > > doubts about the meaning of toCasefold(NFD(X)), so there is no issue
> > > for 'canonical caseless matching'.
> >
> > The "case folding is closed under canonical normalization" or the
> > other part?
>
> That part.
>
> > Closed as in closure:
> > https://en.wikipedia.org/wiki/Closure_(mathematics)
>
> That only tells me what it means for a _set_ to be closed under an
> operation. What does it mean for a _function_ (or similar) to be
> closed under an operation?
>
> If I must use the definition for a set, then I can only conclude that
> for one operation to be closed under another operation, the result
> should be independent of the order in which they are applied.
>
> But for X = COMBINING GREEK PERISPOMENI>:
>
> NFD(toCasefold(X)) = SMALL LETTER IOTA, U+0342>
>
> toCasefold(NFD(X)) =
>
> NFC(toCasefold(X)) = PERISPOMENI>
>
> toCasefold(NFC(X)) = U+03B9>
>
> So either "case folding is closed under canonical normalization" means
> something else, or it is simply not true.
>
> > Refer to page 240 of the standard, Chapter 5 "Implementation
> > Guidelines" Section 18 "Case Mappings":
> >
> > http://www.unicode.org/versions/latest/ch05.pdf
>
> Why?
>
> The trick is not to be deflected by the opening paragraph in TUS
> Section 3.13, but to read on to find R4.
>
> Richard.
From abrahamgross at disroot.org Tue Dec 15 14:52:35 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Tue, 15 Dec 2020 20:52:35 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
<8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
<7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID: <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
Unicode refused to encode Arabic letter variants (not counting compatibility characters), even though they are taught in school, adults use them, and they are how Arabic is written, so your argument here doesn't hold water.
Dec 12, 2020 8:26:10 PM Martin J. Dürst via Unicode :
> Children learn to write with upper case and lower case letters in school, and most people continue to use both as adults. (There are exceptions of course, some people write only with lower case, and some only with upper case.)
>
From indolering at gmail.com Tue Dec 15 16:28:55 2020
From: indolering at gmail.com (Zach Lym)
Date: Tue, 15 Dec 2020 14:28:55 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:
> If you or someone else chooses to make a proposal, my own recommendation would be this:
>
> - Assign a new character U+E0002 FORMAT TAG
> - The syntax follows the specification for tagging (chapter 23.9)
> - U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting.
How would one implement blink? I would consider that top priority, as
it was explicitly designed for styling plaintext.
From richard.wordingham at ntlworld.com Tue Dec 15 16:32:16 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 15 Dec 2020 22:32:16 +0000
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <000301d6d00e$4d244330$e76cc990$@ewellic.org>
References:
<000301d6d00e$4d244330$e76cc990$@ewellic.org>
Message-ID: <20201215223216.339e3a0a@JRWUBU2>
On Fri, 11 Dec 2020 15:38:07 -0700
Doug Ewell via Unicode wrote:
> Christian Kleineidam wrote:
>
> > "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
> > 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"
>
> "Evidence suggesting that Homo neanderthalensis contributed the H2
> MAPT haplotype to Homo sapiens"
>
> This title is completely meaningful in plain text. The convention to
> style the names of species and haplotypes in italics is just that, a
> styling convention.
Yet there are cases where meaning is completely lost. There was a
Latin script spelling for Pali and Sanskrit that used italicised
guttural letters for palatals, and italicised letters where nowadays we
normally have a dot below. I think this scheme was introduced by Max
Mueller. Thus, a Sanskrit sequence meaning 'and this' is written not
'tacca' but 'ta𝑘𝑘a'. (I naturally misread the latter as though it were
'takka'.) That naturally raises the question of how such italic letters
are to be italicised!
I've also seen phonetic respelling of English in the Thai script
where italicised consonants are used for English consonants for which
Thai has no equivalent.
When documenting programs, there is a massive gain in readability when
the lower case names of programs and variables are written out in a
typewriter-style font like Courier. (Some monospace fonts lack the
distinctiveness.)
Richard.
From kent.b.karlsson at bahnhof.se Tue Dec 15 17:07:05 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 16 Dec 2020 00:07:05 +0100
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:
(Below)
> On 14 Dec 2020, at 18:02, Sławomir Osipiuk via Unicode wrote:
> If you or someone else chooses to make a proposal, my own recommendation would be this:
>
> - Assign a new character U+E0002 FORMAT TAG
> - The syntax follows the specification for tagging (chapter 23.9)
> - U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting.
> - U+E0002 U+E007F CANCEL TAG to cancel all formatting
> - Any use of U+E0002 overrides previous formatting (i.e. a "bold" tag alone cancels a previous "italic" tag), so format nesting must be done by combining all desired formats into a single tag.
> - This method should only be used in cases where formatting is required without a higher-level protocol
> - This method should not be used in instances where loss of formatting would greatly alter the meaning of the text or render it incomprehensible.
> - Strikethrough and super/subscript are deliberately omitted for the above reason.
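To make the proposed syntax concrete, here is a minimal sketch of how such tagged text might be produced and ignored. Everything in it is hypothetical: U+E0002 is not an assigned character today, the helper names are invented for illustration, and the letter codes simply follow the list above (U+E0062, U+E0065, U+E0069, U+E0079 are the existing tag counterparts of ASCII b, e, i, y).

```python
# Hypothetical FORMAT TAG from the proposal; U+E0002 is unassigned today.
FORMAT_TAG = "\U000E0002"
# Existing CANCEL TAG character.
CANCEL_TAG = "\U000E007F"

# Letter codes taken from the proposal text above.
FORMAT_CODES = {"bold": "b", "emphatic": "e", "italic": "i", "underlined": "y"}


def tag_letter(ch: str) -> str:
    """Map a printable ASCII character to its counterpart in the Tags block."""
    return chr(0xE0000 + ord(ch))


def open_format(*formats: str) -> str:
    """Build one tag that carries all requested formats (no nesting)."""
    return FORMAT_TAG + "".join(tag_letter(FORMAT_CODES[f]) for f in formats)


def close_format() -> str:
    """FORMAT TAG followed by CANCEL TAG cancels all formatting."""
    return FORMAT_TAG + CANCEL_TAG


def strip_tags(text: str) -> str:
    """What a non-interpreting display effectively shows: tags dropped."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


plain = "contributed to Homo sapiens"
marked = "contributed to " + open_format("italic") + "Homo sapiens" + close_format()
print(strip_tags(marked) == plain)
```

A process that does not interpret the tags can leave them in place or drop them for display; either way the visible text is unchanged, which is the "formatting is ignorable" property the proposal relies on.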
Now, where did I see something very much like this???
...
...
Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the "invisible by default" ("default ignorable") part, IF parsed correctly). And... ECMA-48 is already a standard. And... ECMA-48 is already successful, and still used every day by very many people, though primarily in terminal emulators. (Nit: ECMA-48 does have strikethrough... and more. So does HTML/CSS, and when doing "copy as plain text", that formatting disappears too.)
Your U+E0002 FORMAT TAG: ECMA-48 CSI ... m
Your U+E0062 (bold): ECMA-48 CSI 1m
Your U+E0065 (emphatic): don't know what you mean by that
Your U+E0069 (italic): ECMA-48 CSI 3m
Your U+E0079 (underlined): ECMA-48 CSI 4m
Your U+E007F CANCEL TAG: ECMA-48 CSI 0m
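For readers who have not used these sequences directly, a minimal sketch follows. It assumes a terminal emulator that honours ECMA-48 SGR sequences; the helper names are invented for this illustration. The stripping pattern follows the general CSI shape (parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F, one final byte 0x40-0x7E), which is what "parsed correctly" amounts to for a reader that wants to ignore the formatting.

```python
import re

CSI = "\x1b["  # 7-bit form of Control Sequence Introducer: ESC [
SGR = {"bold": 1, "italic": 3, "underlined": 4}  # parameters from the table above
RESET = CSI + "0m"  # CSI 0m cancels all rendition attributes


def styled(text: str, *styles: str) -> str:
    """Wrap text in one SGR sequence carrying all requested styles."""
    params = ";".join(str(SGR[s]) for s in styles)
    return f"{CSI}{params}m{text}{RESET}"


# Any CSI control sequence: parameter bytes, intermediate bytes, final byte.
CSI_SEQUENCE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text: str) -> str:
    """Recover the plain text by removing every CSI control sequence."""
    return CSI_SEQUENCE.sub("", text)


line = "contributed to " + styled("Homo sapiens", "italic")
print(line)             # rendered in italics on a terminal that supports SGR 3
print(strip_csi(line))  # the same text with the formatting ignored
```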
It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the "default ignorable" part of this a bit easier.
Extra nit: Some markdowns (however did that name stick?) allow for strikethrough as well, as -stricken-. Though somewhat intuitive, it far too often has an unexpected effect where no strikethrough was intended (try doing "ls -l" in your Linux terminal, and paste the result into some place that has that kind of markdown).
"Math Italic" is a hack for MathML. If done right, MathML would not have needed it either. "Math Italic" for emphasis in running text (not MathML) only "works" (sort of, and partially) for English, and for nearly no other language. Please don't use the "Math italic/bold/etc." characters outside of MathML.
/Kent Karlsson
PS
First edition of ECMA-48 came in 1976. About 44 years ago.
> Advantages:
> - Only a single new character needs definition.
> - Uses an existing framework (tags)
> - Formatting is ignorable, implementation is optional
> - A viable method to preserve 95%+ of typical semantic formatting in plain-text
> - IMO a stronger case to have this than either language tags or annotations (argument is to accurately preserve the lot of existing documents that include rudimentary formatting, rather than just invent new features).
>
> Disadvantages:
> https://xkcd.com/927/
>
> Sławomir Osipiuk
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From billposer2 at gmail.com Tue Dec 15 17:10:07 2020
From: billposer2 at gmail.com (Bill Poser)
Date: Tue, 15 Dec 2020 15:10:07 -0800
Subject: Is there a difference between converting a string of ASCII digits
to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To:
References:
Message-ID:
What do you mean by "non-ASCII digits"? Things like superscript and
subscript versions of the usual Western "Arabic" numerals? Or are you
talking about numbers like those of Chinese, Roman numerals, Tamil, etc.?
In the case of the former, once you map the digits to their standard forms,
the algorithm is the same. In the case of the latter, no, in many cases
very different algorithms are required.
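A minimal sketch of the first case, assuming that "map the digits to their standard forms" can be approximated with NFKC compatibility normalization (that choice, and the function name, are assumptions of this example, not something stated above):

```python
import unicodedata


def parse_compat_digits(s: str) -> int:
    """Fold superscript/subscript digits to ASCII, then parse as usual."""
    folded = unicodedata.normalize("NFKC", s)
    # After folding, the ordinary ASCII-only algorithm applies unchanged.
    if not (folded.isascii() and folded.isdigit()):
        raise ValueError(f"not a plain digit string after folding: {s!r}")
    return int(folded)


print(parse_compat_digits("\u00b9\u00b2\u00b3"))  # superscript one, two, three
print(parse_compat_digits("\u2084\u2082"))        # subscript four, two

# The second case does not reduce to this: ROMAN NUMERAL TWELVE folds to the
# letters "XII", which still need a separate, non-positional algorithm.
try:
    parse_compat_digits("\u216b")
except ValueError as exc:
    print("rejected:", exc)
```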
On Mon, Dec 14, 2020 at 12:28 PM Roger L Costello via Unicode <
unicode at unicode.org> wrote:
> Hi Folks,
>
> As I understand it, when the C programming language was created it just
> used ASCII. Programs written in C used ASCII digits.
>
> Nowadays C supports Unicode and Unicode contains more digits than just the
> ASCII digits. (I think) modern C programs can express numbers using strings
> of non-ASCII digits.
>
> Questions:
>
> 1. Is the algorithm for converting a string that contains non-ASCII digits
> different than the algorithm for converting a string containing ASCII
> digits?
>
> 2. The C function atoi() converts a string of digits to a number. I have
> seen the source code for atoi(). The source code that I saw was dated
> around the year 2000. Can you point me to the modern source code for atoi()?
>
> /Roger
>
>
From mark at kli.org Tue Dec 15 17:26:31 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 15 Dec 2020 18:26:31 -0500
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID: <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
An HTML attachment was scrubbed...
URL:
From markus.icu at gmail.com Tue Dec 15 17:45:11 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 15 Dec 2020 15:45:11 -0800
Subject: Is there a difference between converting a string of ASCII digits
to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To:
References:
Message-ID:
I suspect that Roger is just looking at decimal digits (property gc=Nd).
I believe that they can all be parsed like strings of ASCII digits (and you
can call ICU or other libraries to get at the digit values and other
properties).
I suggest you double-check about the RTL digits (N'Ko & Adlam); please take
a look at the relevant Unicode book chapters.
What's more interesting is handling the grouping and decimal separators
which differ by both language and region.
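As a rough illustration of parsing gc=Nd digits "like strings of ASCII digits", here is a sketch that uses Python's unicodedata module rather than ICU. It assumes digits are stored most significant first and it does not handle grouping or decimal separators at all; the function name is invented for this example.

```python
import unicodedata


def parse_decimal_digits(s: str) -> int:
    """Parse a string made only of gc=Nd characters, most significant first."""
    if not s:
        raise ValueError("empty string")
    value = 0
    for ch in s:
        if unicodedata.category(ch) != "Nd":
            raise ValueError(f"not a decimal digit: U+{ord(ch):04X}")
        # Same positional algorithm as for ASCII; only the digit lookup differs.
        value = value * 10 + unicodedata.decimal(ch)
    return value


print(parse_decimal_digits("2020"))
print(parse_decimal_digits("\u0663\u0664\u0665"))  # Arabic-Indic digits 3, 4, 5
print(parse_decimal_digits("\u0967\u0968\u0969"))  # Devanagari digits 1, 2, 3
```

Python's built-in int() already accepts such strings as well, so the explicit loop is here only to show that the algorithm itself does not change.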
Best regards,
markus
From sosipiuk at gmail.com Tue Dec 15 18:41:09 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Tue, 15 Dec 2020 19:41:09 -0500
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To: <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
<8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
Message-ID:
On Tue, Dec 15, 2020 at 6:26 PM Mark E. Shoulson wrote:
>
> But how is that different from anything being proposed? If this idea were accepted as part of Unicode, then it *would* be a feature of Unicode, just as whatever is being proposed would be if it were accepted. How does it matter if italicizing something is marked by some new U+DEADBF characters or by existing tag characters?
- Rather than a completely new method, it's "just" an extension of an
existing feature. (Tag syntax, scope, and default ignorability are
already defined in the Unicode standard)
- The syntax "naturally" discourages complicated format nesting.
Unicode may formally restrict format combos.
> If you insist that Unicode-compliant text readers must show italics or bold when marked with such-and-such characters,
Absolutely not!
> Conversely, if you're okay with pseudo-markup, this should sound fine to you. Why doesn't it?
"Not my first choice" is what I said. It's not bad, but its similarity
to HTML is not a good thing in my eyes, because it raises the question
"I can do this in HTML, why can't I do it in UnicodeML??" and invites a push for
more and more HTML features to be included. It encourages feature
creep, which I said I'm against. Familiarity is not always a good
thing.
> (how would this markup interact with other markup, like HTML, I wonder?)
(From the Unicode Standard, page 916, with [] additions by me; notice
how little the text changes)
"The rules for Unicode conformance for the tag characters are exactly
the same as those for any other Unicode characters. A conformant
process is not required to interpret the tag characters. If it does
interpret them, it should interpret them according to the standard -
that is, as spelled-out tags. However, there is no requirement to
provide a particular interpretation of the text because it is tagged
with a given language [or formatting]. If an application does not
interpret tag characters, it should leave their values undisturbed and
do whatever it does with any other uninterpreted characters.
[...]
"Implementations of Unicode that already make use of out-of-band
mechanisms for language [or format] tagging or "heavy-weight" in-band
mechanisms such as XML or HTML will continue to do exactly what they
are doing and will ignore the tag characters completely. They may even
prohibit their use to prevent conflicts with the equivalent markup."
Sławomir Osipiuk
From richard.wordingham at ntlworld.com Tue Dec 15 18:42:29 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 16 Dec 2020 00:42:29 +0000
Subject: Is there a difference between converting a string of ASCII
digits to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To:
References:
Message-ID: <20201216004229.51af1612@JRWUBU2>
On Tue, 15 Dec 2020 15:45:11 -0800
Markus Scherer via Unicode wrote:
> I suspect that Roger is just looking at decimal digits (property
> gc=Nd).
> I believe that they can all be parsed like strings of ASCII digits
> (and you can call ICU or other libraries to get at the digit values
> and other properties).
> I suggest you double-check about the RTL digits (N'Ko & Adlam);
> please take a look at the relevant Unicode book chapters.
It looks as though the N'Ko section documents the significance by
accident! I thought a policy was going to be documented (2012 or
slightly later) that decimal digits are stored most significant
digit first, but that doesn't seem to have happened.
Richard.
From sosipiuk at gmail.com Tue Dec 15 19:14:42 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Tue, 15 Dec 2020 20:14:42 -0500
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:
On Tue, Dec 15, 2020 at 6:07 PM Kent Karlsson
wrote:
> Now, where did I see something very much like this???
> Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the "invisible by default" ("default ignorable") IF parsed correctly).
ECMA-48 aka ISO 6429 was on my mind the moment I read the OP. I didn't
mention it because it's a bit outdated (even if I do have a fondness
for it) and if you're using such a thing, why not a more modern HTML
subset, or BBCode, or any number of other options in use or from the
list the OP gave? There are, after all, so many to choose from. And if
none of those satisfy, you can always make your own!
But that "if parsed correctly" is quite the nit, isn't it?
> It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the "default ignorable" part of this a bit easier.
And this is just the BabelPad solution but applied to a different
protocol. Replacing regular markup by corresponding characters from
the tag block to gain ignorable-ness may seem like a cool idea at
first, but it's just spinning yet another markup. (With no offense
intended to BabelPad's author; it's not a bad idea except that it
starts at the bottom of the mountain just like any other.) Tag syntax
is already part of Unicode. I'd rather use it than import something
wholesale from another protocol.
Finally, what I'm envisioning - and I'm not sure how closely this
matches Christian Kleineidam's intention (where did he go, anyway?) -
is not Yet Another Presentation Layer or a Shiny New Toy for people to
use in their tweets, but more of a sombre hint that "in the original
source document, this text had an alternative presentation; indicate
this to the user in an appropriate way, if applicable". It's meant for
preservation, not decoration. That's why I hear the "spirit of
Unicode".
Sławomir Osipiuk
From copypaste at kittens.ph Tue Dec 15 19:58:57 2020
From: copypaste at kittens.ph (Fredrick Brennan)
Date: Tue, 15 Dec 2020 20:58:57 -0500
Subject: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ
Message-ID: <9137826.KFeHLySHN7@laptop>
Hello!
With Unicode superscript lowercase letters, dates with superscript ordinal
indicators in English can be written in plaintext, e.g.:
1?? of January, 2?? of February, 3?? of March, 4?? of April, and so on.
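The ordinal forms described here can be generated mechanically. A minimal sketch, assuming the six modifier letters U+02E2, U+1D57, U+207F, U+1D48, U+02B3 and U+02B0 are the superscript letters intended (the function name is invented for this example):

```python
import unicodedata

# Superscript (modifier) letters needed for the English suffixes st, nd, rd, th.
SUPERSCRIPT = {
    "s": "\u02e2",  # MODIFIER LETTER SMALL S
    "t": "\u1d57",  # MODIFIER LETTER SMALL T
    "n": "\u207f",  # SUPERSCRIPT LATIN SMALL LETTER N
    "d": "\u1d48",  # MODIFIER LETTER SMALL D
    "r": "\u02b3",  # MODIFIER LETTER SMALL R
    "h": "\u02b0",  # MODIFIER LETTER SMALL H
}


def superscript_ordinal(n: int) -> str:
    """Return n followed by its English ordinal suffix in superscript letters."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return str(n) + "".join(SUPERSCRIPT[c] for c in suffix)


for day in (1, 2, 3, 4, 11, 23):
    text = superscript_ordinal(day)
    # NFKC folds the superscript letters back to ordinary ones, which is one
    # way a search or comparison could treat both spellings alike.
    print(text, unicodedata.normalize("NFKC", text))
```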
The only problem I've encountered is in font fallback; fonts are more likely to
contain ? than the other letters due to its use in Pe?h-?e-j? and IPA. So, ? often
appears in a different style in the word 2?? for example. This can be somewhat
avoided by using a font which supports all the letters, such as Gentium Plus, EB
Garamond, etc.
However, I have a feeling that this use is an abuse of the standard, but that brings
up an interesting comparison with the ordinal indicators for Spanish, Portuguese
(& other languages?), the masculine ? and the feminine ?.
If anyone has time to answer, why is one an abuse and the other not, if indeed 1?? is
an abuse as I think?
If it's not an abuse, then that could perhaps be an argument for the necessity of
encoding ????????? ???????? ?????? s???? ?, as ? is one of the few letters without a
combining counterpart in Cyrillic Extended-A or Extended-B. (Of course, no
breaking spaces would need to be used to write Russian 2-? if this character were
to be encoded, e.g. as U+32 U+A0 U+XXXX, while no-break spaces aren't needed
for Latin.)
Best,
Fred Brennan
From copypaste at kittens.ph Tue Dec 15 20:04:55 2020
From: copypaste at kittens.ph (Fredrick Brennan)
Date: Tue, 15 Dec 2020 21:04:55 -0500
Subject: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ
In-Reply-To: <9137826.KFeHLySHN7@laptop>
References: <9137826.KFeHLySHN7@laptop>
Message-ID: <2171140.htiGsxgcq4@laptop>
Oh dear, my email-client was erroneously configured to use SHIFT_JIS, which
mangled my message.
Corrections...
On Tuesday, December 15, 2020 8:58:57 PM EST I wrote:
> 1?? of January, 2?? of February, 3?? of March, 4?? of April, and so on.
1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so on.
> to contain ? than the other letters due to its use in Pe?h-?e-j? and IPA.
> So, ? often appears in a different style in the word 2?? for example.
to contain ⁿ than the other letters due to its use in Pe̍h-ōe-jī and IPA. So, ⁿ
often appears in a different style in the word 2ⁿᵈ for example.
> the masculine ? and the feminine ?.
the masculine º and the feminine ª.
> if indeed 1?? is an abuse as I think?
if indeed 1ˢᵗ is an abuse as I think?
> necessity of encoding ????????? ???????? ?????? s???? ?
necessity of encoding COMBINING CYRILLIC LETTER SHORT I
Very ironic :)
Best,
Fred Brennan
From mark at kli.org Tue Dec 15 20:36:07 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 15 Dec 2020 21:36:07 -0500
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
<8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
Message-ID: <0e31ec9e-490f-191a-c912-5a56d9abb602@kli.org>
An HTML attachment was scrubbed...
URL:
From indolering at gmail.com Tue Dec 15 21:18:41 2020
From: indolering at gmail.com (Zach Lym)
Date: Tue, 15 Dec 2020 19:18:41 -0800
Subject: Italics get used to express important semantic meaning, so
unicode should support them
In-Reply-To:
References:
<002401d6d110$db4ac280$91e04780$@gmail.com>
<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
<000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID:
> Finally, what I'm envisioning - and I'm not sure how closely this
> matches Christian Kleineidam's intention (where did he go, anyway?) -
> is not Yet Another Presentation Layer or a Shiny New Toy for people to
> use in their tweets, but more of a sombre hint that "in the original
> source document, this text had an alternative presentation; indicate
> this to the user in an appropriate way, if applicable". It's meant for
> preservation, not decoration. That's why I hear the "spirit of
> Unicode".
For those of us that can recall the exuberance of the XHTML movement,
, and friends were all deemed to be insufficiently semantic and
slated to be replaced by and . Of course, this was a
distinction without a difference and now we just have extra tags that
are more verbose and less literal.
But that begs the question: if the authors of a rich text standard
can't agree on what counts as semantic, how would Unicode decide?
What about , , or as I previously suggested