From wjgo_10009 at btinternet.com  Fri Dec  4 06:30:32 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 4 Dec 2020 12:30:32 +0000 (GMT)
Subject: A workaround for using colour fonts in some application programs
 that do not support colour fonts
Message-ID: <36a93647.41c.1762dbb82d0.Webtop.218@btinternet.com>


Hi
Some readers might like to know of a workaround that I have devised for 
using colour fonts in some application programs that do not support 
colour fonts.

The technique works because the Unicode code point for each character is 
exactly the same whether the character is displayed in a colour font or 
in a monochrome font.
The technique is to compose the design using the application program, 
the characters appearing in plain monochrome form, then export as an svg 
file without selecting the option to convert the text to curves.
The svg file is then displayed using an application program that does 
support colour fonts.

This works because the exporting application program writes the Unicode 
code points into the svg file as text, and the colour font supporting 
application program then renders those same code points using the 
colour font.
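
As a rough illustration of why the text must not be converted to curves, 
here is a minimal Python sketch. It builds an svg document containing a 
live text element and reads the code points back unchanged. The font 
family name "Playbox", the sizes and the sample string are assumptions 
made for the illustration, not details taken from the files described 
in this post. Whether a viewer actually draws the glyphs in colour 
depends on the font being installed and on the viewer's colour font 
support.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_svg(text: str, font_family: str) -> str:
    """Return an svg document that stores `text` as live text, not outlines."""
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": "400", "height": "120"})
    node = ET.SubElement(
        svg,
        f"{{{SVG_NS}}}text",
        # Hypothetical layout values; only font-family matters for the idea.
        {"x": "10", "y": "80", "font-family": font_family, "font-size": "64"},
    )
    node.text = text
    return ET.tostring(svg, encoding="unicode")


def read_text(svg_source: str) -> str:
    """Recover the character data of the first text element."""
    root = ET.fromstring(svg_source)
    return "".join(root.find(f"{{{SVG_NS}}}text").itertext())


svg_source = make_svg("PLAY", "Playbox")  # "Playbox" family name is assumed
recovered = read_text(svg_source)
print([f"U+{ord(c):04X}" for c in recovered])
```

The code points that come out are the ones that went in, so any 
renderer that has the named font can choose its colour glyphs for them.
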
For example, I started with Serif Affinity Publisher, which at present 
does not support colour fonts, produced an svg file without converting 
the text to curves, displayed the svg file using Microsoft Edge, made a 
'print screen' image, then trimmed out the browser window parts using 
Microsoft Paint and saved the result as a png file.
The technique has been found to work with Affinity Publisher, Affinity 
Designer and two legacy Serif products, PagePlus and CraftArtist2.
Please find attached a graphic made by me using Affinity Publisher, 
Microsoft Edge, Microsoft Paint and the Playbox colour font designed and 
kindly supplied free with a licence by Matt Lyon.
https://forum.affinity.serif.com/index.php?/topic/128285-colour-fonts-and-affinity-products/
William Overington
Friday 4 December 2020

-------------- next part --------------
A non-text attachment was scrubbed...
Name: playbox_in_publisher.png
Type: image/png
Size: 40979 bytes
Desc: not available
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201204/60ee029c/attachment-0001.png>

From christian.kleineidam at gmail.com  Fri Dec 11 06:57:23 2020
From: christian.kleineidam at gmail.com (Christian Kleineidam)
Date: Fri, 11 Dec 2020 13:57:23 +0100
Subject: Italics get used to express important semantic meaning, so unicode
 should support them 
Message-ID: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>

In the FAQ on Ligatures it's written "The mathematical letters and digits
are meant to be used only in mathematics, where the distinction between a
plain and a bold letter is fundamentally semantic rather than stylistic."
This suggests that the spirit of Unicode includes the intention to be able
to represent semantic meaning.

On Wikidata, we have the open problem of what to do with academic articles
that have italics in their official title. We store for example the paper
https://www.wikidata.org/wiki/Q33988883 which according to what the
publisher writes on http://www.biochemsoctrans.org/content/33/4/582 has
italics as part of its proper name. In Wikidata we want to be able to
store the semantic meaning.
This gives us the choice in Wikidata to either list the paper as
"Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
haplotype to Homo sapiens" or as "Evidence suggesting that
𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2 𝑀𝐴𝑃𝑇
haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠", which uses the mathematical
characters against recommendations, while the website lists it as "Evidence
suggesting that _Homo neanderthalensis_ contributed the H2 _MAPT_ haplotype
to _Homo sapiens_".

In scientific articles like that, the ability to represent italics is needed
to express all of the semantic meaning that's contained in the title. In
contrast to properties like font-size, italics have come to be used in the
real world to express semantic meaning. For a project like Wikidata that
cares about storing the semantic meaning of the title of an academic paper,
that makes Unicode problematic, as it leads us to lose information.

You might say that if Unicode doesn't serve the needs of Wikidata to store
the semantic content of the texts we care about, we should add additional
formatting on top of Unicode. Between RTF, Markdown, SGML, HTML, XML and
Wikitext there are multiple different formats we could use on Wikidata to
potentially represent italics. If we chose any one of them, however,
that would make it harder for data reusers who use another format to
interact with our data, as they would need to run a parser over the data,
which increases their code complexity.

Official style guidelines like the Chicago Manual of Style (17th edition)
specify that certain italics should be used to express certain semantic
meaning:

22.1.3 Other Types of Names
Other types of names also follow specific patterns for capitalization, and
some require italics.

22.2.1 Foreign-Language Terms
Italicize isolated words and phrases in foreign languages likely to be
unfamiliar to readers of English, and capitalize them as in their language.

22.3.2.1 ITALICS. Italicize the titles of most longer works, including the
types listed here. An initial the should be roman and lowercase before
titles of periodicals, or when it is not considered part of the title. For
parts of these works and shorter works of the same type, see 22.3.2.2.

The inability to follow the recommendations of the Chicago Manual of Style
to express semantic meaning in italics means that Unicode fails in its
mission to be able to express all semantic distinctions. This means that
it's technically impossible to follow the Chicago Manual of Style in code
comments of programming code that are in Unicode.

Outside of specialized needs like those of Wikidata and of programmers who
might want to follow the Chicago Manual of Style in that context, the
inability of Unicode to represent italic and bold text makes life harder for
average users as well. Web browsers can't offer their users the ability to
format a part of the text as italics or bold. As a result many users don't
know how to italicize or bold text when they write online, as different
websites use different standards. Many online systems break WYSIWYG for
italics and bold, which makes it harder for non-technical users to use them
to express themselves.
If Unicode supported italics and bold, the browser could make it easy
for users to apply italics or bold. Even smartphones would have the
option to let a user italicize or bold text in the menu that
currently allows copying and pasting.

Websites like https://yaytext.com/bold-italic/ get used by users to express
themselves in italics and bold on platforms like Facebook and Twitter that
use Unicode without additional formatting.

Having to use the unofficial workaround of mathematical letters is
undesirable because it means that software like screen readers is less
likely to interact well with the resulting text.

Proposal of a solution:

In today's usage italics often have semantic meaning. There are many cases
where it's desirable that a user can express such meaning but where there's
no intention to give the user control over features such as font size that
the user gets when HTML or RTF is used as format. With the symbol for
Right-to-Left text there's a precedent in Unicode for having signs that
manipulate multiple following characters. At the time of the design, italics
weren't used for expressing fundamentally semantic meaning such as "Homo
neanderthalensis" referring to a species, as it's used in the title of the
above paper.

Create a new unicode character for begin/end italic formatting and
begin/end bold formatting that works like the unicode character for the
Right-to-Left switch.

From kenwhistler at sonic.net  Fri Dec 11 12:42:52 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Fri, 11 Dec 2020 10:42:52 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
Message-ID: <e0845392-ee47-e865-7cb9-709fce0211ce@sonic.net>


On 12/11/2020 4:57 AM, Christian Kleineidam via Unicode wrote:
> Create a new unicode character for begin/end italic formatting and 
> begin/end bold formatting that works like the unicode character for 
> the Right-to-Left switch.

<i>...</i> and <b>...</b>

Yeah, they are sequences of 3 (or 4) existing characters, and not single 
code points, but they accomplish what you are asking for and they work 
everywhere on the web already.

Nobody would thank you for introducing yet *another* form of scoped 
markup for the same effects that would take years to be picked up 
(inconsistently) in thousands of implementations, and which would 
introduce yet more possibilities for conflicts in dueling schemes for 
markup in text.

--Ken


From kilobyte at angband.pl  Fri Dec 11 13:42:01 2020
From: kilobyte at angband.pl (Adam Borowski)
Date: Fri, 11 Dec 2020 20:42:01 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <e0845392-ee47-e865-7cb9-709fce0211ce@sonic.net>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <e0845392-ee47-e865-7cb9-709fce0211ce@sonic.net>
Message-ID: <20201211194201.GA5630@angband.pl>

On Fri, Dec 11, 2020 at 10:42:52AM -0800, Ken Whistler via Unicode wrote:
> On 12/11/2020 4:57 AM, Christian Kleineidam via Unicode wrote:
> > Create a new unicode character for begin/end italic formatting and
> > begin/end bold formatting that works like the unicode character for the
> > Right-to-Left switch.
> 
> <i>...</i> and <b>...</b>
> 
> Yeah, they are sequences of 3 (or 4) existing characters, and not single
> code points, but they accomplish what you are asking for and they work
> everywhere on the web already.
> 
> Nobody would thank you for introducing yet *another* form of scoped markup
> for the same effects that would take years to be picked up (inconsistently)
> in thousands of implementations, and which would introduce yet more
> possibilities for conflicts in dueling schemes for markup in text.

And, despite the original recommendation, enough people use math characters
for that, so even Google considers them equivalent to basic ASCII.

So just:
echo 'Homo sapiens'|tran italic
and 'ere you go.
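
The `tran` command is quoted as written and its behaviour is not
reproduced here. As a separate, hedged sketch of the same idea in Python,
the function below (a hypothetical helper, `to_math_italic`) shifts ASCII
letters into the Mathematical Italic block, and NFKC normalization folds
them back to ASCII, which is one plausible way a search engine could treat
the two spellings as equivalent. The only special case is small h, whose
slot in the block is reserved because U+210E PLANCK CONSTANT already
serves as the italic h.

```python
import unicodedata

ITALIC_CAPITAL_A = 0x1D434  # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E    # MATHEMATICAL ITALIC SMALL A
PLANCK_CONSTANT = "\u210E"  # stands in for the reserved italic small h slot


def to_math_italic(text: str) -> str:
    """Map ASCII letters to mathematical italic; leave everything else alone."""
    out = []
    for ch in text:
        if ch == "h":
            out.append(PLANCK_CONSTANT)
        elif "A" <= ch <= "Z":
            out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


styled = to_math_italic("Homo sapiens")
print(styled)
# Compatibility normalization maps the styled letters back to plain ASCII.
print(unicodedata.normalize("NFKC", styled))
```

Note that the round trip only works in one direction: NFKC recovers the
letters but discards the fact that they were ever italic.
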


?!

From doug at ewellic.org  Fri Dec 11 16:38:07 2020
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 11 Dec 2020 15:38:07 -0700
Subject: Italics get used to express important semantic meaning,
 so unicode should support them 
In-Reply-To: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
Message-ID: <000301d6d00e$4d244330$e76cc990$@ewellic.org>

Christian Kleineidam wrote:

> "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
> 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"

"Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT haplotype to Homo sapiens"

This title is completely meaningful in plain text. The convention to style the names of species and haplotypes in italics is just that, a styling convention.

> Between RTF, Markdown, SGML, HTML, XML and Wikitext there are multiple
> different formats we could use on Wikidata to potentially represent
> italics. If we would however choose any one of them that would make it
> harder for data-reusers who use another format to interact with our
> data as they would need to run a parser over the data which increases
> their code complexity and makes it harder to interact with our data.

https://xkcd.com/927/

> The inability to follow the recommendations of the Chicago Manual of
> Style to express semantic meaning in italics means that unicode fails
> in it's mission to be able to express all semantic distinctions. This
> means that it's technically impossible to follow the Chicago Manual of
> Style in code comments of programming code that are in unicode.

Style guides such as Chicago and AP and MLA cover many stylistic realms beyond this. They tell the writer how to indent certain passages and what sort of contrastive font faces and sizes should be used for quotations and how tables should be laid out. None of this is within the scope of a plain-text encoding standard either.

--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org


From richard.wordingham at ntlworld.com  Fri Dec 11 17:19:08 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 11 Dec 2020 23:19:08 +0000
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
Message-ID: <20201211231908.29035298@JRWUBU2>

On Fri, 11 Dec 2020 13:57:23 +0100
Christian Kleineidam via Unicode <unicode at unicode.org> wrote:

> At the time of the design italics weren't used for
> expressing fundamentally semantic meaning such as "Homo
> neanderthalensis" referring to a a species as it's used in the title
> of the above paper.

I just looked in a 1969 reprint of a school biology textbook published
in 1966.  It consistently italicises generic names such as _Drosophila_
within sentences, so I find your claim hard to credit.

Of course, typewritten materials had to resort to underlining to
indicate italicisation in such cases.  I think I've seen such usage,
but my memory may not be reliable.

Richard.

From jameskass at code2001.com  Fri Dec 11 19:41:31 2020
From: jameskass at code2001.com (James Kass)
Date: Sat, 12 Dec 2020 01:41:31 +0000
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
Message-ID: <d4c41629-fdcf-018b-fd1b-6e9e7ceb6883@code2001.com>


On 2020-12-11 12:57 PM, Christian Kleineidam via Unicode wrote:
> This suggests that the spirit of Unicode includes the intention to be able
> to represent semantic meaning.

That's the spirit!

The topic of italics in Unicode was last discussed extensively on this 
list in January of 2019, bleeding into February.

https://unicode.org/mail-arch/unicode-ml/y2019-m01/

As Adam Borowski points out, enough people are using the math 
alphanumerics that we have a 𝑑𝑒 𝑓𝑎𝑐𝑡𝑜 method.

From indolering at gmail.com  Fri Dec 11 22:14:08 2020
From: indolering at gmail.com (Zach Lym)
Date: Fri, 11 Dec 2020 20:14:08 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
Message-ID: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>

I have been tracking down the rationale behind the normalization
choices in filesystems.  One trouble spot for implementers is
interpreting strict logician terminology paired with imprecise pseudo
code.  Take the definition of Unicode's caseless matching algorithm
[D145]:

> A string X is a canonical caseless match for a string Y if and only if:
> NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))

The W3C Canonical Case Fold Normalization algorithm claims to be
compatible with [D145], but uses NFC in the last step
[w3c-charmod-norm], leading to an apparent contradiction.  Even though
Unicode explains that "case folding is closed under canonical
normalization" it took me a long time to find that passage and
convince myself that the W3C and Unicode matching algorithms are
equivalent.  I am not alone: *Linux kernel hackers couldn't figure it
out either* [linux-norm]!
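
A small Python sketch may make the equivalence easier to see. It follows
the shape of the D145 formula but lets the caller pick the final
normalization step, so the Unicode (NFD) and W3C (NFC) variants can be
compared side by side. The assumptions are mine: Python's `str.casefold()`
stands in for toCasefold, the helper names are hypothetical, and the
results reflect whatever Unicode version the running Python ships with.

```python
import unicodedata


def canonical_caseless_key(s: str, final_form: str = "NFD") -> str:
    """final_form(toCasefold(NFD(s))), with final_form either "NFD" or "NFC"."""
    decomposed = unicodedata.normalize("NFD", s)
    folded = decomposed.casefold()  # assumption: stands in for toCasefold
    return unicodedata.normalize(final_form, folded)


def canonical_caseless_match(x: str, y: str, final_form: str = "NFD") -> bool:
    """True when x and y compare equal under the chosen variant."""
    return canonical_caseless_key(x, final_form) == canonical_caseless_key(
        y, final_form
    )


pairs = [
    ("\u00C5ngstr\u00F6m", "a\u030Angstro\u0308m"),  # precomposed vs. combining
    ("STRASSE", "stra\u00DFe"),                      # sharp s folds to "ss"
    ("resume", "r\u00E9sum\u00E9"),                  # genuinely different
]
for x, y in pairs:
    print(
        canonical_caseless_match(x, y, "NFD"),
        canonical_caseless_match(x, y, "NFC"),
    )
```

The keys produced by the two variants differ as strings, but each pair
gets the same yes/no answer under both, which is all a matching
algorithm needs.
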

I was originally going to propose additions to D145 textual
description, cross-references to the implementation section, and
adding discussion of W3C charmod-norm.  However, I don't think this
would help as the text is already quite dense and most people will
just ignore everything outside the example anyway [minimalist-manual].

I would instead like to propose normalization form generics for use in
pseudo code definitions:

    NFx = NFD|NFC
    NFKx = NFKD|NFKC
    NFxy = NFD|NFC|NFKD|NFKC

Freestanding `X`/`Y` variables should probably be replaced to
disambiguate them from the `NFx` nomenclature.  `s1`/`s2` would work
but `foo`/`bar` is less dense:

    NFx(caseFold(NFD(foo))) = NFx(caseFold(NFD(bar)))

`NFx` does not currently appear within the Unicode standard itself,
but is used in the normalization annex [UAX15].  However,
**UAX15 defines `NFx` twice**, first as NFD|NFC|NFKD|NFKC and later on
as NFD|NFC.  I think the proposed convention gets the most mileage out
of the nomenclature and is how I have seen `NFx` used in the real
world [linus].

Thank you!
-Zach Lym

[w3c-charmod-norm]:
https://w3c.github.io/charmod-norm/#CanonicalFoldNormalizationStep
[linux-norm]: https://lwn.net/ml/linux-fsdevel/20190318202745.5200-10-krisman%40collabora.com
[minimalist-manual]: https://dl.acm.org/doi/10.1207/s15327051hci0302_2
[UAX15]: https://unicode.org/reports/tr15/
[linus]: https://lore.kernel.org/linux-fsdevel/CAHk-=wiFtZL5rK3T-HQPm0oG4vekDJEKS47P8BbzHSXt_6SHuA@mail.gmail.com/

From sosipiuk at gmail.com  Fri Dec 11 23:58:41 2020
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Sat, 12 Dec 2020 00:58:41 -0500
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>
References: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>
Message-ID: <CAM+ijLi6tX-_so1NTaTMOvs4=RY-br8DvHLMUuVfmtBHKNSsmQ@mail.gmail.com>

On Fri, Dec 11, 2020 at 11:49 PM Zach Lym via Unicode
<unicode at unicode.org> wrote:
>
> > A string X is a canonical caseless match for a string Y if and only if:
> > NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
>
> The W3C Canonical Case Fold Normalization algorithm claims to be
> compatible with [D145], but uses NFC in the last step
> [w3c-charmod-norm], leading to an apparent contradiction.  Even though
> Unicode explains that "case folding is closed under canonical
> normalization" it took me a long time to find that passage and
> convince myself that the W3C and Unicode matching algorithms are
> equivalent.

The more general rule is that:
NFC(X) = NFC(Y) if and only if NFD(X) = NFD(Y).
I.e. you can always replace one canonical form with the other in
equivalence comparisons. (As long as you apply the same one to both
sides, of course, but which one is up to you.)
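
The reason is that NFC is defined as canonical decomposition followed by
canonical composition, so NFC(X) = NFC(NFD(X)), and decomposing a composed
string gives back the same decomposition, so NFD(X) = NFD(NFC(X)). Equal
NFD forms therefore compose to equal NFC forms, and equal NFC forms
decompose to equal NFD forms. The following Python sketch spot-checks this
on a handful of sample strings chosen for illustration (the list and the
helper names are mine, and a spot check is of course not a proof).

```python
import unicodedata
from itertools import product

samples = [
    "\u00E9",         # e with acute, precomposed
    "e\u0301",        # e followed by combining acute
    "\u212B",         # ANGSTROM SIGN
    "\u00C5",         # A with ring above
    "A\u030A",        # A followed by combining ring above
    "q\u0307\u0323",  # dot above, then dot below
    "q\u0323\u0307",  # dot below, then dot above (canonical order)
    "e",              # plain letter, equivalent to none of the above
]


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


# Collect every pair on which the two comparisons give different answers.
disagreements = [
    (x, y)
    for x, y in product(samples, repeat=2)
    if (nfc(x) == nfc(y)) != (nfd(x) == nfd(y))
]
print(len(disagreements))
```

The list of disagreements comes out empty, matching the rule stated above.
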

> I would instead like to propose normalization form generics for use in
> pseudo code definitions:
>
>     NFx = NFD|NFC
>     NFKx = NFKD|NFKC
>     NFxy = NFD|NFC|NFKD|NFKC

I would prefer the last one to be:
NF(K)x = NFD|NFC|NFKD|NFKC; or perhaps
NF[K]x = NFD|NFC|NFKD|NFKC; to look a bit more like ABNF.

Sławomir Osipiuk


From wjgo_10009 at btinternet.com  Sat Dec 12 09:39:28 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 12 Dec 2020 15:39:28 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <79fb5335.d72.17657954ac1.Webtop.223@btinternet.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <79fb5335.d72.17657954ac1.Webtop.223@btinternet.com>
Message-ID: <35cce9c0.d7f.176579b5f0a.Webtop.223@btinternet.com>

Hi

You might find the following links of interest. The proposal was not 
successful and was dismissed strongly, indeed using italics for 
emphasis.

For the avoidance of doubt I did not advocate regarding the encoding of 
the mathematical italic characters as a precedent for what I proposed.

It is somewhat ironic that the refusal uses italics for emphasis and 
could, in my opinion, reasonably be regarded as evidence supporting the 
case for what you want encoded, as that emphasis cannot at present be 
expressed in plain text. If it is not a semantic difference 
then it seems to me that there is no reason whatsoever to use italics at 
all in that refusal notice. So has Unicode Inc. in fact shown in its 
refusal the very need that it is refusing to encode?

https://www.unicode.org/L2/L2019/19063-italic-vs.pdf

https://www.unicode.org/L2/L2019/19195-italic-cmt.pdf

https://forum.high-logic.com/viewtopic.php?f=10&t=7831

https://www.unicode.org/alloc/nonapprovals.html

However, such dismissals are not absolute because sometimes there is a 
U-turn later, for example with the encoding of emoji. Look at where 
emoji encoding is now, no longer about just backwards compatibility yet 
pushing forward with new designs. For the avoidance of doubt I am 
pleased that emoji are being encoded. I wish that they would not insist 
that my proposals for encoding a futuristic idea of mine are out of 
scope and refuse to allow them to be discussed in this mailing list or 
put to The Unicode Technical Committee.

I note that you mention a QID item.

There is an ongoing public review about encoding what are being called 
QID emoji.

https://www.unicode.org/review/pri408/

Although the page currently shows a closing date that has passed, the 
public review has, in fact, been reopened as listed on the following 
page.

https://www.unicode.org/review/

Best regards,

William Overington

Saturday 12 December 2020

http://www.users.globalnet.co.uk/~ngo/

My website is safe to use, it is not hosted on my own computer, but is 
hosted on a server run by Plusnet PLC, a United Kingdom company.

From christian.kleineidam at gmail.com  Sat Dec 12 13:01:05 2020
From: christian.kleineidam at gmail.com (Christian Kleineidam)
Date: Sat, 12 Dec 2020 20:01:05 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <000301d6d00e$4d244330$e76cc990$@ewellic.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
Message-ID: <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>

 On Fri, Dec 11, 2020 at 11:38 PM Doug Ewell <doug at ewellic.org> wrote:

> Christian Kleineidam wrote:
>
> > "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
> > 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"
>
> "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
> haplotype to Homo sapiens"
>
> This title is completely meaningful in plain text. The convention to style
> the names of species and haplotypes in italics is just that, a styling
> convention.
>

Would you also say there's no semantic difference between "Evidence
suggesting that Homo neanderthalensis contributed the H2 MAPT haplotype to
Homo sapiens" and EVIDENCE SUGGESTING THAT HOMO NEANDERTHALENSIS
CONTRIBUTED THE H2 MAPT HAPLOTYPE TO HOMO SAPIENS"? If so, why does unicode
allow those to be formatted differently?

I think that capitalization generally gets used to express semantic
meaning. Capitalizing the first character of a sentence is a way to
semantically mark the start of the sentence. Capitalizing Homo is a way to
express semantics. Homo gets capitalized here for the same reasons as it
gets italicized. In both cases it's because the semantics of a species name
dictate it if you follow official recommendations.

From asmusf at ix.netcom.com  Sat Dec 12 16:32:33 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 12 Dec 2020 14:32:33 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
Message-ID: <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201212/c0a39657/attachment.htm>

From duerst at it.aoyama.ac.jp  Sat Dec 12 19:25:06 2020
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sun, 13 Dec 2020 10:25:06 +0900
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
Message-ID: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>

Asmus gives a lot of good reasons below. Here are some more:

Children learn to write with upper case and lower case letters in 
school, and most people continue to use both as adults. (There are 
exceptions of course, some people write only with lower case, and some 
only with upper case.) On the other hand, people who distinguish upright 
and italic in handwriting are extremely rare (maybe limited to editors 
of certain journals?).

Also, case is important in names. It's Ludwig van Beethoven, not Ludwig 
Van Beethoven, and LeBron James, not Lebron James. Italics don't come 
into consideration here at all.

For all these reasons, the upper/lower case distinction was and is also 
available on typewriters and keyboards. Again not so for italic.

Regards,   Martin.

On 13/12/2020 07:32, Asmus Freytag via Unicode wrote:
> On 12/12/2020 11:01 AM, Christian Kleineidam via Unicode wrote:
>> On Fri, Dec 11, 2020 at 11:38 PM Doug Ewell <doug at ewellic.org 
>> <mailto:doug at ewellic.org>> wrote:
>>
>>     Christian Kleineidam wrote:
>>
>>     > "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
>>     > 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"
>>
>>     "Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
>>     haplotype to Homo sapiens"
>>
>>     This title is completely meaningful in plain text. The convention to style
>>     the names of species and haplotypes in italics is just that, a styling
>>     convention.
>>
>> Would you also say there's no semantic difference between "Evidence suggesting 
>> that Homo neanderthalensis contributed the H2 MAPT haplotype to Homo sapiens" 
>> and EVIDENCE SUGGESTING THAT HOMO NEANDERTHALENSIS CONTRIBUTED THE H2 MAPT 
>> HAPLOTYPE TO HOMO SAPIENS"? If so, why does unicode allow those to be 
>> formatted differently?
>>
>> I think that capitalization generally gets used to express semantic meaning. 
>> Capitalizing the first character of a sentence is a way to semantically mark 
>> the start of the sentence. Capitalizing Homo is a way to express semantics. 
>> Homo gets capitalized here for the same reasons as it gets italicized. In both 
>> cases it's because the semantics of a species name dictate it if you follow 
>> official recommendations.
> 
> There are significant differences in usage as well as implication.
> 
> A style, like "italics" can be applied to nearly the entire set of Unicode
> characters, while case is limited to a comparatively tiny subset. If Unicode
> wanted to encode styles like it does for case, it would mean multiplying the
> number of characters.
> 
> But Mathalphabetics, you say. Well, in mathematical notation, certain styles are
> applied to very limited subsets. In effect, you could argue that in those
> contexts, certain stylistic variants work like case in ordinary orthographies.
> (Mathematical use of letter shapes is special, as it is almost exclusively
> using letter shapes as individual symbols, not part of words).
> 
> Styles, commonly, are applied in runs, not to isolated code points. For case,
> the default is the other way around. In both cases, the exceptions prove the
> underlying rule.
> 
> ALL UPPER CASE, as well as SMALL CAPS are more like a style than normal casing.
> As shown by the way they are supported like styles in feature-rich word
> processing apps.(The latter are not encoded: extending the arguments for
> encoding italics would force adding support for small caps as well).
> 
> Styles, unlike case when applied to selected letters, tends to not have
> orthographic use. Even if it carries meaning that goes beyond being
> "decorative". There are exceptions even here, that prove the rule.
> 
> Finally, the guiding design principle for "plain text" is that it is stateless
> (again, exceptions like bidi, are there to prove the rule). Styles, being
> applied in runs, are inherently not stateless, so are best expressed in stateful
> ways (that is, in one or the other rich-text protocols).
> 
> The use case comes from lack of support of stateful text protocols (even limited
> ones) in places such as social media. There is no inherent reason why Twitter,
> Facebook and the like could not support "markdown" or similar protocols.
> 
> On balance, all proposals for supporting some sort of "italics in Unicode"
> ignore not only the interrelationship shown in these facts, but also the well
> established historical division of "plain text" and "rich text" -- which Unicode
> has no business upsetting.
> 
> A./
> 

From prosfilaes at gmail.com  Sat Dec 12 19:59:53 2020
From: prosfilaes at gmail.com (David Starner)
Date: Sat, 12 Dec 2020 17:59:53 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID: <CAMZ=zj6-aJ8UNhkZj2PMSMGjZT8=QSaPWGG7SQceXLgg1Fd9tw@mail.gmail.com>

There's a lot of good answers, but I'd like to circle back to what I
think is the core reason: we've had character sets for seven decades,
virtually all of which supported English, and if any have supported
italics, I've never heard of it. Unicode supports italics the most of
any character set I've heard of. Whether in some sense italics should
be encoded in plain text is not an open problem; it's been assigned to
a level above plain text, and is well supported there.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)

From indolering at gmail.com  Sat Dec 12 20:23:23 2020
From: indolering at gmail.com (Zach Lym)
Date: Sat, 12 Dec 2020 18:23:23 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <CAM+ijLi6tX-_so1NTaTMOvs4=RY-br8DvHLMUuVfmtBHKNSsmQ@mail.gmail.com>
References: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>
 <CAM+ijLi6tX-_so1NTaTMOvs4=RY-br8DvHLMUuVfmtBHKNSsmQ@mail.gmail.com>
Message-ID: <CABWuLVfUKDj-0uBmbgTt7_QMQ7fjHapU6XUWiKMKb7g5Q=yBgw@mail.gmail.com>

> The more general rule is that:
> NFC(X) = NFC(Y) if and only if NFD(X) = NFD(Y).
> I.e. you can always replace one canonical form with the other in
> equivalence comparisons. (As long as you apply the same one to both
> sides, of course, but which one is up to you.)

Yes, and a careful reading of the standard will show that this is the
case.  But we don't live in a world where people have time to read the
standard. Oh dear, I included the wrong link in my citation!  It
should have been:
https://lwn.net/ml/linux-fsdevel/20190206084752.nwjkeiixjks34vao@pali/

At any rate, someone suggested using NFC, but this objection came up:

>> Is there any case where
>>    NFC(x) == NFC(y) && NFD(x) != NFD(y)   , or
>>    NFC(x) != NFC(y) && NFD(x) == NFD(y)
>
>This is good question. And I think we should get definite answer for it
>prior inclusion of normalization into kernel.

Which was simply never followed up on.  This is a feature that was
included after years of debate and developed in an open process.  If
even Linux can't get this one right, then we need to do a better job
at explaining Unicode.
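
For anyone who wants to poke at the objection quoted above, here is a
minimal sketch in Python using the standard unicodedata module. The helper
names and the sample strings are my own choices for illustration, not
anything from the kernel thread, and a spot check over a handful of
strings is of course not a proof. It only illustrates the rule that NFC
and NFD always give the same answer in equivalence comparisons.

```python
import unicodedata


def equal_nfc(x: str, y: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)


def equal_nfd(x: str, y: str) -> bool:
    """Compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


# Illustrative samples: precomposed vs. decomposed letters, a singleton
# (ANGSTROM SIGN), and combining marks given in two different orders.
SAMPLES = [
    "\u00e9",          # e with acute, precomposed
    "e\u0301",         # e + combining acute
    "\u212b",          # ANGSTROM SIGN
    "\u00c5",          # A with ring above
    "A\u030a",         # A + combining ring above
    "q\u0307\u0323",   # q + dot above + dot below
    "q\u0323\u0307",   # q + dot below + dot above
    "e",
]

# For every pair, the NFC comparison and the NFD comparison must agree.
forms_agree = all(
    equal_nfc(a, b) == equal_nfd(a, b) for a in SAMPLES for b in SAMPLES
)
```

The point is that either function can stand in for the other, as long as
the same one is applied to both sides.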

> > I would instead like to propose normalization form generics for use in
> > pseudo code definitions:
> >
> >     NFx = NFD|NFC
> >     NFKx = NFKD|NFKC
> >     NFxy = NFD|NFC|NFKD|NFKC
>
> I would prefer the last one to be:
> NF(K)x = NFD|NFC|NFKD|NFKC; or perhaps
> NF[K]x = NFD|NFC|NFKD|NFKC; to look a bit more like ABNF.

I don't care for NFxy either, but I strongly prefer sticking to C
programming conventions.
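
To make the idea of the generics a bit more concrete, here is a rough
Python sketch. The names NFX, NFKX, NFXY and both helper functions are
hypothetical and exist only for this illustration; the only library call
used is unicodedata.normalize. Each "generic" is simply the group of forms
that can be swapped for one another in a comparison.

```python
import unicodedata

# Hypothetical names mirroring the proposed generics.
NFX = ("NFD", "NFC")        # canonical forms
NFKX = ("NFKD", "NFKC")     # compatibility forms
NFXY = NFX + NFKX           # any of the four forms


def equal_under(form: str, x: str, y: str) -> bool:
    """Compare x and y after normalizing both sides with the same form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)


def equal_under_group(forms, x: str, y: str) -> bool:
    """Return the answer shared by every form in the group.

    Raises ValueError if the forms in the group disagree, which is not
    expected to happen for NFX or for NFKX.
    """
    answers = {equal_under(form, x, y) for form in forms}
    if len(answers) != 1:
        raise ValueError("forms in this group disagree")
    return answers.pop()
```

With this, equal_under_group(NFX, x, y) reads as "NFx(x) == NFx(y)", and
the same for NFKX. Note that the two groups can give different answers
from each other: the ligature U+FB01 matches "fi" under NFKx but not
under NFx.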

From mark at kli.org  Sat Dec 12 20:48:54 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Sat, 12 Dec 2020 21:48:54 -0500
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID: <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201212/ef59b249/attachment.htm>

From doug at ewellic.org  Sat Dec 12 21:20:01 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 12 Dec 2020 20:20:01 -0700
Subject: Italics get used to express important semantic meaning,
 so unicode should support them
In-Reply-To: <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
Message-ID: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>

Others have covered pretty much everything I was going to respond to Christian with.

David Starner wrote:

> I'd like to circle back to what I think is the core reason: we've had
> character sets for seven decades, virtually all of which supported
> English, and if any have supported italics, I've never heard of it.

The only conceivable exception might be ISO-IR-68, which represented APL and its distinctive italic uppercase letters. The registration [1] for ISO-IR-68 named the letters (e.g.) "CAPITAL APL LETTER A" and noted that they were "[u]sually printed in italics," revealing that this was merely a font preference specific to APL, as the corresponding roman (non-italic) letters were not also included. All mapping tables from ISO-IR-68, including Unicode's, map the italic APL letters to normal ASCII letters.

[1] https://www.itscj.ipsj.or.jp/iso-ir/068.pdf

Christian wrote:

> If so, why does unicode allow those [uppercase and lowercase letters]
> to be formatted differently?

For "formatted differently" I read "encoded separately"; Unicode doesn't dictate whether characters are displayed in an upright (roman) or italic style. If one uses George Douros's Akkadian font, for example, everything comes out in italics.

I wonder if the spelling "unicode" was meant here as a statement about the semantics of initial capitals. Standard English orthography requires that trade names like "Unicode" be spelled with an initial capital, whereas no orthographic requirement exists to spell anything with italics.

We do understand that not every possible nuance of human communication, such as shades of emphasis, can be expressed in plain text. It seems that the nearly 30-year-old Unicode definition of "plain text" still has not caught on universally, since requests continue to emerge for UTC to encode things that are not plain text by that definition.

--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org


From pandey at umich.edu  Sat Dec 12 21:33:08 2020
From: pandey at umich.edu (Anshuman Pandey)
Date: Sat, 12 Dec 2020 21:33:08 -0600
Subject: Italics get used to express important semantic meaning,
 so unicode should support them
In-Reply-To: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
References: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
Message-ID: <64B08E21-8D4B-4AC1-85A9-C40D9E468178@umich.edu>

Doug basically covered everything I had to say.

<i>?</i>




From asmusf at ix.netcom.com  Sat Dec 12 22:03:58 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 12 Dec 2020 20:03:58 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <8d4762e3-dbe1-bc6e-dc86-b0736ddcc660@kli.org>
 <000001d6d0fe$d9860d40$8c9227c0$@ewellic.org>
Message-ID: <8168fd17-31c3-5c23-d94d-7864bbd455a9@ix.netcom.com>

On 12/12/2020 7:20 PM, Doug Ewell via Unicode wrote:
> We do understand that not every possible nuance of human communication, such as shades of emphasis, can be expressed in plain text. It seems that the nearly 30-year-old Unicode definition of "plain text" still has not caught on universally, since requests continue to emerge for UTC to encode things that are not plain text by that definition.

If you go against an established method/truth/system/consensus/anything 
and win, you'll be famous. That's the lure that keeps people up at night 
trying to create a /perpetuum mobile/.

Problem is that the chances of winning are usually more than elusive.

Doesn't prevent people from trying.

If conservation of energy, posited by Julius von Mayer in 1842 and 
well-tested in the over 150 years since then, does not prevent people 
trying the impossible, then why should 30 years of Unicode be sufficient :)

A./


From sosipiuk at gmail.com  Sat Dec 12 23:28:56 2020
From: sosipiuk at gmail.com (=?utf-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Sun, 13 Dec 2020 00:28:56 -0500
Subject: Italics get used to express important semantic meaning,
 so unicode should support them 
In-Reply-To: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
Message-ID: <002401d6d110$db4ac280$91e04780$@gmail.com>

I mostly agree with the general consensus, though probably not as firmly. However, I had a showerthought that, specifically in the case of Latin terms, marking them as such would be a legitimate use of the Unicode language tags. Indeed, an indication of "this is Latin text" would be more correct and future-proof than "this is italicized", since the proper styling to indicate Latin text may change with the times, and because tags are default-ignorable, this approach would still be compatible with "plain text" programs. The wiki (or whatever software) could be made to italicize Latin-within-English text that is tagged as such.

I know the tags are officially deprecated, but I personally think they got a bad rap. If (and that is a big if) a system for basic formatting (italic/bold/underlined/nonspecifically-emphasized) is ever implemented in Unicode, it should be via the default-ignorable tags.

Sławomir Osipiuk


From asmusf at ix.netcom.com  Sat Dec 12 23:45:54 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 12 Dec 2020 21:45:54 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <002401d6d110$db4ac280$91e04780$@gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
Message-ID: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201212/fd342e8e/attachment.htm>

From richard.wordingham at ntlworld.com  Sun Dec 13 18:51:56 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 14 Dec 2020 00:51:56 +0000
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>
References: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>
Message-ID: <20201214005156.6125d895@JRWUBU2>

On Fri, 11 Dec 2020 20:14:08 -0800
Zach Lym via Unicode <unicode at unicode.org> wrote:

> > A string X is a canonical caseless match for a string Y if and only
> > if: NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))  

> Even though
> Unicode explains that "case folding is closed under canonical
> normalization" it took me a long time to find that passage and
> convince myself that the W3C and Unicode matching algorithms are
> equivalent.

What does that quoted statement mean?  I'm having a hard job working
out what the meaning of full case folding is.  I'm not having any
doubts about the meaning of toCasefold(NFD(X)), so there is no issue
for 'canonical caseless matching'.
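
For reference, the quoted definition can be written down almost literally
in Python. This is only a sketch: the function names are mine, and I am
assuming that str.casefold() (which the Python documentation describes as
the case folding algorithm of the Unicode Standard) is an acceptable
stand-in for toCasefold.

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    """Compute NFD(toCasefold(NFD(s))) for a single string."""
    decomposed = unicodedata.normalize("NFD", s)
    folded = decomposed.casefold()
    return unicodedata.normalize("NFD", folded)


def canonical_caseless_match(x: str, y: str) -> bool:
    """True if x is a canonical caseless match for y."""
    return canonical_caseless_key(x) == canonical_caseless_key(y)


# Mixed case and mixed precomposed/decomposed spellings of the same word.
same_word = canonical_caseless_match(
    "\u00c5ngstr\u00f6m", "A\u030aNGSTRO\u0308M"
)
```

The outer NFD is there because folding can produce a sequence that is no
longer in normalized form, so both sides are normalized once more before
they are compared.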

Richard.

From indolering at gmail.com  Sun Dec 13 22:08:08 2020
From: indolering at gmail.com (Zach Lym)
Date: Sun, 13 Dec 2020 20:08:08 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <20201214005156.6125d895@JRWUBU2>
References: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>
 <20201214005156.6125d895@JRWUBU2>
Message-ID: <CABWuLVeRX2eLYWGjW17-BY-mP92CRSWvHWWd_xzi+QdDKNF+sw@mail.gmail.com>

> What does that quoted statement mean?  I'm having a hard job working
> out what the meaning of full case folding is.  I'm not having any
> doubts about the meaning of toCasefold(NFD(X)), so there is no issue
> for 'canonical caseless matching'.

The "case folding is closed under canonical normalization" or the other part?

Closed as in closure: https://en.wikipedia.org/wiki/Closure_(mathematics)

Refer to page 240 of the standard, Chapter 5 "Implementation
Guidelines" Section 18 "Case Mappings":

http://www.unicode.org/versions/latest/ch05.pdf

From marius.spix at web.de  Mon Dec 14 05:26:47 2020
From: marius.spix at web.de (Marius Spix)
Date: Mon, 14 Dec 2020 12:26:47 +0100
Subject: Aw: Re: Normalization Generics (NFx, NFKx, NFxy)
Message-ID: <trinity-de9315c2-e08f-4c86-8120-c4cbcfb0fec2-1607945207708@3c-app-webde-bs08>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201214/df9383ce/attachment.htm>

From harjitmoe at outlook.com  Mon Dec 14 08:22:59 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 14 Dec 2020 14:22:59 +0000
Subject: Aw: Re: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <trinity-de9315c2-e08f-4c86-8120-c4cbcfb0fec2-1607945207708@3c-app-webde-bs08>
References: <trinity-de9315c2-e08f-4c86-8120-c4cbcfb0fec2-1607945207708@3c-app-webde-bs08>
Message-ID: <VI1PR0701MB24135B9802081F0E9CD0C2B9B7C70@VI1PR0701MB2413.eurprd07.prod.outlook.com>

Marius Spix via Unicode wrote:
> I understand that:
> [:toCaseFold=s:] = [sS?]
> [:toCaseFold=?:] = [???]
> But can someone explain me the following?
> [:toCaseFold=?:] = [?]
> [:toCaseFold=i:] = [iI]
> [:toCaseFold=?:] = []
> Why is it not:
> [:toCaseFold=?:] = [iI?]
> [:toCaseFold=i:] = [iI?]
> [:toCaseFold=?:] = [??]
> ?
>


ß is often changed to SS in uppercase; the ẞ is a relatively new 
addition as an encoded character and is not consistently used. So 
PREUSSEN and Preußen are casings of the same word, for example. I think 
ẞ might have been added after ß's casefolding was already defined, but 
I'm not sure so don't quote me on that.

"I" cannot casefold to *both* "i" and "ı", it has to casefold to one of 
them. Not sure about "?" not casefolding the same as "I", but I don't 
suppose there really exists any "good" locale-independent solution for 
case insensitivity of "I".
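
The behaviour is easy to observe from Python, whose str.casefold() applies
the default (locale-independent) full case folding. The following is just
a small sketch with variable names of my own; it shows what the default
folding does, not what a Turkish-aware folding would do.

```python
# Default full case folding as implemented by str.casefold().
sharp_s = "\u00df".casefold()          # LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e".casefold()  # LATIN CAPITAL LETTER SHARP S
long_s = "\u017f".casefold()           # LATIN SMALL LETTER LONG S

capital_i = "I".casefold()             # LATIN CAPITAL LETTER I
dotless_i = "\u0131".casefold()        # LATIN SMALL LETTER DOTLESS I
dotted_capital_i = "\u0130".casefold() # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Both sharp s letters fold to "ss", so these two spellings match.
prussia_matches = "PREUSSEN".casefold() == "Preu\u00dfen".casefold()
```

Here both sharp s letters fold to "ss" and the long s folds to "s". "I"
folds to "i", the dotless i is left unchanged, and the dotted capital I
folds to "i" followed by U+0307 COMBINING DOT ABOVE rather than to plain
"i".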

-- Har.

From sosipiuk at gmail.com  Mon Dec 14 11:02:26 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Mon, 14 Dec 2020 12:02:26 -0500
Subject: Italics get used to express important semantic meaning,
 so unicode should support them
In-Reply-To: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>	<002401d6d110$db4ac280$91e04780$@gmail.com>	<0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
Message-ID: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>

On Sun, Dec 13, 2020 at 12:47 AM Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>
> Write a killer social media app that uses these in an integral fashion and requires them for interoperability and then sit back and watch how long they stay deprecated ...

That, or perhaps something like Wikidata could use it. ;)

I slept on it, and I'm leaning to the other side now. I think of the paper books I've read, and italics often appear within the text. Are the books "plain text"? Do the italics really fall into the category of typesetting and style, like the choice of overall font? Or are they a meaningful part of the text itself? Should it be possible to fit the content of a whole novel into a .txt file without losing any semantic meaning? The "spirit of Unicode" whispers that it should. Of course some books contain charts and graphics, and Unicode can't do everything, but if a solution can cover 95% of cases, it at least deserves consideration.

On Fri, Dec 11, 2020 at 1:13 PM Christian Kleineidam via Unicode <unicode at unicode.org> wrote:
>
> Create a new unicode character for begin/end italic formatting and begin/end bold formatting that works like the unicode character for the Right-to-Left switch.

If you or someone else chooses to make a proposal, my own recommendation would be this:

- Assign a new character U+E0002 FORMAT TAG
- The syntax follows the specification for tagging (chapter 23.9)
- U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting.
- U+E0002 U+E007F CANCEL TAG to cancel all formatting
- Any use of U+E0002 overrides previous formatting (i.e. a "bold" tag alone cancels a previous "italic" tag), so format nesting must be done by combining all desired formats into a single tag.
- This method should only be used in cases where formatting is required without a higher-level protocol
- This method should not be used in instances where loss of formatting would greatly alter the meaning of the text or render it incomprehensible.
- Strikethrough and super/subscript are deliberately omitted for the above reason.
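
To show what the scheme above would look like on the wire, here is a
minimal Python sketch. Everything in it is hypothetical: U+E0002 is not an
assigned character, and the function names are invented for this
illustration. The code points for the four styles are taken as written in
the list above.

```python
FORMAT_TAG = "\U000E0002"  # hypothetical FORMAT TAG, not an assigned character
CANCEL_TAG = "\U000E007F"  # existing CANCEL TAG

# Style code points exactly as listed in the proposal. Note that U+E0079
# is the tag counterpart of ASCII "y"; the counterpart of "u" would be
# U+E0075.
STYLE_TAGS = {
    "bold": "\U000E0062",
    "emphatic": "\U000E0065",
    "italic": "\U000E0069",
    "underlined": "\U000E0079",
}


def start_format(*styles: str) -> str:
    """Build one tag that switches on all the given styles at once."""
    return FORMAT_TAG + "".join(STYLE_TAGS[s] for s in sorted(set(styles)))


def end_format() -> str:
    """Build the sequence that cancels all formatting."""
    return FORMAT_TAG + CANCEL_TAG


def strip_tags(text: str) -> str:
    """What a program that ignores tags would show: drop U+E0000..U+E007F."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


marked = "a " + start_format("italic") + "perpetuum mobile" + end_format() + " machine"
plain = strip_tags(marked)
```

Because each new tag overrides the previous one, bold italic has to be
requested in a single call such as start_format("bold", "italic"), which
matches the nesting rule in the list.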

Advantages:
- Only a single new character needs definition.
- Uses an existing framework (tags)
- Formatting is ignorable, implementation is optional
- A viable method to preserve 95%+ of typical semantic formatting in plain-text
- IMO a stronger case to have this than either language tags or annotations (argument is to accurately preserve the lot of existing documents that include rudimentary formatting, rather than just invent new features).

Disadvantages:
https://xkcd.com/927/

Sławomir Osipiuk


From abrahamgross at disroot.org  Mon Dec 14 11:15:34 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Mon, 14 Dec 2020 17:15:34 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID: <f84e2c92-3b1f-4944-842f-ff62dbb61f5f@disroot.org>

Wait till u see signwriting. now u can draw full on pictures in unicode

Dec 14, 2020 12:04:14 PM Sławomir Osipiuk via Unicode <unicode at unicode.org>:

> Of course some books contain charts and graphics, and Unicode can't do everything,
> 


From wjgo_10009 at btinternet.com  Mon Dec 14 10:29:30 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Dec 2020 16:29:30 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
Message-ID: <6e3feae1.712.1766215e59a.Webtop.210@btinternet.com>

Asmus Freytag wrote:

> But you need to be successful first :)

Indeed.

An invention of mine, a container for which needs encoding into Unicode 
in order to achieve successful unambiguous interoperability, has been 
banned from being discussed in this mailing list and blocked from going 
before The Unicode Technical Committee because it has been deemed 
without explanation to be "out of scope".

Scope can change according to need, yet discussion of scope has also 
been blocked from taking place in the mailing list.

My posts have been placed on permanent moderated posts status so as to 
stop such discussion taking place.

So, at present, the bar is far too high for me to be able to achieve my 
goal of successful unambiguous interoperability for the invention.

Unless discussion and fair consideration by The Unicode Technical 
Committee is allowed then that success will be impossible and a 
futuristic invention will never achieve its full potential.

Encoding into Unicode would also guarantee that the technique is applied 
in a non-proprietary manner.

William Overington

Monday 14 December 2020

From wjgo_10009 at btinternet.com  Mon Dec 14 11:19:20 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Dec 2020 17:19:20 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID: <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>

Sławomir Osipiuk wrote:

> Of course some books contain charts and graphics, and Unicode can't do 
> everything, ...

In my opinion, Unicode could include charts and graphics by encoding 
them within a plain text stream if people wanted that to be encoded.

William Overington

Monday 14 December 2020


From kenwhistler at sonic.net  Mon Dec 14 13:19:12 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 14 Dec 2020 11:19:12 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
Message-ID: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>


On 12/14/2020 9:19 AM, William_J_G Overington via Unicode wrote:
> In my opinion, Unicode could include charts and graphics by encoding 
> them within a plain text stream if people wanted that to be encoded. 

You mean, as in the following sequence, shown here as a stream of plain 
text characters in email?

<img height='24' width='auto' alt="😎" src="https://www.unicode.org/images/twitter/twitter_1f60e.png">

And interpreted in the following document, in context, as HTML:

https://www.unicode.org/reports/tr51/#Major_Sources

??

--Ken

P.S. for the nitpickers... yeah, yeah, I realize that this email is 
delivered as HTML, so the "plain text" is itself using quoting 
conventions to embed in the HTML email. If you want this redelivered as 
actual plain text, I could accommodate. ?



From markus.icu at gmail.com  Mon Dec 14 13:41:13 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 14 Dec 2020 11:41:13 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
 <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
Message-ID: <CAN49p6r_YWsO9VSV3zkmiTc3mfcDWWNZ5PYWDuy3pMG7jn4O9w@mail.gmail.com>

On Mon, Dec 14, 2020 at 11:25 AM Ken Whistler via Unicode <
unicode at unicode.org> wrote:

> P.S. for the nitpickers... yeah, yeah, I realize that this email is
> delivered as HTML, so the "plain text" is itself using quoting conventions
> to embed in the HTML email. If you want this redelivered as actual plain
> text, I could accommodate. ?
>
No need. I can confirm that your email was sent as

*Content-Type: multipart/alternative*;
boundary="------------7278B446390F3BC66C4D83C4"


And that the first part is in

*Content-Type: text/plain*; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit


So we are all good.
(For Gmail users: Three-dot "More" menu on the specific message, select
"Show original")

Thanks,
markus

From costello at mitre.org  Mon Dec 14 14:17:48 2020
From: costello at mitre.org (Roger L Costello)
Date: Mon, 14 Dec 2020 20:17:48 +0000
Subject: Is there a difference between converting a string of ASCII digits to
 an integer versus a string of non-ASCII digits to an integer?
Message-ID: <SA0PR09MB690790C12F29D4C7AC9DF663C8C70@SA0PR09MB6907.namprd09.prod.outlook.com>

Hi Folks,

As I understand it, when the C programming language was created it just used ASCII. Programs written in C used ASCII digits.

Nowadays C supports Unicode and Unicode contains more digits than just the ASCII digits. (I think) modern C programs can express numbers using strings of non-ASCII digits.

Questions:

1. Is the algorithm for converting a string that contains non-ASCII digits different than the algorithm for converting a string containing ASCII digits?

2. The C function atoi() converts a string of digits to a number. I have seen the source code for atoi(). The source code that I saw was dated around the year 2000. Can you point me to the modern source code for atoi()?

/Roger


From wjgo_10009 at btinternet.com  Mon Dec 14 15:48:07 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 14 Dec 2020 21:48:07 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <3f99313c.7c9.176624383f5.Webtop.210@btinternet.com>
 <21b03a91-7c90-a4a2-ebfc-cad6f627e04e@sonic.net>
Message-ID: <31d43ca1.a6e.176633998ad.Webtop.216@btinternet.com>

Ken Whistler wrote as follows.

> You mean, as in the following sequence, shown here as a stream of 
> plain text characters in email?

<img height='24' width='auto' alt="😎" 
src="https://www.unicode.org/images/twitter/twitter_1f60e.png">

And interpreted in the following document, in context, as HTML:

https://www.unicode.org/reports/tr51/#Major_Sources

??

Actually no, for bitmap images I was thinking of a tag character 
sequence method that was proposed in a document in The Unicode Technical 
Committee Document Register some time ago that would directly embed an 
image in a plain text file, no external link. It was not authored by me, 
I cannot find it at present.

For vector graphics I was thinking of a tag character version of the 
eutographics system that I devised back in 2002. (Please note that that 
is eutographics, not  eurographics, as which it has sometimes been 
incorrectly described.)

http://www.users.globalnet.co.uk/~ngo/ast03000.htm

It worked well locally using a Java applet in a web page.

So, if The Unicode Technical Committee were to include these ideas in 
Unicode, then Unicode could enable much more information to be 
communicated unambiguously and interoperably in a plain text file.

William Overington

Monday 14 December 2020

Please note that the email address used in the listings in the 
eutographics web page is not in regular use these days.


From harjitmoe at outlook.com  Mon Dec 14 17:03:36 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 14 Dec 2020 23:03:36 +0000
Subject: Is there a difference between converting a string of ASCII digits
 to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To: <SA0PR09MB690790C12F29D4C7AC9DF663C8C70@SA0PR09MB6907.namprd09.prod.outlook.com>
References: <SA0PR09MB690790C12F29D4C7AC9DF663C8C70@SA0PR09MB6907.namprd09.prod.outlook.com>
Message-ID: <VI1PR0701MB24135271EC13BAE56B4D8A1FB7C70@VI1PR0701MB2413.eurprd07.prod.outlook.com>

Roger L Costello via Unicode wrote:
> [...]
> 2. The C function atoi() converts a string of digits to a number. I have seen the source code for atoi(). The source code that I saw was dated around the year 2000. Can you point me to the modern source code for atoi()?
>
> /Roger


Here is the implementation from the FreeBSD libc: 
https://github.com/freebsd/freebsd/blob/master/lib/libc/stdlib/strtol.c

(|strtol| and |strtol_l| are defined in that source file. |atoi| and 
|atoi_l| just wrap them, passing |NULL| for |endptr| and |10| for |base|.)
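
On the first question, a minimal sketch in Python (not the FreeBSD code,
and the function name is my own) may help show that the accumulation step
is the same whatever the script. Only the lookup of each digit's value
changes, and here that is delegated to unicodedata.decimal(), which raises
ValueError for a character that is not a decimal digit.

```python
import unicodedata


def parse_decimal(digits: str) -> int:
    """Convert a string of Unicode decimal digits to an integer.

    Same loop as for ASCII: multiply the running value by ten and add
    the value of the next digit. Signs and whitespace are not handled.
    """
    value = 0
    for ch in digits:
        value = value * 10 + unicodedata.decimal(ch)
    return value


ascii_value = parse_decimal("123")
arabic_indic_value = parse_decimal("\u0661\u0662\u0663")   # ARABIC-INDIC 1 2 3
devanagari_value = parse_decimal("\u0967\u0968\u0969")     # DEVANAGARI 1 2 3
```

Note that this sketch happily accepts a string that mixes digits from
several scripts; whether that is desirable is a separate decision from the
arithmetic.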

-- Har.

From mark at kli.org  Mon Dec 14 18:59:33 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 14 Dec 2020 19:59:33 -0500
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID: <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201214/5ec51413/attachment.htm>

From sosipiuk at gmail.com  Mon Dec 14 22:36:03 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Mon, 14 Dec 2020 23:36:03 -0500
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
Message-ID: <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>

On Mon, Dec 14, 2020 at 8:05 PM Mark E. Shoulson via Unicode
<unicode at unicode.org> wrote:
>
> All TAG symbols placed between a U+E003C TAG LESS-THAN SIGN and a U+E003E TAG GREATER-THAN SIGN, inclusive, are to be treated as if they were the corresponding ASCII characters, and run that through an HTML renderer.  I guess if you wanted you could stipulate some reduced or restricted subset of HTML

I've been informed off-list that BabelPad uses this as a formatting
option. So, it's been done.
This solution technically constitutes a higher-level protocol anyway.
It's a markup language, just using unusual characters, but it's not in
any fundamental way a Unicode feature, official or not.
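
To make the mechanics concrete, here is a minimal Python sketch of the
mapping being discussed: each printable ASCII character has a tag
character at the same offset from U+E0000. The function names are
hypothetical, and this is not a description of how BabelPad itself
implements the feature.

```python
# Hypothetical helpers: map printable ASCII onto the Unicode tag block and back.
TAG_BASE = 0xE0000  # a tag character is TAG_BASE + the ASCII code point


def to_tags(ascii_text: str) -> str:
    """Replace each printable ASCII character with its tag counterpart."""
    out = []
    for ch in ascii_text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"no tag counterpart for U+{cp:04X}")
        out.append(chr(TAG_BASE + cp))
    return "".join(out)


def from_tags(text: str) -> str:
    """Turn tag characters back into ASCII, leaving all other characters alone."""
    return "".join(
        chr(ord(ch) - TAG_BASE) if 0xE0020 <= ord(ch) <= 0xE007E else ch
        for ch in text
    )


marked = to_tags("<i>") + "Homo sapiens" + to_tags("</i>")
print(from_tags(marked))  # prints "<i>Homo sapiens</i>"
```

A process that does not interpret tag characters would simply carry
them along as default-ignorable code points, which is the whole appeal
and also why it remains markup in disguise.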

> If this sounds disturbing and wrong to you,

Disturbing? No. Wrong? I'd say "not my first choice". There are plenty
of things already approved that actually disturb me, but I won't go on
that tangent now.

> then other pseudo-markup ideas probably should as well.

Pseudo-markup already exists in Unicode, in multiple, inconsistent
ways. It exists because it was, at some point, by some people, deemed
useful enough and compatible enough with the aims of Unicode to be
included. I'm boggled by how annotations got in.

I'm well aware of scope creep and I'm not at all in favour of making
Unicode a Turing-complete programming language. That's why I proposed
something that fits into an already-established method that Unicode
has already defined. It even includes a bit of syntactic salt in the
way format nesting must be done that drives implementers to other
protocols for anything beyond rudimentary effects.

My guiding example is, "record fully the story text of a paperback
novel". There are things that are irrelevant for this purpose, such as
choice of font, or drop caps ("fancy first letters"), or page numbers,
or sizing of chapter titles, etc. Even something like
monospaced text is almost always used purely stylistically (to
indicate in-story things like signage, computer output, telegrams)
and can be substituted with imagination by engaged readers. But
italics or underlines are often a meaningful part of text and
something is lost when that formatting is lost. Necessitating a
higher-level protocol for something so simple, when it can be easily
accommodated through an existing Unicode framework, is needlessly
conservative.

The thread-starter, Christian Kleineidam, gave a different use case
but I think it's a valid one as well. I think this would be an easy
win with not a whole lot of downside.

Reading the room here, not many agree. C'est la vie.

Cheers,
Sławomir Osipiuk


From beckiergb at gmail.com  Tue Dec 15 00:47:33 2020
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Mon, 14 Dec 2020 22:47:33 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAMZ=zj6-aJ8UNhkZj2PMSMGjZT8=QSaPWGG7SQceXLgg1Fd9tw@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <CAMZ=zj6-aJ8UNhkZj2PMSMGjZT8=QSaPWGG7SQceXLgg1Fd9tw@mail.gmail.com>
Message-ID: <CAH=y87b0Csmv90xMTafsepOzo3KC5vPzcUyirA2qRs5um5FmPg@mail.gmail.com>

On Sat, Dec 12, 2020 at 6:03 PM David Starner via Unicode <
unicode at unicode.org> wrote:

> we've had character sets for seven decades,
> virtually all of which supported English, and if any have supported
> italics, I've never heard of it.


ISCII 1991 had a mechanism called ATR Codes for applying styles and
switching character sets (see Annex E of
http://varamozhi.sourceforge.net/iscii91.pdf):

EF 30 - bold
EF 31 - italic
EF 32 - underline
EF 33 - expanded
EF 34 - highlight
EF 35 - outline
EF 36 - shadow
EF 37 - double height, top half
EF 38 - double height, bottom half
EF 39 - double height and width
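
As a rough illustration of the list above, the following Python sketch
scans a byte string for the two-byte ATR sequences. Only the code
values come from the list; the function name is hypothetical, and the
sketch deliberately says nothing about how a style is switched off or
how the character-set-switching codes behave, since those details are
in Annex E rather than in this message.

```python
# Style codes taken from the list above: 0xEF followed by one selector byte.
ATR = 0xEF
ATR_STYLES = {
    0x30: "bold",
    0x31: "italic",
    0x32: "underline",
    0x33: "expanded",
    0x34: "highlight",
    0x35: "outline",
    0x36: "shadow",
    0x37: "double height, top half",
    0x38: "double height, bottom half",
    0x39: "double height and width",
}


def find_atr_codes(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, style name) for every ATR style sequence in data."""
    found = []
    i = 0
    while i < len(data):
        if data[i] == ATR and i + 1 < len(data) and data[i + 1] in ATR_STYLES:
            found.append((i, ATR_STYLES[data[i + 1]]))
            i += 2  # skip the selector byte as well
        else:
            i += 1
    return found


print(find_atr_codes(bytes([0xEF, 0x31, 0x41, 0xEF, 0x30])))
```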


Many character sets from 8-bit microcomputers had “inverse” or “reverse
video” characters that were treated as distinct from their “normal video”
counterparts. When we proposed encoding these, as atomic characters or
using variation sequences or by any other means, the UTC shot down the idea
completely.


The existence of existing character sets, even when one is a government
standard, can't even get stylistic differences like italics or reverse
video into Unicode.


-- Rebecca Bettencourt

From markus.icu at gmail.com  Tue Dec 15 11:31:36 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 15 Dec 2020 09:31:36 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAH=y87b0Csmv90xMTafsepOzo3KC5vPzcUyirA2qRs5um5FmPg@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <CAMZ=zj6-aJ8UNhkZj2PMSMGjZT8=QSaPWGG7SQceXLgg1Fd9tw@mail.gmail.com>
 <CAH=y87b0Csmv90xMTafsepOzo3KC5vPzcUyirA2qRs5um5FmPg@mail.gmail.com>
Message-ID: <CAN49p6oQ=81zn7RRoxvWWz2oob5mct=821NXT8m-Pn8Yn4ngPQ@mail.gmail.com>

On Mon, Dec 14, 2020 at 10:54 PM Rebecca Bettencourt via Unicode <
unicode at unicode.org> wrote:

> Many character sets from 8-bit microcomputers had “inverse” or “reverse
> video” characters that were treated as distinct from their “normal video”
> counterparts. When we proposed encoding these, as atomic characters or
> using variation sequences or by any other means, the UTC shot down the idea
> completely.
>

Early computing systems conflated layers of processing where modern ones
separate them. For example, a quarter of ASCII and of EBCDIC, respectively,
was used for control codes which we inherited but which are now mostly
unused because we use lower-level mechanisms instead that carry text purely
as payload.

I think the plain text / rich text distinction has been quite successful. I
don't actually personally like the math-styled characters because they seem
specific to a particular math tradition. When I was in high school, the
vector-math teacher gave us a choice between the old style of using
Fraktur/Sütterlin for vector variables vs. the new style of regular letters
with an arrow on top. "Vector" markup with different style choices seems
better for this kind of thing.

markus

From richard.wordingham at ntlworld.com  Tue Dec 15 13:10:01 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 15 Dec 2020 19:10:01 +0000
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <CABWuLVeRX2eLYWGjW17-BY-mP92CRSWvHWWd_xzi+QdDKNF+sw@mail.gmail.com>
References: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>
 <20201214005156.6125d895@JRWUBU2>
 <CABWuLVeRX2eLYWGjW17-BY-mP92CRSWvHWWd_xzi+QdDKNF+sw@mail.gmail.com>
Message-ID: <20201215191001.356f5795@JRWUBU2>

On Sun, 13 Dec 2020 20:08:08 -0800
Zach Lym via Unicode <unicode at unicode.org> wrote:

> > What does that quoted statement mean?  I'm having a hard job working
> > out what the meaning of full case folding is.  I'm not having any
> > doubts about the meaning of toCasefold(NFD(X)), so there is no issue
> > for 'canonical caseless matching'.  
> 
> The "case folding is closed under canonical normalization" or the
> other part?

That part.

> Closed as in closure:
> https://en.wikipedia.org/wiki/Closure_(mathematics)

That only tells me what it means for a _set_ to be closed under an
operation.  What does it mean for a _function_ (or similar) to be
closed under an operation?

If I must use the definition for a set, then I can only conclude that
for one operation to be closed under another operation, the result
should be independent of the order in which they are applied.

But for X = <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0342
COMBINING GREEK PERISPOMENI>:

NFD(toCasefold(X)) = <U+03B1 GREEK SMALL LETTER ALPHA, U+03B9 GREEK
SMALL LETTER IOTA, U+0342>

toCasefold(NFD(X)) = <U+03B1, U+0342, U+03B9>

NFC(toCasefold(X)) = <U+03B1, U+1FD6 GREEK SMALL LETTER IOTA WITH
PERISPOMENI>

toCasefold(NFC(X)) = <U+1FB6 GREEK SMALL LETTER ALPHA WITH PERISPOMENI,
U+03B9>

So either "case folding is closed under canonical normalization" means
something else, or it is simply not true.
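
The four compositions above can be reproduced with a short Python
sketch. It assumes that Python's str.casefold() is an acceptable
stand-in for toCasefold (full case folding) and that the interpreter's
bundled Unicode data is recent enough; both are assumptions of this
illustration, not something stated in the thread.

```python
import unicodedata

# X = <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI,
#      U+0342 COMBINING GREEK PERISPOMENI>
X = "\u1fb3\u0342"


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def show(s: str) -> str:
    """Format a string as a sequence of U+XXXX code points."""
    return "<" + ", ".join(f"U+{ord(c):04X}" for c in s) + ">"


# Apply the two operations in both orders, for both normalization forms.
results = {
    "NFD(toCasefold(X))": nfd(X.casefold()),
    "toCasefold(NFD(X))": nfd(X).casefold(),
    "NFC(toCasefold(X))": nfc(X.casefold()),
    "toCasefold(NFC(X))": nfc(X).casefold(),
}
for label, value in results.items():
    print(label, "=", show(value))
```

If the order of folding and normalizing did not matter, the first two
lines of output would be identical.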

> Refer to page 240 of the standard, Chapter 5 "Implementation
> Guidelines" Section 18 "Case Mappings":
> 
> http://www.unicode.org/versions/latest/ch05.pdf

Why?

The trick is not to be deflected by the opening paragraph in TUS
Section 3.13, but to read on to find R4.

Richard.

From indolering at gmail.com  Tue Dec 15 14:04:00 2020
From: indolering at gmail.com (Zach Lym)
Date: Tue, 15 Dec 2020 12:04:00 -0800
Subject: Normalization Generics (NFx, NFKx, NFxy)
In-Reply-To: <20201215191001.356f5795@JRWUBU2>
References: <CABWuLVf5jeYTTOX9jR0VVBVWoro4YebZ0-MKF-VeeY3vZ6qudg@mail.gmail.com>
 <20201214005156.6125d895@JRWUBU2>
 <CABWuLVeRX2eLYWGjW17-BY-mP92CRSWvHWWd_xzi+QdDKNF+sw@mail.gmail.com>
 <20201215191001.356f5795@JRWUBU2>
Message-ID: <CABWuLVesTasCU==5nYTgR_fu2r2gfbV9+t4rZQgNXuG+b27f8w@mail.gmail.com>

Okay, so points for pedantry ... but do you have any input on adding
normalization generics to Unicode pseudocode?

Or would you like to split this discussion out into a new topic?

On Tue, Dec 15, 2020 at 11:21 AM Richard Wordingham via Unicode
<unicode at unicode.org> wrote:
>
> On Sun, 13 Dec 2020 20:08:08 -0800
> Zach Lym via Unicode <unicode at unicode.org> wrote:
>
> > > What does that quoted statement mean?  I'm having a hard job working
> > > out what the meaning of full case folding is.  I'm not having any
> > > doubts about the meaning of toCasefold(NFD(X)), so there is no issue
> > > for 'canonical caseless matching'.
> >
> > The "case folding is closed under canonical normalization" or the
> > other part?
>
> That part.
>
> > Closed as in closure:
> > https://en.wikipedia.org/wiki/Closure_(mathematics)
>
> That only tells me what it means for a _set_ to be closed under an
> operation.  What does it mean for a _function_ (or similar) to be
> closed under an operation?
>
> If I must use the definition for a set, then I can only conclude that
> for one operation to be closed under another operation, the result
> should be independent of the order in which they are applied.
>
> But for X = <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0342
> COMBINING GREEK PERISPOMENI>:
>
> NFD(toCasefold(X)) = <U+03B1 GREEK SMALL LETTER ALPHA, U+03B9 GREEK
> SMALL LETTER IOTA, U+0342>
>
> toCasefold(NFD(X)) = <U+03B1, U+0342, U+03B9>
>
> NFC(toCasefold(X)) = <U+03B1, U+1FD6 GREEK SMALL LETTER IOTA WITH
> PERISPOMENI>
>
> toCasefold(NFC(X)) = <U+1FB6 GREEK SMALL LETTER ALPHA WITH PERISPOMENI,
> U+03B9>
>
> So either "case folding is closed under canonical normalization" means
> something else, or it is simply not true.
>
> > Refer to page 240 of the standard, Chapter 5 "Implementation
> > Guidelines" Section 18 "Case Mappings":
> >
> > http://www.unicode.org/versions/latest/ch05.pdf
>
> Why?
>
> The trick is not to be deflected by the opening paragraph in TUS
> Section 3.13, but to read on to find R4.
>
> Richard.

From abrahamgross at disroot.org  Tue Dec 15 14:52:35 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Tue, 15 Dec 2020 20:52:35 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
Message-ID: <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>

Unicode refused to encode Arabic letter variants (not counting compatibility characters), which are taught in school, are used by adults, and are how Arabic is written, so your argument here doesn't hold water.

Dec 12, 2020 8:26:10 PM Martin J. Dürst via Unicode <unicode at unicode.org>:

> Children learn to write with upper case and lower case letters in school, and most people continue to use both as adults. (There are exceptions of course, some people write only with lower case, and some only with upper case.)
> 


From indolering at gmail.com  Tue Dec 15 16:28:55 2020
From: indolering at gmail.com (Zach Lym)
Date: Tue, 15 Dec 2020 14:28:55 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID: <CABWuLVfGn1sgZzYTb0MfZ=WG8hNSXUgRrFbwagKwBPY_bVhRTg@mail.gmail.com>

> If you or someone else chooses to make a proposal, my own recommendation would be this:
>
> - Assign a new character U+E0002 FORMAT TAG
> - The syntax follows the specification for tagging (chapter 23.9)
> - U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting.

How would one implement blink?  I would consider that top priority, as
it was explicitly designed for styling plaintext.

From richard.wordingham at ntlworld.com  Tue Dec 15 16:32:16 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 15 Dec 2020 22:32:16 +0000
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <000301d6d00e$4d244330$e76cc990$@ewellic.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
Message-ID: <20201215223216.339e3a0a@JRWUBU2>

On Fri, 11 Dec 2020 15:38:07 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> Christian Kleineidam wrote:
> 
> > "Evidence suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2
> > 𝑀𝐴𝑃𝑇 haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠"
> 
> "Evidence suggesting that Homo neanderthalensis contributed the H2
> MAPT haplotype to Homo sapiens"
> 
> This title is completely meaningful in plain text. The convention to
> style the names of species and haplotypes in italics is just that, a
> styling convention.

Yet there are cases where meaning is completely lost.  There was a
Latin script spelling for Pali and Sanskrit that used italicised
guttural letters for palatals, and italicised letters where nowadays we
normally have a dot below.  I think this scheme was introduced by Max
Mueller.  Thus, a Sanskrit sequence meaning 'and this' is written not
'tacca' but 'ta𝑘𝑘a'.  (I naturally misread the latter as though it were
'takka'.)  That naturally raises the question of how such italic letters
are to be italicised!

I've also seen phonetic respelling of English in the Thai script
where italicised consonants are used for English consonants for which
Thai has no equivalent.

When documenting programs, there is a massive gain in readability when
the lower case names of programs and variables are written out in a
typewriter-style font like Courier.  (Some monospace fonts lack the
distinctiveness.)

Richard.


From kent.b.karlsson at bahnhof.se  Tue Dec 15 17:07:05 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 16 Dec 2020 00:07:05 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them 
In-Reply-To: <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
Message-ID: <F04B5881-567E-4EB3-9D5D-C2DA9E7E5261@bahnhof.se>

(Below)

> 14 dec. 2020 kl. 18:02 skrev Sławomir Osipiuk via Unicode <unicode at unicode.org>:


> If you or someone else chooses to make a proposal, my own recommendation would be this:
> 
> - Assign a new character U+E0002 FORMAT TAG
> - The syntax follows the specification for tagging (chapter 23.9)
> - U+E0002 can be followed by any combination of U+E0062 (bold) U+E0065 (emphatic) U+E0069 (italic) and U+E0079 (underlined) to indicate a span of text with that formatting.
> - U+E0002 U+E007F CANCEL TAG to cancel all formatting
> - Any use of U+E0002 overrides previous formatting (i.e. a "bold" tag alone cancels a previous "italic" tag), so format nesting must be done by combining all desired formats into a single tag.
> - This method should only be used in cases where formatting is required without a higher-level protocol
> - This method should not be used in instances where loss of formatting would greatly alter the meaning of the text or render it incomprehensible.
> - Strikethrough and super/subscript are deliberately omitted for the above reason.
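
Read as a state machine, the quoted proposal is small enough to sketch
in a few lines of Python. Everything here is illustrative: U+E0002 is
not an assigned character, the constant and function names are
hypothetical, and the handling of malformed input is an arbitrary
choice made for this sketch.

```python
FORMAT_TAG = "\U000E0002"  # hypothetical: proposed in the quote, unassigned today
CANCEL_TAG = "\U000E007F"  # U+E007F CANCEL TAG
FORMATS = {
    "\U000E0062": "bold",        # TAG LATIN SMALL LETTER B
    "\U000E0065": "emphatic",    # TAG LATIN SMALL LETTER E
    "\U000E0069": "italic",      # TAG LATIN SMALL LETTER I
    "\U000E0079": "underlined",  # TAG LATIN SMALL LETTER Y
}


def parse_format_tags(text: str) -> list[tuple[str, frozenset]]:
    """Split text into (run, formats) pairs following the quoted rules."""
    spans = []
    current = frozenset()  # formats in effect for the run being collected
    buf = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != FORMAT_TAG:
            buf.append(ch)
            i += 1
            continue
        if buf:  # a new tag ends the current run
            spans.append(("".join(buf), current))
            buf = []
        i += 1
        new = set()
        # Each tag overrides the previous one, so start from an empty set.
        while i < len(text) and (text[i] in FORMATS or text[i] == CANCEL_TAG):
            if text[i] in FORMATS:
                new.add(FORMATS[text[i]])
            i += 1
        current = frozenset(new)
    if buf:
        spans.append(("".join(buf), current))
    return spans
```

Because a tag replaces the whole format set, "bold italic" has to be
written as one tag carrying both letters, which is the nesting
restriction the proposal describes.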

Now, where did I see something very much like this??? 

…

…

Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the “invisible by default” (“default ignorable”) IF parsed correctly). And… ECMA-48 is already a standard. And… ECMA-48 is already successful, and still used every day by very many people. Though it is primarily used in terminal emulators. (Nit: ECMA-48 does have strikethrough… And more. As does HTML/CSS, and when doing “copy as plain text”, that formatting also disappears.)

Your U+E0002 FORMAT TAG: ECMA-48  CSI … m
Your U+E0062 (bold): ECMA-48  CSI 1m
Your U+E0065 (emphatic): don’t know what you mean by that
Your U+E0069 (italic): ECMA-48  CSI 3m
Your U+E0079 (underlined): ECMA-48 CSI 4m
Your U+E007F CANCEL TAG: ECMA-48  CSI 0m
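
For readers who have not met these sequences, a minimal Python sketch
of the SGR codes listed above follows. It assumes the common 7-bit
form of CSI (ESC followed by "["); the helper names are hypothetical,
and whether italic actually renders depends on the terminal emulator.

```python
CSI = "\x1b["  # 7-bit representation of CSI: ESC [

# SGR parameter values from the mapping above.
SGR = {"reset": 0, "bold": 1, "italic": 3, "underline": 4}


def sgr(*names: str) -> str:
    """Build one SGR control sequence, e.g. sgr('bold', 'italic')."""
    return CSI + ";".join(str(SGR[name]) for name in names) + "m"


def styled(text: str, *names: str) -> str:
    """Wrap text in the requested styles and reset afterwards."""
    return sgr(*names) + text + sgr("reset")


print(styled("Homo sapiens", "italic"))
```

Several parameters can share one sequence, separated by semicolons, so
combined styles do not need nesting here either.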

It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the “default ignorable” part of this a bit easier.

Extra nit: Some markdowns (however did that name stick?) allow for strikethrough as well, as -stricken-. Though a bit intuitive, it way too often has an unexpected effect where no strikethrough was intended (try doing “ls -l” in your Linux terminal, and paste the result into some place that has that kind of markdown).

“Math Italic” is a hack for MathML. If done right, MathML would not have needed them either. “Math Italic” for emphasis in running text (not MathML) only “works” (sort of, and partially) for English, nearly no other language. Please don’t use the “Math italic/bold/etc” outside of MathML.

/Kent Karlsson

PS
First edition of ECMA-48 came in 1976. About 44 years ago.


> Advantages:
> - Only a single new character needs definition.
> - Uses an existing framework (tags)
> - Formatting is ignorable, implementation is optional
> - A viable method to preserve 95%+ of typical semantic formatting in plain-text
> - IMO a stronger case to have this than either language tags or annotations (argument is to accurately preserve the lot of existing documents that include rudimentary formatting, rather than just invent new features).
> 
> Disadvantages:
> https://xkcd.com/927/
> 
> Sławomir Osipiuk
> 
> 


From billposer2 at gmail.com  Tue Dec 15 17:10:07 2020
From: billposer2 at gmail.com (Bill Poser)
Date: Tue, 15 Dec 2020 15:10:07 -0800
Subject: Is there a difference between converting a string of ASCII digits
 to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To: <SA0PR09MB690790C12F29D4C7AC9DF663C8C70@SA0PR09MB6907.namprd09.prod.outlook.com>
References: <SA0PR09MB690790C12F29D4C7AC9DF663C8C70@SA0PR09MB6907.namprd09.prod.outlook.com>
Message-ID: <CACPRsRQ+ByEEGEVdUZU=8edChc8BAeryEjbx01ZR7YD+E8tjqw@mail.gmail.com>

What do you mean by "non-ASCII digits"? Things like superscript and
subscript versions of the usual Western "Arabic" numbers? Or are you
talking about numbers like those of Chinese, roman numerals, Tamil, etc.?
In the case of the former, once you map the digits to their standard forms,
the algorithm is the same. In the case of the latter, no, in many cases
very different algorithms are required.

On Mon, Dec 14, 2020 at 12:28 PM Roger L Costello via Unicode <
unicode at unicode.org> wrote:

> Hi Folks,
>
> As I understand it, when the C programming language was created it just
> used ASCII. Programs written in C used ASCII digits.
>
> Nowadays C supports Unicode and Unicode contains more digits than just the
> ASCII digits. (I think) modern C programs can express numbers using strings
> of non-ASCII digits.
>
> Questions:
>
> 1. Is the algorithm for converting a string that contains non-ASCII digits
> different than the algorithm for converting a string containing ASCII
> digits?
>
> 2. The C function atoi() converts a string of digits to a number. I have
> seen the source code for atoi(). The source code that I saw was dated
> around the year 2000. Can you point me to the modern source code for atoi()?
>
> /Roger
>
>

From mark at kli.org  Tue Dec 15 17:26:31 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 15 Dec 2020 18:26:31 -0500
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
 <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>
Message-ID: <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201215/4ed407d3/attachment.htm>

From markus.icu at gmail.com  Tue Dec 15 17:45:11 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 15 Dec 2020 15:45:11 -0800
Subject: Is there a difference between converting a string of ASCII digits
 to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To: <CACPRsRQ+ByEEGEVdUZU=8edChc8BAeryEjbx01ZR7YD+E8tjqw@mail.gmail.com>
References: <SA0PR09MB690790C12F29D4C7AC9DF663C8C70@SA0PR09MB6907.namprd09.prod.outlook.com>
 <CACPRsRQ+ByEEGEVdUZU=8edChc8BAeryEjbx01ZR7YD+E8tjqw@mail.gmail.com>
Message-ID: <CAN49p6puTkfE-C3YATgBFyfkCU+pgb_vFe5eAsrv7o3hZ4odPw@mail.gmail.com>

I suspect that Roger is just looking at decimal digits (property gc=Nd
<https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Agc%3DNd%3A%5D%26%5B%3Anv%3D4%3A%5D&g=bc&i=>
).
I believe that they can all be parsed like strings of ASCII digits (and you
can call ICU or other libraries to get at the digit values and other
properties).
I suggest you double-check about the RTL digits (N'Ko & Adlam); please take
a look at the relevant Unicode book chapters.
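
As a rough illustration of "parsed like strings of ASCII digits", the
Python sketch below uses the standard unicodedata module to obtain each
character's decimal value and then applies the usual multiply-by-ten
loop. The function name is hypothetical; the sketch assumes the most
significant digit is stored first, does not reject strings that mix
digits from different scripts, and ignores signs and separators.

```python
import unicodedata


def parse_decimal(text: str) -> int:
    """Convert a string of decimal digits (gc=Nd, any script) to an int."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        try:
            # decimal() only succeeds for true decimal digits.
            digit = unicodedata.decimal(ch)
        except ValueError:
            raise ValueError(f"not a decimal digit: U+{ord(ch):04X}") from None
        value = value * 10 + digit
    return value


print(parse_decimal("2020"))
print(parse_decimal("\u0661\u0662\u0663"))  # ARABIC-INDIC DIGITS ONE, TWO, THREE
```

Superscript digits are not decimal digits in this sense, so the sketch
rejects them; they would need to be mapped to their standard forms
first.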

What's more interesting is handling the grouping and decimal separators
which differ by both language and region.

Best regards,
markus

From sosipiuk at gmail.com  Tue Dec 15 18:41:09 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Tue, 15 Dec 2020 19:41:09 -0500
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
 <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>
 <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
Message-ID: <CAM+ijLjkNdAB15VykfVOsz7zGk-UqAd2Urzp+-O5w4d9w-LfdA@mail.gmail.com>

On Tue, Dec 15, 2020 at 6:26 PM Mark E. Shoulson <mark at kli.org> wrote:
>
> But how is that different from anything being proposed?  If this idea were accepted as part of Unicode, then it *would* be a feature of Unicode, just as whatever is being proposed would be if it were accepted.  How does it matter if italicizing something is marked by some new U+DEADBF characters or by existing tag characters?

- Rather than a completely new method, it's "just" an extension of an
existing feature. (Tag syntax, scope, and default ignorability are
already defined in the Unicode standard)
- The syntax "naturally" discourages complicated format nesting.
Unicode may formally restrict format combos.

> If you insist that Unicode-compliant text readers must show italics or bold when marked with such-and-such characters,

Absolutely not!

> Conversely, if you're okay with pseudo-markup, this should sound fine to you.  Why doesn't it?

"Not my first choice" is what I said. It's not bad, but its similarity
to HTML is not a good thing in my eyes, because it raises the question
"I can do this in HTML, why can't I do it in UnicodeML??" and push for
more and more HTML features to be included. It encourages feature
creep, which I said I'm against. Familiarity is not always a good
thing.

> (how would this markup interact with other markup, like HTML, I wonder?)

(From the Unicode Standard, page 916, with [] additions by me; notice
how little the text changes)

"The rules for Unicode conformance for the tag characters are exactly
the same as those for any other Unicode characters. A conformant
process is not required to interpret the tag characters. If it does
interpret them, it should interpret them according to the standard --
that is, as spelled-out tags. However, there is no requirement to
provide a particular interpretation of the text because it is tagged
with a given language [or formatting]. If an application does not
interpret tag characters, it should leave their values undisturbed and
do whatever it does with any other uninterpreted characters.
[...]
"Implementations of Unicode that already make use of out-of-band
mechanisms for language [or format] tagging or “heavy-weight” in-band
mechanisms such as XML or HTML will continue to do exactly what they
are doing and will ignore the tag characters completely. They may even
prohibit their use to prevent conflicts with the equivalent markup."

Sławomir Osipiuk


From richard.wordingham at ntlworld.com  Tue Dec 15 18:42:29 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 16 Dec 2020 00:42:29 +0000
Subject: Is there a difference between converting a string of ASCII
 digits to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To: <CAN49p6puTkfE-C3YATgBFyfkCU+pgb_vFe5eAsrv7o3hZ4odPw@mail.gmail.com>
References: <SA0PR09MB690790C12F29D4C7AC9DF663C8C70@SA0PR09MB6907.namprd09.prod.outlook.com>
 <CACPRsRQ+ByEEGEVdUZU=8edChc8BAeryEjbx01ZR7YD+E8tjqw@mail.gmail.com>
 <CAN49p6puTkfE-C3YATgBFyfkCU+pgb_vFe5eAsrv7o3hZ4odPw@mail.gmail.com>
Message-ID: <20201216004229.51af1612@JRWUBU2>

On Tue, 15 Dec 2020 15:45:11 -0800
Markus Scherer via Unicode <unicode at unicode.org> wrote:

> I suspect that Roger is just looking at decimal digits (property gc=Nd
> <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Agc%3DNd%3A%5D%26%5B%3Anv%3D4%3A%5D&g=bc&i=>
> ).
> I believe that they can all be parsed like strings of ASCII digits
> (and you can call ICU or other libraries to get at the digit values
> and other properties).
> I suggest you double-check about the RTL digits (N'Ko & Adlam);
> please take a look at the relevant Unicode book chapters.

It looks as though the N'ko section documents the significance by
accident!  I thought a policy was going to be documented (2012 or
slightly later) that decimal digits are stored most significant
digit first, but that doesn't seem to have happened.

Richard.

From sosipiuk at gmail.com  Tue Dec 15 19:14:42 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Tue, 15 Dec 2020 20:14:42 -0500
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <F04B5881-567E-4EB3-9D5D-C2DA9E7E5261@bahnhof.se>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <F04B5881-567E-4EB3-9D5D-C2DA9E7E5261@bahnhof.se>
Message-ID: <CAM+ijLg_Yr45OOX=f1=HLjyW+5RQrQCYmLvAhy4RBfwPgh-PPg@mail.gmail.com>

On Tue, Dec 15, 2020 at 6:07 PM Kent Karlsson
<kent.b.karlsson at bahnhof.se> wrote:
> Now, where did I see something very much like this???
> Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the “invisible by default” (“default ignorable”) IF parsed correctly).

ECMA-48 aka ISO 6429 was on my mind the moment I read the OP. I didn't
mention it because it's a bit outdated (even if I do have a fondness
for it) and if you're using such a thing, why not a more modern HTML
subset, or BBCode, or any number of other options in use or from the
list the OP gave? There are, after all, so many to choose from. And if
none of those satisfy, you can always make your own!

But that "if parsed correctly" is quite the nit, isn't it?

> It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the “default ignorable” part of this a bit easier.

And this is just the BabelPad solution but applied to a different
protocol. Replacing regular markup by corresponding characters from
the tag block to gain ignorable-ness may seem like a cool idea at
first, but it's just spinning yet another markup. (With no offense
intended to BabelPad's author; it's not a bad idea except that it
starts at the bottom of the mountain just like any other.) Tag syntax
is already part of Unicode. I'd rather use it than import something
wholesale from another protocol.

Finally, what I'm envisioning -- and I'm not sure how closely this
matches Christian Kleineidam's intention (where did he go, anyway?) --
is not Yet Another Presentation Layer or a Shiny New Toy for people to
use in their tweets, but more of a sombre hint that "in the original
source document, this text had an alternative presentation; indicate
this to the user in an appropriate way, if applicable". It's meant for
preservation, not decoration. That's why I hear the "spirit of
Unicode".

Sławomir Osipiuk


From copypaste at kittens.ph  Tue Dec 15 19:58:57 2020
From: copypaste at kittens.ph (Fredrick Brennan)
Date: Tue, 15 Dec 2020 20:58:57 -0500
Subject: =?UTF-8?B?Mcui4bWXLCAy4oG/4bWILCAzyrPhtYgsIDThtZfKsCDigKYgOeG1l8qw?=
Message-ID: <9137826.KFeHLySHN7@laptop>

Hello!

With Unicode superscript lowercase letters, dates with superscript ordinal 
indicators in English can be written in plaintext, e.g.:

1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so on.
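
A small Python sketch of how such ordinals could be generated is given
below. The letter choices follow the subject line of this message
(modifier letters such as U+02E2 and U+1D57); the function names are
hypothetical, and the sketch takes no position on whether this use of
the characters is appropriate.

```python
# Superscript/modifier letters needed for "st", "nd", "rd" and "th".
SUPERSCRIPT = {
    "s": "\u02e2",  # MODIFIER LETTER SMALL S
    "t": "\u1d57",  # MODIFIER LETTER SMALL T
    "n": "\u207f",  # SUPERSCRIPT LATIN SMALL LETTER N
    "d": "\u1d48",  # MODIFIER LETTER SMALL D
    "r": "\u02b3",  # MODIFIER LETTER SMALL R
    "h": "\u02b0",  # MODIFIER LETTER SMALL H
}


def ordinal_suffix(n: int) -> str:
    """Return the English ordinal suffix for a non-negative integer."""
    if 10 <= n % 100 <= 20:  # 11th, 12th, 13th, ...
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def superscript_ordinal(n: int) -> str:
    """Format n with its ordinal suffix in superscript letters."""
    return str(n) + "".join(SUPERSCRIPT[c] for c in ordinal_suffix(n))


print(", ".join(superscript_ordinal(n) for n in (1, 2, 3, 4)))
```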

The only problem I've encountered is in font fallback; fonts are more likely to 
contain ⁿ than the other letters due to its use in Pe̍h-ōe-jī and IPA. So, ⁿ often 
appears in a different style in the word 2ⁿᵈ for example. This can be somewhat 
avoided by using a font which supports all the letters, such as Gentium Plus, EB 
Garamond, etc.

However, I have a feeling that this use is an abuse of the standard, but that brings 
up an interesting comparison with the ordinal indicators for Spanish, Portuguese 
(& other languages?), the masculine ? and the feminine ?.

If anyone has time to answer, why is one an abuse and the other not, if indeed 1?? is 
an abuse as I think?

If it's not an abuse, then that could perhaps be an argument for the necessity of 
encoding ????????? ???????? ?????? s???? ?, as ? is one of the few letters without a 
combining counterpart in Cyrillic Extended-A or Extended-B. (Of course, no-break 
spaces would need to be used to write Russian 2-? if this character were 
to be encoded, e.g. as U+32 U+A0 U+XXXX, while no-break spaces aren't needed 
for Latin.)

Best,
Fred Brennan

From copypaste at kittens.ph  Tue Dec 15 20:04:55 2020
From: copypaste at kittens.ph (Fredrick Brennan)
Date: Tue, 15 Dec 2020 21:04:55 -0500
Subject: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ
In-Reply-To: <9137826.KFeHLySHN7@laptop>
References: <9137826.KFeHLySHN7@laptop>
Message-ID: <2171140.htiGsxgcq4@laptop>

Oh dear, my email-client was erroneously configured to use SHIFT_JIS, which 
mangled my message.

Corrections...

On Tuesday, December 15, 2020 8:58:57 PM EST I wrote:
> 1?? of January, 2?? of February, 3?? of March, 4?? of April, and so on.

1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so on.

> to contain ? than the other letters due to its use in Pe?h-?e-j? and IPA.
> So, ? often appears in a different style in the word 2?? for example.

to contain ⁿ than the other letters due to its use in Pe̍h-ōe-jī and IPA. So, ⁿ 
often appears in a different style in the word 2ⁿᵈ for example.
 
> the masculine ? and the feminine ?.
the masculine º and the feminine ª.

> if indeed 1?? is an abuse as I think?
if indeed 1ˢᵗ is an abuse as I think?

> necessity of encoding ????????? ???????? ?????? s???? ?
necessity of encoding COMBINING CYRILLIC LETTER SHORT I

Very ironic :)

Best,
Fred Brennan


From mark at kli.org  Tue Dec 15 20:36:07 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 15 Dec 2020 21:36:07 -0500
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAM+ijLjkNdAB15VykfVOsz7zGk-UqAd2Urzp+-O5w4d9w-LfdA@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
 <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>
 <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
 <CAM+ijLjkNdAB15VykfVOsz7zGk-UqAd2Urzp+-O5w4d9w-LfdA@mail.gmail.com>
Message-ID: <0e31ec9e-490f-191a-c912-5a56d9abb602@kli.org>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201215/f9eb1e71/attachment.htm>

From indolering at gmail.com  Tue Dec 15 21:18:41 2020
From: indolering at gmail.com (Zach Lym)
Date: Tue, 15 Dec 2020 19:18:41 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAM+ijLg_Yr45OOX=f1=HLjyW+5RQrQCYmLvAhy4RBfwPgh-PPg@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <F04B5881-567E-4EB3-9D5D-C2DA9E7E5261@bahnhof.se>
 <CAM+ijLg_Yr45OOX=f1=HLjyW+5RQrQCYmLvAhy4RBfwPgh-PPg@mail.gmail.com>
Message-ID: <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>

> Finally, what I'm envisioning – and I'm not sure how closely this
> matches Christian Kleineidam's intention (where did he go, anyway?) –
> is not Yet Another Presentation Layer or a Shiny New Toy for people to
> use in their tweets, but more of a sombre hint that "in the original
> source document, this text had an alternative presentation; indicate
> this to the user in an appropriate way, if applicable". It's meant for
> preservation, not decoration. That's why I hear the "spirit of
> Unicode".

For those of us that can recall the exuberance of the XHTML movement,
<i>, <b> and friends were all deemed to be insufficiently semantic and
slated to be replaced by <em> and <strong>.  Of course, this was a
distinction without a difference and now we just have extra tags that
are more verbose and less literal.

But that raises the question: if the authors of a rich text standard
can't agree on what counts as semantic, how would Unicode decide?
What about <mark>, <strikethrough>, or as I previously suggested
<blink>?  <blink> was added to HTML because it was the only
styling that could be displayed in plaintext console environments.
So if <blink> doesn't make your cutoff, then I guess the bar is personal
taste?

The line between semantics and styling is inherently fuzzy, but every
attempt at encoding similarly fuzzy semantics within Unicode is
something humanity must deal with for the rest of all time.  Take the
newline vs paragraph separators, a noble attempt at trying to encode
what essentially amounts to the plaintext/typewriter hack of using
\n\n to insert whitespace after a paragraph.  No-one uses either of
them, not even Markdown (which does use <em> and <strong>) because
most plain text doesn't make the distinction, users can't input it via
a keyboard, and no one else supports it.  Yet a colleague and I
had to spend waaaay too much of our short lives figuring out what to
support as breaking separators in WASI text streams.

What puzzles me is why this discussion wasn't moderated to the null
bin.  This *exact* question is answered in the FAQ and is regularly
shot down.


-Zach Lym


From prosfilaes at gmail.com  Tue Dec 15 22:19:46 2020
From: prosfilaes at gmail.com (David Starner)
Date: Tue, 15 Dec 2020 20:19:46 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAM+ijLjkNdAB15VykfVOsz7zGk-UqAd2Urzp+-O5w4d9w-LfdA@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
 <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>
 <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
 <CAM+ijLjkNdAB15VykfVOsz7zGk-UqAd2Urzp+-O5w4d9w-LfdA@mail.gmail.com>
Message-ID: <CAMZ=zj6hBhzAgvd7Qx2ZUtAXCRNEdf-RpHxcNL6j1Bzdw+85Nw@mail.gmail.com>

On Tue, Dec 15, 2020 at 4:47 PM Sławomir Osipiuk via Unicode
<unicode at unicode.org> wrote:
> "Implementations of Unicode that already make use of out-of-band
> mechanisms for language [or format] tagging or “heavy-weight” in-band
> mechanisms such as XML or HTML will continue to do exactly what they
> are doing and will ignore the tag characters completely. They may even
> prohibit their use to prevent conflicts with the equivalent markup."

So every single thing that interfaces with HTML now has to handle
Unicode italics on any plain text input, or silently dump them into
the stream, and the web browser may have to handle them or not.

> It's meant for preservation, not decoration.

I've done preservation, and don't see how this helps at all. You can
go with various preservation file formats, like TEI Lite, or various
more directly readable file formats like HTML or PDF. None of those
has any problem handling italics. Plain text willfully drops many
details, so probably isn't a realistic choice for preservation.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)


From asmusf at ix.netcom.com  Tue Dec 15 23:49:48 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 15 Dec 2020 21:49:48 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAMZ=zj6hBhzAgvd7Qx2ZUtAXCRNEdf-RpHxcNL6j1Bzdw+85Nw@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
 <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>
 <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
 <CAM+ijLjkNdAB15VykfVOsz7zGk-UqAd2Urzp+-O5w4d9w-LfdA@mail.gmail.com>
 <CAMZ=zj6hBhzAgvd7Qx2ZUtAXCRNEdf-RpHxcNL6j1Bzdw+85Nw@mail.gmail.com>
Message-ID: <416d509b-b97c-5153-ec4c-aae451570919@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201215/6a4b4646/attachment.htm>

From john.w.kennedy at gmail.com  Wed Dec 16 07:13:07 2020
From: john.w.kennedy at gmail.com (John W Kennedy)
Date: Wed, 16 Dec 2020 08:13:07 -0500
Subject: Italics get used to express important semantic meaning,
 so unicode should support them
In-Reply-To: <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
References: <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
Message-ID: <C40CD884-F1F6-45BD-8A28-574D0B47B432@gmail.com>


-- 
John W. Kennedy
"Compact is becoming contract,
Man only earns and pays."
 -- Charles Williams.  "Bors to Elayne:  On the King's Coins"

> On Dec 15, 2020, at 10:25 PM, Zach Lym via Unicode <unicode at unicode.org> wrote:
> 
>
>> 
>> Finally, what I'm envisioning – and I'm not sure how closely this
>> matches Christian Kleineidam's intention (where did he go, anyway?) –
>> is not Yet Another Presentation Layer or a Shiny New Toy for people to
>> use in their tweets, but more of a sombre hint that "in the original
>> source document, this text had an alternative presentation; indicate
>> this to the user in an appropriate way, if applicable". It's meant for
>> preservation, not decoration. That's why I hear the "spirit of
>> Unicode".
> 
> For those of us that can recall the exuberance of the XHTML movement,
> <i>, <b> and friends were all deemed to be insufficiently semantic and
> slated to be replaced by <em> and <strong>.  Of course, this was a
> distinction without a difference and now we just have extra tags that
> are more verbose and less literal.

<em> and <strong> go back to HTML+ in 1993, where they replaced <hp1> and <hp2> from the original HTML, which had inherited them from IBM’s original GML (no S) of the 1970s.


From costello at mitre.org  Wed Dec 16 07:47:58 2020
From: costello at mitre.org (Roger L Costello)
Date: Wed, 16 Dec 2020 13:47:58 +0000
Subject: Unicode is universal, so how come that universality doesn’t
 apply to digits?
Message-ID: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>

Hi Folks,
Unicode makes it possible to write things in different languages.
For example, rather than this XML:
<Number_Students>42</Number_Students>
a Bengali-speaking person can write this:
<??????_?????>42</??????_?????>
Or, in a programming language, rather than this assignment statement:
              Number_Students = 42
a Bengali-speaking person can write this:
              ??????_????? = 42
That’s awesome.
But, but, but, … how come that universality doesn’t extend to digits?
How come we can only use these digits: 0 (hex 30), 1 (hex 31), …, 9 (hex 39)?
Why, for example, can’t a Bengali-speaking person use the Bengali digits: Bengali digit 0 (U+09E6), Bengali digit 1 (U+09E7), …, Bengali digit 9 (U+09EF)?
Why, for example, can’t a Bengali-speaking person create XML such as this:
<??????_?????>??</??????_?????>
or write a program assignment statement like this:
              ??????_????? = ??
Let me explain why I assert that the Bengali-speaking person “cannot” do that.
Numbers in an XML document or in a program are just strings and, to perform arithmetic operations on them, those string numbers must be converted to actual numbers. I looked at the source code for the C function (strtol) that converts strings to numbers and here is the key to how it converts a character digit to a number digit:
              digit_number = digit_character - '0'
Yikes!
That generates a number digit by treating the character digit as a number and subtracting the number corresponding to the character ‘0’. For example, if the character digit is ‘4’ (hex 34) then when we subtract ‘0’ (hex 30) we get the number 4. Perfect! But ... only if we allow European digits (0, 1, …, 9). Clearly, if we were to subtract ‘0’ (hex 30) from the Bengali digit 4 we do not get the number 4.
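
The arithmetic in the paragraph above can be checked with a short sketch. It is written in Python rather than C purely for brevity, and the function name ascii_digit_value is made up for this illustration; it is not taken from any strtol source.

```python
def ascii_digit_value(ch: str) -> int:
    """Mimic `digit_character - '0'`: subtract the code point of ASCII zero."""
    return ord(ch) - ord("0")

# ASCII DIGIT FOUR (U+0034): the subtraction yields the digit's value.
print(ascii_digit_value("4"))

# BENGALI DIGIT FOUR (U+09EA): the same subtraction yields a large offset,
# not the value four.
print(ascii_digit_value("\u09ea"))

# Subtracting BENGALI DIGIT ZERO (U+09E6) instead does give four, so only the
# choice of zero character is ASCII-specific, not the technique itself.
print(ord("\u09ea") - ord("\u09e6"))
```

The second call returns 0x09EA minus 0x30, which is far outside the range 0 to 9.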
Thus I conclude:

  *   When expressing numbers, the only digits that can be used are the European digits
  *   Unicode is universal, but that universality does not apply to digits or numbers
Obviously I am not understanding something correctly. Please help me to understand.
/Roger


From marius.spix at web.de  Wed Dec 16 07:56:58 2020
From: marius.spix at web.de (Marius Spix)
Date: Wed, 16 Dec 2020 14:56:58 +0100
Subject: Aw: Re: 1ˢᵗ, 2ⁿᵈ, 3ʳᵈ, 4ᵗʰ … 9ᵗʰ
In-Reply-To: <2171140.htiGsxgcq4@laptop>
References: <9137826.KFeHLySHN7@laptop> <2171140.htiGsxgcq4@laptop>
Message-ID: <trinity-a957031b-b041-42cc-b8a6-e2ef3762caba-1608127018000@3c-app-webde-bap50>

This is similar to the mathematical italic letters, where 𝑓 and 𝑥 have very special appearances depending on the font used; 𝑓(𝑥) (function of x) is a very common character sequence in mathematical contexts. For some reason 𝓵 (U+1D4F5) and ℓ (U+2113) are different characters, because the latter is also used for the unit litre.


From harjitmoe at outlook.com  Wed Dec 16 08:50:56 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Wed, 16 Dec 2020 14:50:56 +0000
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <F04B5881-567E-4EB3-9D5D-C2DA9E7E5261@bahnhof.se>
 <CAM+ijLg_Yr45OOX=f1=HLjyW+5RQrQCYmLvAhy4RBfwPgh-PPg@mail.gmail.com>
 <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
Message-ID: <VI1PR0701MB2413ECAF2A1433E707DB621BB7C50@VI1PR0701MB2413.eurprd07.prod.outlook.com>

> For those of us that can recall the exuberance of the XHTML movement,
> <i>, <b> and friends were all deemed to be insufficiently semantic and
> slated to be replaced by <em> and <strong>.  Of course, this was a
> distinction without a difference and now we just have extra tags that
> are more verbose and less literal.

Not strictly speaking – although <i> and <b> are back in vogue, <i> is now 
only supposed to be used for italics which set text apart in some other 
fashion as opposed to emphasising it (which should still be done with 
<em>). The distinction may appear “without a difference” for 
graphically displaying text in visual clients, but they can represent 
considerably different tone changes when reading it out (a relevant 
consideration if you are writing, say, an aural client for the visually 
impaired), hence using these properly is /theoretically/ more 
accessible, though I do not know to what extent that is true in practice 
since there's bound to be a lot of deployed legacy, WYSIWYG or 
generated-from-Markdown-etc HTML which doesn't make this distinction, 
which might preclude relying on it.

–Har


From doug at ewellic.org  Wed Dec 16 09:40:15 2020
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 16 Dec 2020 08:40:15 -0700
Subject: RE: Unicode is universal, so how come that universality doesn’t
 apply to digits?
In-Reply-To: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
Message-ID: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>

What I don't understand here is why this is being framed implicitly as a Unicode problem, or an XML problem, or a general law of nature ("why can’t a Bengali-speaking person use the Bengali digits"), instead of an inherent limitation of that particular library function used for that particular language.

One could easily extend strtol() to accept a string of characters with a General_Category of "Nd", and use the Numeric_Value property of each character to get its numeric value instead of subtracting 48 (ASCII '0').

Of course, in order to do that, the Unicode properties General_Category and Numeric_Value must be available to the conversion function. The C language and its standard libraries are optimized for speed and size, and are still chosen to this day when speed and size are at a premium. Operating only on ASCII '0' through '9' and subtracting ASCII '0' to get the numeric value is much faster and lighter-weight than table lookup. ICU probably provides a method to do this in C.
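
As a rough illustration of the property-based approach described above, here is a minimal sketch in Python, whose standard unicodedata module exposes General_Category and decimal digit values. The function name parse_decimal is hypothetical, and the sketch makes no attempt to handle signs, whitespace or bases the way strtol() does.

```python
import unicodedata

def parse_decimal(text: str) -> int:
    """Convert a string of Unicode decimal digits (General_Category Nd) to an int."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        # Accept only characters whose General_Category is Nd.
        if unicodedata.category(ch) != "Nd":
            raise ValueError(f"not a decimal digit: {ch!r}")
        # Look up the digit's value in the property data instead of
        # subtracting ASCII '0'.
        value = value * 10 + unicodedata.decimal(ch)
    return value

print(parse_decimal("42"))            # ASCII digits
print(parse_decimal("\u09ea\u09e8"))  # BENGALI DIGIT FOUR, BENGALI DIGIT TWO
```

Note that this sketch accepts digits from different scripts mixed in one string; a stricter version would presumably reject that. Python's built-in int() already accepts non-ASCII decimal digits, so the loop is spelled out here only to show the table lookup.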

A good follow-up question for me is why the heavier-weight C# and .NET Framework (Core, Standard) also don't support non-ASCII digits in the Convert.ToInt32() method, even when the string of digits is all from the same script (unlike your mixed Bengali/Oriya example), and even when the appropriate locale is specified as a parameter. C# compiles to intermediate code and runs in an interpreter, and has huge libraries available to it, including all of the Unicode properties, so the "speed and size" constraints don't apply as much.

But this is still a characteristic of the code libraries, not a Unicode problem.

--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org


From wjgo_10009 at btinternet.com  Wed Dec 16 10:02:00 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 16 Dec 2020 16:02:00 +0000 (GMT)
Subject: Re: Unicode is universal, so how come that universality doesn’t
 apply to digits?
In-Reply-To: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
Message-ID: <e3ae054.65b.1766c49702b.Webtop.222@btinternet.com>


Hi

Well, could progress be made if Unicode Inc. made available a pseudo-code 
algorithm, convertible to various programming languages, in which a digit 
is derived from the text characters using a structure of the form

if (digit_character >= 'A') AND (digit_character <= 'B') then 
digit_number := digit_character - 'C'

elsif (digit_character >= 'D') AND (digit_character <= 'E') then 
digit_number := digit_character - 'F'

elsif ...

.
.
.

elsif ...

end;

where A, B, C, D etc in the above are here each a placeholder for a 
Unicode character for the start and end of a range of digit characters 
as appropriate?

Would that do it? Assuming that compiler manufacturers used the 
algorithm, converted as appropriate! :-)

The algorithm would be written once and then updated as needed by Unicode Inc., 
and would then be applicable throughout many programming languages.
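
A minimal Python sketch of that range-based structure, with the if/elsif chain replaced by a table, might look like the following. The table DIGIT_RANGES and the function digit_value are names invented for this illustration, and only four of the many digit ranges are listed; in each range the first code point is the zero, so it plays the role of both 'A' and 'C' in the pseudo-code.

```python
# (first, last) code points of some decimal digit ranges; illustrative, not complete.
DIGIT_RANGES = [
    (0x0030, 0x0039),  # ASCII digits
    (0x0660, 0x0669),  # Arabic-Indic digits
    (0x0966, 0x096F),  # Devanagari digits
    (0x09E6, 0x09EF),  # Bengali digits
]

def digit_value(ch):
    """Return the value 0..9 of a digit character, or None if it is not in the table."""
    cp = ord(ch)
    for first, last in DIGIT_RANGES:
        if first <= cp <= last:
            # Subtract the zero of the matching range.
            return cp - first
    return None

print(digit_value("7"))
print(digit_value("\u09ec"))  # BENGALI DIGIT SIX
print(digit_value("x"))
```

Keeping the ranges in data rather than in code means an update would only touch the table.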

Best regards,

William Overington

Wednesday 16 December 2020



From frederic.grosshans at gmail.com  Wed Dec 16 11:34:55 2020
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Wed, 16 Dec 2020 18:34:55 +0100
Subject: Re: Unicode is universal, so how come that universality doesn’t
 apply to digits?
In-Reply-To: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
Message-ID: <a356bb22-4cc4-e769-57b6-d6a33a947625@gmail.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201216/38b3bf58/attachment.htm>

From doug at ewellic.org  Wed Dec 16 12:05:52 2020
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 16 Dec 2020 11:05:52 -0700
Subject: Italics get used to express important semantic meaning,
 so unicode should support them
In-Reply-To: <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
Message-ID: <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>

abrahamgross wrote:

>> Children learn to write with upper case and lower case letters in
>> school, and most people continue to use both as adults. (There are
>> exceptions of course, some people write only with lower case, and
>> some only with upper case.)
>
> Unicode refused to encode arabic letter variants (not counting
> compatibility chars), which are taught in school and adults use it,
> and its how arabic is written, so ur argument here doesn't hold water.

I'm not sure what to make of that sentence. That's like saying "Unicode refused to encode the capital letter A (not counting U+0041)."

The compatibility characters are exactly how one is supposed to represent Arabic letter forms outside of their normal context, as described here.

--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org


From abrahamgross at disroot.org  Wed Dec 16 12:11:52 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 16 Dec 2020 18:11:52 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <C40CD884-F1F6-45BD-8A28-574D0B47B432@gmail.com>
References: <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
 <C40CD884-F1F6-45BD-8A28-574D0B47B432@gmail.com>
Message-ID: <f2e17512-804e-4a74-9926-17f46ca7da0d@disroot.org>

Can't Unicode just make an edit and say that the mathematical italic letters can be used for regular English too? (The character names can stay as MATHEMATICAL ITALIC etc., or aliases can be added.)

From frederic.grosshans at gmail.com  Wed Dec 16 12:50:40 2020
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Wed, 16 Dec 2020 19:50:40 +0100
Subject: Re: Unicode is universal, so how come that universality doesn’t
 apply to digits?
In-Reply-To: <a356bb22-4cc4-e769-57b6-d6a33a947625@gmail.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <a356bb22-4cc4-e769-57b6-d6a33a947625@gmail.com>
Message-ID: <bb502a7b-669a-3253-8c54-a0a12a941a3a@gmail.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201216/cd02f771/attachment.htm>

From frederic.grosshans at gmail.com  Wed Dec 16 12:55:37 2020
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Wed, 16 Dec 2020 19:55:37 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <f2e17512-804e-4a74-9926-17f46ca7da0d@disroot.org>
References: <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
 <C40CD884-F1F6-45BD-8A28-574D0B47B432@gmail.com>
 <f2e17512-804e-4a74-9926-17f46ca7da0d@disroot.org>
Message-ID: <01b67baa-d0b4-b321-4c4a-4d6c6d16fbff@gmail.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201216/bc6b652b/attachment.htm>

From abrahamgross at disroot.org  Wed Dec 16 12:58:51 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Wed, 16 Dec 2020 18:58:51 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <01b67baa-d0b4-b321-4c4a-4d6c6d16fbff@gmail.com>
References: <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
 <C40CD884-F1F6-45BD-8A28-574D0B47B432@gmail.com>
 <f2e17512-804e-4a74-9926-17f46ca7da0d@disroot.org>
 <01b67baa-d0b4-b321-4c4a-4d6c6d16fbff@gmail.com>
Message-ID: <adac7416-b049-41a0-b90c-e9fa61056b18@disroot.org>

Whoop, That makes sense

Dec 16, 2020 1:56:29 PM Frédéric Grosshans via Unicode <unicode at unicode.org>:

> And then, speaker of German languages will ask the encoding of italic ?, Icelandic speakers, ? and ?, French speakers, ? and ?, etc. Because special casing English is quite the opposite of the purpose of Unicode...
> 
> Frédéric
> 

From richard.wordingham at ntlworld.com  Wed Dec 16 13:23:09 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 16 Dec 2020 19:23:09 +0000
Subject: Unicode is universal, so how come that universality
 doesn’t apply to digits?
In-Reply-To: <e3ae054.65b.1766c49702b.Webtop.222@btinternet.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <e3ae054.65b.1766c49702b.Webtop.222@btinternet.com>
Message-ID: <20201216192309.6e9fea34@JRWUBU2>

On Wed, 16 Dec 2020 16:02:00 +0000 (GMT)
William_J_G Overington via Unicode <unicode at unicode.org> wrote:

> Hi
> 
> Well, is the way to make progress that Unicode Inc. could make
> available a pseudo-code algorithm that can be converted to various
> programming languages that is such that the way that a digit is
> derived from the text characters is an algorithm with a structure of
> the form
> 
> if (digit_character >= 'A') AND (digit_character <= 'B') then 
> digit_number := digit_character - 'C'
> 
> elsif (digit_character >= 'D') AND (digit_character <= 'E') then 
> digit_number := digit_character - 'F'
> 
> elsif ...

It looks to me as though some versions of wcstol() already accept a
sequence of decimal digits.  C11 allows such behaviour.  The simple
algorithm sketched here won't work for 8-bit char - ISCII Indian digits
and TIS-620 Thai digits overlap but do not coincide.  Thus for strtol(),
you would need to include the locale.

As Frédéric Grosshans has noticed, there is also the issue of
digit-sequence spoofing, besides variations of the letter 'O' being
harmful.  Not every call of strtol() parsing a digit string actually
checks that the offered string is in the form of a number.

Richard.


From richard.wordingham at ntlworld.com  Wed Dec 16 13:57:46 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 16 Dec 2020 19:57:46 +0000
Subject: Unicode is universal, so how come that universality
 doesn’t apply to digits?
In-Reply-To: <a356bb22-4cc4-e769-57b6-d6a33a947625@gmail.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <a356bb22-4cc4-e769-57b6-d6a33a947625@gmail.com>
Message-ID: <20201216195746.37c2237b@JRWUBU2>

On Wed, 16 Dec 2020 18:34:55 +0100
Frédéric Grosshans via Unicode <unicode at unicode.org> wrote:

> It’s quite easy to make a library which parses UnicodeData.txt
> (version 13.0 here) and extracts the digit ranges of the various
> scripts and converts the various strings into numbers for the 50
> scripts listed in table 22-3 of the standard plus the western digits
> (Unicode 13.0 pdf here). It should be reasonably future-proof, in the
> sense that parsing future Unicode data files should add scripts as they
> are encoded. However, do not forget to check the exceptions in the
> text around this table and in the relevant script pages: in Unicode
> 13.0, it concerns Arabic, which has two sets of digits, Myanmar (3
> sets), and Tai Tham (2 sets).

Or just scan UnicodeData.txt for decimal digits with the value 0.
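
A minimal sketch of that scan, in Python. It assumes the documented
semicolon-separated layout of UnicodeData.txt, where field 2 is the
general category and field 6 is the decimal digit value; the function
names digit_zeros and decimal_value are invented for this example. With
a real copy of the file, the open file object can be passed directly as
the lines argument.

```python
def digit_zeros(lines):
    """Return the code points of all decimal-digit zeros in UnicodeData.txt lines."""
    zeros = []
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        # Field 2 is the general category; field 6 is the decimal digit value.
        if fields[2] == "Nd" and fields[6] == "0":
            zeros.append(int(fields[0], 16))
    return zeros


def decimal_value(ch, zeros):
    """Return 0..9 if ch lies in a ten-digit run starting at a known zero, else None."""
    cp = ord(ch)
    for zero in zeros:
        if zero <= cp <= zero + 9:
            return cp - zero
    return None
```

Because every zero found this way starts a contiguous run of ten digits,
scripts with several digit sets (Arabic, Myanmar, Tai Tham) simply
contribute several zeros and need no special handling.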

Richard.


From marius.spix at web.de  Wed Dec 16 14:09:23 2020
From: marius.spix at web.de (Marius Spix)
Date: Wed, 16 Dec 2020 21:09:23 +0100
Subject: Aw: RE: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
Message-ID: <trinity-5bc955d3-52e6-4557-bff0-ab8dc8742e48-1608149363157@3c-app-webde-bs49>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201216/7a305190/attachment.htm>

From asmusf at ix.netcom.com  Wed Dec 16 15:32:10 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 16 Dec 2020 13:32:10 -0800
Subject: Aw: RE: Italics get used to express important semantic meaning,
 so unicode should support them
In-Reply-To: <trinity-5bc955d3-52e6-4557-bff0-ab8dc8742e48-1608149363157@3c-app-webde-bs49>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <trinity-5bc955d3-52e6-4557-bff0-ab8dc8742e48-1608149363157@3c-app-webde-bs49>
Message-ID: <effe34cf-0a1e-ebcd-2cdd-6ffded3bee64@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201216/cd3ba666/attachment.htm>

From billposer2 at gmail.com  Wed Dec 16 15:32:48 2020
From: billposer2 at gmail.com (Bill Poser)
Date: Wed, 16 Dec 2020 13:32:48 -0800
Subject: Re: Unicode is universal, so how come that universality
 doesn’t apply to digits?
In-Reply-To: <20201216195746.37c2237b@JRWUBU2>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <a356bb22-4cc4-e769-57b6-d6a33a947625@gmail.com>
 <20201216195746.37c2237b@JRWUBU2>
Message-ID: <CACPRsRRBr-5xUnrgWTZ68x8uH+SHAugEAW0CANHyWuws+s_xGg@mail.gmail.com>

It seems to me that, in spite of the superficial similarity of the way
numbers are written in many languages, this is NOT, in general, a matter of
encoding conversion or even transliteration but rather one of translation
and therefore not part of Unicode for the same reason that Unicode does not
handle the translation of text from, say, Japanese to English.

There is, actually, a library, which I have written, that handles
conversions between Unicode strings and integers for most systems of
writing numbers. (I have yet to update it to handle some of the more
recently encoded systems.) It is a C library which also has a TCL binding:

http://billposer.org/Software/libuninum.html

It handles a number of systems that require algorithms rather different
from that of atoi/strtol.

Bill


On Wed, Dec 16, 2020 at 12:04 PM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Wed, 16 Dec 2020 18:34:55 +0100
> Frédéric Grosshans via Unicode <unicode at unicode.org> wrote:
>
> > It’s quite easy to make a library which parses UnicodeData.txt
> > (version 13.0 here) and extracts the digit ranges of the various
> > scripts and converts the various strings into numbers for the 50
> > scripts listed in table 22-3 of the standard plus the western digits
> > (Unicode 13.0 pdf here). It should be reasonably futureproof, in the
> > sense that parsing future Unicode data files should add scripts as they
> > are encoded. However, do not forget to check the exceptions in the
> > text around this table in the relevant script pages: in Unicode
> > 13.0, it concerns Arabic, which has two sets of digits, Myanmar (3
> > sets), and Tai Tham (2 sets).
>
> Or just scan UnicodeData.txt for decimal digits with the value 0.
>
> Richard.
>
>

From kent.b.karlsson at bahnhof.se  Wed Dec 16 18:46:33 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 17 Dec 2020 01:46:33 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them 
In-Reply-To: <CAM+ijLg_Yr45OOX=f1=HLjyW+5RQrQCYmLvAhy4RBfwPgh-PPg@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <F04B5881-567E-4EB3-9D5D-C2DA9E7E5261@bahnhof.se>
 <CAM+ijLg_Yr45OOX=f1=HLjyW+5RQrQCYmLvAhy4RBfwPgh-PPg@mail.gmail.com>
Message-ID: <F8EA359A-C425-4E0C-B6D7-73B4A4AF72C0@bahnhof.se>


> 16 dec. 2020 kl. 02:14 skrev Sławomir Osipiuk via Unicode <unicode at unicode.org>:
> 
> On Tue, Dec 15, 2020 at 6:07 PM Kent Karlsson
> <kent.b.karlsson at bahnhof.se> wrote:
>> Now, where did I see something very much like this???
>> Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the “invisible by default” (“default ignorable”) IF parsed correctly).
> 
> ECMA-48 aka ISO 6429 was on my mind the moment I read the OP. I didn't
> mention it because it's a bit outdated (even if I do have a fondness

It is certainly not outdated. It’s a long time since the last update, but it is not outdated.
It is used in EVERY terminal emulator (worthy of the name), granted to varying degrees
and varying quality of implementation (but that is another matter). Italics, bold, underline
and colouring are popular uses of the formatting part of ECMA-48 in terminals.
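
For readers unfamiliar with the formatting part of ECMA-48, here is a
small Python sketch of the SGR (Select Graphic Rendition) control
sequences that terminals use for bold, italics and underline. The
parameter values 0, 1, 3 and 4 are the ones defined by ECMA-48; the
helper name styled is made up for this example, and how well a given
terminal emulator renders each attribute varies.

```python
ESC = "\x1b"
CSI = ESC + "["  # 7-bit form of the Control Sequence Introducer

# SGR (Select Graphic Rendition) parameters defined in ECMA-48.
SGR_RESET = 0
SGR_BOLD = 1
SGR_ITALIC = 3
SGR_UNDERLINE = 4


def styled(text, *params):
    """Wrap text in an SGR sequence that sets params, then resets them."""
    start = CSI + ";".join(str(p) for p in params) + "m"
    return start + text + CSI + str(SGR_RESET) + "m"
```

Printing styled("important", SGR_ITALIC) to a terminal that supports
italics should show the word in italics and then return to the default
rendition.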

One could imagine completely reinventing how terminals (i.e. terminal emulators nowadays)
work. But that would face massive compatibility issues. My projection is that 1) terminal
emulators will continue to be used indefinitely, and 2) they will continue to use ECMA-48
or an extension thereof (there are already some extensions that have been implemented).

(That is as opposed to Teletext, which is still very much used in practice, but I think that may
change in five or ten years’ time.)

> for it) and if you're using such a thing, why not a more modern HTML
> subset, or BBCode, or any number of other options in use or from the
> list the OP gave? There are, after all, so many to choose from. And if

Because:
1) They would be incompatible with how terminals work.
2) They cannot work for terminals since there is no clear distinction between what is “markup”
and what is not; the distinction today much relies on file type (via name suffix or other mechanism,
like document setting or view mode, or “guessing” from reading the beginning of the document).
Those mechanisms do not exist in terminals.

> none of those satisfy, you can always make your own!

Again, if one were to invent something entirely new (not based on ECMA-48) in this area that still has
the potential to be used in terminals, that would face massive compatibility issues with how terminals
work today and are expected to work “from the other side of the terminal” (i.e. what programs send to
the terminal side). (Yes I know about termcap.)

> But that "if parsed correctly" is quite the nit, isn't it?

If every terminal (emulator) can handle it (granted, to varying degrees of quality), it does not seem
too hard?

> 
>> It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the “default ignorable” part of this a bit easier.
> 
> And this is just the BabelPad solution but applied to a different
> protocol. Replacing regular markup by corresponding characters from
> the tag block to gain ignorable-ness may seem like a cool idea at
> first, but it's just spinning yet another markup. (With no offense

In a sense, yes. But the idea to use TAG characters for this has popped up on this list
multiple times. So if mapping ECMA-48-ish control sequences to use TAG characters
makes ECMA-48-ish formatting control sequences more palatable, then ok.
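
The mapping floated here can be sketched in Python. This is purely
hypothetical: no standard defines ECMA-48 control sequences spelled in
TAG characters, and the function names to_tags and from_tags are
invented for this illustration. The sketch only uses the fact that the
TAG characters U+E0020..U+E007E mirror printable ASCII U+0020..U+007E at
a fixed offset.

```python
TAG_OFFSET = 0xE0000  # U+E0020..U+E007E mirror ASCII U+0020..U+007E


def to_tags(ascii_text):
    """Map printable ASCII characters to the corresponding TAG characters."""
    out = []
    for ch in ascii_text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"no TAG counterpart for {ch!r}")
        out.append(chr(cp + TAG_OFFSET))
    return "".join(out)


def from_tags(tag_text):
    """Inverse of to_tags: map TAG characters back to printable ASCII."""
    out = []
    for ch in tag_text:
        cp = ord(ch) - TAG_OFFSET
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"not a TAG character: U+{ord(ch):04X}")
        out.append(chr(cp))
    return "".join(out)
```

Under this scheme the printable tail of an SGR sequence such as "[3m"
would travel as three default-ignorable TAG characters instead of three
visible ASCII characters.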

/Kent Karlsson


From kent.b.karlsson at bahnhof.se  Wed Dec 16 18:46:47 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 17 Dec 2020 01:46:47 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them 
In-Reply-To: <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <F04B5881-567E-4EB3-9D5D-C2DA9E7E5261@bahnhof.se>
 <CAM+ijLg_Yr45OOX=f1=HLjyW+5RQrQCYmLvAhy4RBfwPgh-PPg@mail.gmail.com>
 <CABWuLVeffgXCmARPUb5TeUbp9rgcv7V-WAYMyn5v-Y163G=E9g@mail.gmail.com>
Message-ID: <6F848713-997C-4D81-A326-F37848F3FEF8@bahnhof.se>


> 16 dec. 2020 kl. 04:18 skrev Zach Lym <indolering at gmail.com>:

> But that begs the question: if the authors of a rich text standard
> can't agree on what counts as semantic, how would Unicode decide?

Eeh, 

file.html would be an HTML file, intending to interpret HTML tags as markup
(unless in a view tags mode of display) for programs/apps that can interpret HTML markup,
and regarding RTF or other non-HTML markup as plain text.

file.rtf would be an RTF file, intending to interpret RTF markup (unless in a view markup
mode of display) for programs/apps that can interpret RTF markup, and regarding 
HTML or other non-RTF markup as plain text.

and so on.

There are several other ways of indicating the file “type”, but filename suffix is the
most obvious method that is used.

So what was the problem, did you say?

> What about <mark>, <strikethrough>, or as I previously suggested
> <blink>?  <blink> was added to HTML because it was the only
> styling that could be displayed in plaintext console environments.

I’m not sure that history is correct. Anyhow, for terminals (emulators nowadays)
underline, bold, italic, and coloring (also in combination) is commonly available.
(Even when terminals were monochrome, underline and bold could still be done,
even if done in a non-standard way.)

Blink is often suppressed in modern terminal emulators (but then can be enabled by
a preference setting).

/Kent Karlsson


From kent.b.karlsson at bahnhof.se  Wed Dec 16 18:47:02 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 17 Dec 2020 01:47:02 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them 
In-Reply-To: <416d509b-b97c-5153-ec4c-aae451570919@ix.netcom.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
 <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>
 <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
 <CAM+ijLjkNdAB15VykfVOsz7zGk-UqAd2Urzp+-O5w4d9w-LfdA@mail.gmail.com>
 <CAMZ=zj6hBhzAgvd7Qx2ZUtAXCRNEdf-RpHxcNL6j1Bzdw+85Nw@mail.gmail.com>
 <416d509b-b97c-5153-ec4c-aae451570919@ix.netcom.com>
Message-ID: <94D07690-1B8D-4617-9D1A-6ABD164BA07F@bahnhof.se>


> 16 dec. 2020 kl. 06:49 skrev Asmus Freytag via Unicode <unicode at unicode.org>:
> 
> On 12/15/2020 8:19 PM, David Starner via Unicode wrote:
>> On Tue, Dec 15, 2020 at 4:47 PM Sławomir Osipiuk via Unicode
>> <unicode at unicode.org> <mailto:unicode at unicode.org> wrote:
>>> "Implementations of Unicode that already make use of out-of-band
>>> mechanisms for language [or format] tagging or “heavy-weight” in-band
>>> mechanisms such as XML or HTML will continue to do exactly what they
>>> are doing and will ignore the tag characters completely. They may even
>>> prohibit their use to prevent conflicts with the equivalent markup."
>> So every single thing that interfaces with HTML now has to handle
>> Unicode italics on any plain text input, or silently dump them into
>> the stream, and the web browser may have to handle them or not.
> ^^^That.

Let me paraphrase:

“So every single thing that interfaces with HTML now has to handle RTF italics on any plain text input,
or silently dump them into the stream, and the web browser may have to handle them or not.”

You would not use that as an argument to say that RTF (which I picked just because it is well-known)
should be wiped from the face of Earth? I would think not? (You may want to wipe RTF from the face
of the Earth, I don’t know, but you would not use that argument even if you do want that.)

Even if, in these threads, the term “plain text formatting” is used (or worse “Unicode formatting”), that
is a bit misleading (of course). I don’t think these proposals should be applied to text data of the “type”
“text/plain” (or as a filename suffix, “.txt”), nor such things as filenames themselves, and of course not to
“text/html”/“.html”, nor to “application/pdf”/“.pdf”, nor to “application/rtf”/“.rtf”, etc. One should be using (a)
new file type(s), POSSIBLY (if one can agree on a single one) even apply it to “text/plain”/“.txt” (but not
to HTML, RTF, etc., and not (I would say) to filenames or similar, such markup should not even be
permitted in filenames and similar; note: “should...”, not “are...”).

The point being that the markup would be default-ignorable, and thus normally “invisible” when not
interpreted, even in a “plain” text file. Granted, the ECMA-48 approach (if not mapping to TAG
characters) would need a bit of “extending” the default-ignorability property to certain follow-on
characters (that normally are printable) after ESC and CSI (terminal emulators do that all the time,
and have done so for decades, so it is nothing revolutionary). That is, that the markup does not “hijack”
normal printable characters for its markup syntax; if ECMA-48 had been done today I think it would use
default-ignorable characters throughout the ESC- and CSI-sequences, not just for the lead character.
(Plus, I think that no use of out-of-band stylesheets is also a point. Plus that some argue for excessive
“bare-boned-ness”; but I don’t agree with that.)

That is my take on this issue at least.

----
> hardcoding visual appearance is really the least helpful, because that totally
> undercuts the ability for style sheets to address presentation.
Yes, but… Re. ECMA-48 (which we touched on in this thread), there the styling is really
“hardcoded”, and there are no style sheets. For ECMA-48 (which is still very much in use,
and extensions are being implemented), I don’t think it would be a good idea to introduce
any (separate) style sheets of any kind. It is not at all geared for that, and re-gearing it for
that would not be a good idea (IMHO). Similarly for any “plain text” (“low level”, really)
formatting proposal other than ECMA-48. But for HTML and similar, fine; stylesheets are great!

/Kent Karlsson


From mark at kli.org  Wed Dec 16 19:16:48 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 16 Dec 2020 20:16:48 -0500
Subject: Re: Unicode is universal, so how come that universality
 doesn’t apply to digits?
In-Reply-To: <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
Message-ID: <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201216/3185d2b4/attachment.htm>

From prosfilaes at gmail.com  Wed Dec 16 23:36:53 2020
From: prosfilaes at gmail.com (David Starner)
Date: Wed, 16 Dec 2020 21:36:53 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <94D07690-1B8D-4617-9D1A-6ABD164BA07F@bahnhof.se>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <002401d6d110$db4ac280$91e04780$@gmail.com>
 <0bd2d94e-6e98-0509-afa4-04a2d0792054@ix.netcom.com>
 <000c01d6d23a$ee6a5350$cb3ef9f0$@gmail.com>
 <a9ed40b9-60e4-270e-0fd2-52ae1d01e0ff@kli.org>
 <CAM+ijLhNXTeAKPHmEazqiTXKJ=k9o9b0QVD8fa91mFF90r5LNw@mail.gmail.com>
 <8e736480-305c-8f63-2a7a-d31d7eab8c5f@kli.org>
 <CAM+ijLjkNdAB15VykfVOsz7zGk-UqAd2Urzp+-O5w4d9w-LfdA@mail.gmail.com>
 <CAMZ=zj6hBhzAgvd7Qx2ZUtAXCRNEdf-RpHxcNL6j1Bzdw+85Nw@mail.gmail.com>
 <416d509b-b97c-5153-ec4c-aae451570919@ix.netcom.com>
 <94D07690-1B8D-4617-9D1A-6ABD164BA07F@bahnhof.se>
Message-ID: <CAMZ=zj6uKjHhmCWhAP4RSEWAkirHwBgnSXQ3_nOA-Ybuco1cOA@mail.gmail.com>

On Wed, Dec 16, 2020 at 4:54 PM Kent Karlsson via Unicode
<unicode at unicode.org> wrote:
> On 12/15/2020 8:19 PM, David Starner via Unicode wrote:
>
>> On Tue, Dec 15, 2020 at 4:47 PM Sławomir Osipiuk via Unicode
>> <unicode at unicode.org> wrote:
>
>>> "Implementations of Unicode that already make use of out-of-band
>>> mechanisms for language [or format] tagging or “heavy-weight” in-band
>>> mechanisms such as XML or HTML will continue to do exactly what they
>>> are doing and will ignore the tag characters completely. They may even
>>> prohibit their use to prevent conflicts with the equivalent markup."
>
>> So every single thing that interfaces with HTML now has to handle
>> Unicode italics on any plain text input, or silently dump them into
>> the stream, and the web browser may have to handle them or not.
>
> Let me paraphrase:
>
> “So every single thing that interfaces with HTML now has to handle RTF italics on any plain text input,
> or silently dump them into the stream, and the web browser may have to handle them or not.”
>
> You would not use that as an argument to say that RTF (which I picked just because it is well-known)
> should be wiped from the face of Earth? I would think not? (You may want to wipe RTF from the face
> of the Earth, I don’t know, but you would not use that argument even if you do want that.)

I wouldn't use that argument because it makes no sense. RTF and HTML
are at the same level. Plain text (and Unicode specifically, for HTML)
are at a lower, underlying level. If you want to make another rich
text format, it's no skin off my nose. It is completely off-topic on
this list, though. This list is about Unicode and changes thereto.

>  Similarly for any “plain text” (“low level”, really)
> formatting proposal other than ECMA-48.

Exactly. They're not "plain text". So why are low-level formatting
proposals relevant to this list at all? ECMA-48 is not plain text.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)


From duerst at it.aoyama.ac.jp  Thu Dec 17 02:22:14 2020
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 17 Dec 2020 17:22:14 +0900
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
Message-ID: <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>

On 17/12/2020 03:05, Doug Ewell via Unicode wrote:
> abrahamgross wrote:
> 
>>> Children learn to write with upper case and lower case letters in
>>> school, and most people continue to use both as adults. (There are
>>> exceptions of course, some people write only with lower case, and
>>> some only with upper case.)
>>
>> Unicode refused to encode arabic letter variants (not counting
>> compatibility chars), which are taught in school and adults use it,
>> and its how arabic is written, so ur argument here doesn't hold water.
> 
> I'm not sure what to make of that sentence. That's like saying "Unicode refused to encode the capital letter A (not counting U+0041)."
> 
> The compatibility characters are exactly how one is supposed to represent Arabic letter forms outside of their normal context, as described here.

Not necessarily. The 'official' way of representing specific contextual 
Arabic letter forms outside of their usual context is to prefix or 
postfix them with the appropriate JOINER or NON-JOINER characters. So 
there is indeed a non-compatibility encoding for these letter variants 
in Unicode, even if they appear out of context.
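
As a concrete illustration of the joiner technique described above, the
following Python sketch builds the plain-text spellings for the four
forms of a dual-joining Arabic letter. ZWJ (U+200D) and ZWNJ (U+200C)
are the real Unicode joining controls; the function name
contextual_forms is invented here, and whether the requested form is
actually displayed depends on the font and shaping engine in use.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
BEH = "\u0628"   # ARABIC LETTER BEH, a dual-joining letter


def contextual_forms(letter):
    """Return plain-text spellings that request each form of a dual-joining letter."""
    return {
        # ZWNJ on both sides prevents joining even next to other letters.
        "isolated": ZWNJ + letter + ZWNJ,
        # A following ZWJ makes the letter join to its left (as at word start).
        "initial": letter + ZWJ,
        # ZWJ on both sides makes the letter join on both sides.
        "medial": ZWJ + letter + ZWJ,
        # A preceding ZWJ makes the letter join to its right (as at word end).
        "final": ZWJ + letter,
    }
```

None of these spellings uses the Arabic Presentation Forms compatibility
characters; each consists of the ordinary letter plus joining controls.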

What's of course more important is that in their usual context (and 
that's the way they are usually taught and used), these contextual 
variants don't need to be encoded because both humans and computers can 
do the shaping 'automatically'.

Neither something like JOINERS, nor context work as well for the upper 
case / lower case distinction, and that's why it's fair to say that one 
reason for encoding this distinction (in Unicode as well as in many 
predecessor encodings) is that the distinction is learned in school and 
made in handwriting.

Regards,   Martin.

From abrahamgross at disroot.org  Thu Dec 17 08:41:28 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 17 Dec 2020 14:41:28 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
Message-ID: <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>

Microsoft Word does a very good job of auto-capitalizing, so the same internal dictionary that Word uses can also be used by OpenType to shape lowercase into uppercase. For the edge cases where you want uppercase when it doesn't automatically do it, you can use OpenType alternate variants, or some other <capital letter goes here> char (like the joiners or something).

Dec 17, 2020 3:22:29 AM Martin J. Dürst <duerst at it.aoyama.ac.jp>:

> Neither something like JOINERS, nor context work as well for the upper case / lower case distinction, and that's why it's fair to say that one reason for encoding this distinction (in Unicode as well as in many predecessor encodings) is that the distinction is learned in school and made in handwriting.
> 


From asmusf at ix.netcom.com  Thu Dec 17 13:28:22 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 17 Dec 2020 11:28:22 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
Message-ID: <4c63b13d-59bd-fd87-4a56-bb92a242691c@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201217/6a2388ed/attachment.htm>

From chaw at eip10.org  Fri Dec 18 10:49:26 2020
From: chaw at eip10.org (Sudarshan S Chawathe)
Date: Fri, 18 Dec 2020 11:49:26 -0500
Subject: Interpretation of emoji-ordering-rules.txt
Message-ID: <13627.1608310166@localhost>

I would be grateful if someone could point me to a good reference for
the syntax and semantics of the rules used to describe the emoji
ordering at the following:

  https://www.unicode.org/emoji/charts-13.1/emoji-ordering-rules.txt

Regards,

-chaw


From markus.icu at gmail.com  Fri Dec 18 11:42:35 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 18 Dec 2020 09:42:35 -0800
Subject: Interpretation of emoji-ordering-rules.txt
In-Reply-To: <13627.1608310166@localhost>
References: <13627.1608310166@localhost>
Message-ID: <CAN49p6qc+90TLkQaU1OQRs-xHodqVauOf-trFBjKu7-xr9UHJg@mail.gmail.com>

On Fri, Dec 18, 2020 at 9:06 AM Sudarshan S Chawathe via Unicode <
unicode at unicode.org> wrote:

> I would be grateful if someone could point me to a good reference for
> the syntax and semantics of the rules used to describe the emoji
> ordering at the following:
>
>   https://www.unicode.org/emoji/charts-13.1/emoji-ordering-rules.txt


Overview: https://www.unicode.org/reports/tr51/#Sorting

Collation tailoring syntax:
https://www.unicode.org/reports/tr35/tr35-collation.html#Rules

The emoji ordering is also provided as part of CLDR:
https://github.com/unicode-org/cldr/blob/master/common/collation/root.xml#L950

And like much of CLDR that is then available via the ICU C/C++/Java
libraries, via a Collator for language tag "und-u-co-emoji".
http://site.icu-project.org/
https://unicode-org.github.io/icu/userguide/collation/
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1Collator.html
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/Collator.html

Best regards,
markus

From copypaste at kittens.ph  Fri Dec 18 21:37:00 2020
From: copypaste at kittens.ph (Fredrick Brennan)
Date: Fri, 18 Dec 2020 19:37:00 -0800
Subject: Adlam
Message-ID: <176791270b7.10fb4e49546040.1392071428347243955@kittens.ph>

Often when other scripts are discussed, Adlam is used comparatively, as in, "well we want to avoid what happened with Adlam", or "we have had painful experience with this with Adlam".

I know some of the issues in Adlam, but if anyone has the time, I (and hopefully others!) would benefit from a retelling of the "Adlam in Unicode" story. I know in the end it's a very happy story, but I'm especially curious about the bumps along the road.

Best,
Fred Brennan

From otto.stolz at uni-konstanz.de  Sat Dec 19 06:42:33 2020
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Sat, 19 Dec 2020 13:42:33 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
Message-ID: <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>

Hello,

am 2020-12-17 um 15:41 schrieb abrahamgross--- via Unicode:
> Microsoft Word does a very good job of auto-capitalizing, so the same internal dictionary that Word uses can also be used by OpenType to shape lowercase into uppercase. For the edge cases where you want uppercase when it doesn't automatically do it, you can use OpenType alternate variants, or some other <capital letter goes here> char (like the joiners or something).

Whatever MS Word does, it cannot decide the correct spelling in many 
cases, as the casing may well make a semantic difference.

For example, you may well serve a turkey for dinner, but never a Turkey.
A notorious German example:
   Er hat in Moskau liebe Genossen. (= He’s got dear comrades in Moscow)
   Er hat in Moskau Liebe genossen. (= He has enjoyed love in Moscow)
   (And I assure you, the prosody varies accordingly, hence the
   difference is quite clear in speech, and must be preserved
   in writing.)

As only the author (and no other stage, be it human or automatic) can 
know the intended meaning, Unicode is quite right when encoding the case 
distinction.

Best wishes,
   Otto

From prosfilaes at gmail.com  Sun Dec 20 01:23:31 2020
From: prosfilaes at gmail.com (David Starner)
Date: Sat, 19 Dec 2020 23:23:31 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
Message-ID: <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>

On Sat, Dec 19, 2020 at 4:49 AM Otto Stolz via Unicode
<unicode at unicode.org> wrote:
> A notorious German example:
>    Er hat in Moskau liebe Genossen. (= He’s got dear comrades in Moscow)
>    Er hat in Moskau Liebe genossen. (= He has enjoyed love in Moscow)
>    (And I assure you, the prosody varies accordingly, hence the
>    difference is quite clear in speech, and must be preserved
>    in writing.)

She _loves_ him !?! (= I can't believe her emotion towards him is love.)
She loves _him_ !?! (= I can't believe that he is the one she loves,
and not someone else.)

And the prosody varies accordingly, and any accurate preservation in
writing would need to record the difference.

> As only the author (and no other stage, be it human or automatic) can
> know the intended meaning, Unicode is quite right when encoding the case
> distinction.

Meh. I could come up with similar examples, though probably a bit more
contrived, for just about every bit of markup. Italics/emphasis has a
bunch of pretty clear meaning changes, like the example above,
possibly more than casing in English. Fraktur/Antiqua mixing allows
for any number of examples; "<fraktur>Er was</fraktur> clever." is
different from "<fraktur>Er was clever</fraktur>".* Casing certainly
had more of an argument to be encoded in the character set than
italics, historically, but I can imagine an alternate history, maybe
one in which the leaders in computing history used a non-casing script, where
casing was relegated to markup, and a lot of issues would be
easier--no more problems with case-insensitive matching, and the
Turkish i would be a font difference under markup.
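
As an illustration of the case-insensitive matching problem mentioned
above, here is a minimal Python sketch. It assumes Python 3, whose
built-in str.lower() is locale-independent; turkish_lower is a
hypothetical helper named only for this example and does not come from
the thread or from any library.

```python
# Python's built-in case mappings follow the default Unicode rules,
# not the Turkish ones, because they take no locale into account.

def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase using the Turkish dotted/dotless i rules."""
    # In Turkish, U+0130 (capital I with dot) lowercases to "i",
    # and U+0049 (plain capital I) lowercases to U+0131 (dotless i).
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

default_lower = "ISPARTA".lower()      # default rules: I -> i
turkish = turkish_lower("ISPARTA")     # Turkish rules: I -> dotless i

# The two results differ, so a case-insensitive comparison has to know
# which language's rules apply.
print(default_lower == turkish)
```

With casing encoded in the character set, this kind of language
knowledge has to live in the matching code, which is the cost the
paragraph above alludes to.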

* Italics marking in English could serve the same role in making a
bunch of examples; e.g. "The French man said to stop at the coin" and
"The French man said to stop at the <i>coin</i>." mean different
things.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)


From indolering at gmail.com  Sun Dec 20 13:55:53 2020
From: indolering at gmail.com (Zach Lym)
Date: Sun, 20 Dec 2020 11:55:53 -0800
Subject: =?UTF-8?Q?Re=3A_Unicode_is_universal=2C_so_how_come_that_universal?=
 =?UTF-8?Q?ity_doesn=E2=80=99t_apply_to_digits=3F?=
In-Reply-To: <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
 <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>
Message-ID: <CABWuLVc7yzeYDgjgGbQQF5apNChxsSVygG8Q6f-zijzeevNm-g@mail.gmail.com>

I don't think it's fair to dismiss this as "not a unicode problem."  As the
OP pointed out, support for non-latin variable names is largely due to
Unicode's identity standard and extensive implementation advice.

The section on numbering (5.5) is only a page long and
essentially recommends handling decimal based numbering systems.  There
isn't nearly as much care given to this topic.  There is a standard annex
on mathematics, but that is in PDF form and is largely concerned with
parsing and display of mathematical formulas.

However, as is the answer to most questions, it is a matter of time and
money. If someone is willing to spend the time expanding 5.5 or writing a new
annex, I am sure the Unicode committee would be happy to review it.  Would
you be interested in doing that legwork?

I'm actually pretty new here, what's the best way Roger could contribute to
make Unicode better in this regard?

Thanks,
-Zach Lym

On Wed, Dec 16, 2020 at 5:23 PM Mark E. Shoulson via Unicode <
unicode at unicode.org> wrote:

> On 12/16/20 10:40 AM, Doug Ewell via Unicode wrote:
>
> What I don't understand here is why this is being framed implicitly as a Unicode problem, or an XML problem, or a general law of nature ("why can’t a Bengali-speaking person use the Bengali digits"), instead of an inherent limitation of that particular library function used for that particular language.
>
> Yes, exactly.  This is "a characteristic of the code libraries, not a
> Unicode problem."
>
>
> There are probably reasonable reasons not to update the actual atol/strtol
> calls, but one could certainly write a library to do what you're talking
> about... and apparently someone has, by Bill Poser's report of his
> libuninum.  There ya go.
>
>
> ~mark
>
>
>

From doug at ewellic.org  Sun Dec 20 15:40:14 2020
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 20 Dec 2020 14:40:14 -0700
Subject: Unicode is universal, so how come that universality
 =?UTF-8?Q?doesn=E2=80=99t=20apply=20to=20digits=3F?=
Message-ID: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com>

Zach Lym wrote:
 
> I don't think it's fair to dismiss this as "not a unicode problem."
> As the OP pointed out, support for non-latin variable names is largely
> due to Unicode's identity standard and extensive implementation
> advice.
 
I don't recall Roger saying anything about non-Latin variable names. He
wrote:
 
> Why, for example, can’t a Bengali-speaking person create XML such as
> this:
> <??????_?????>??</??????_?????>
> or write a program assignment statement like this:
>             ??????_????? = ??
 
This doesn't claim that the Bengali variable name
??????_????? is not supported, but rather the
mixed Bengali/Oriya constant ??. In fact, a few lines earlier Roger
wrote:
 
> a Bengali-speaking person can write this:
>              ??????_????? = 42
 
so variable names aren't the issue.
 
> The section on numbering (5.5) is only a page long and essentially
> recommends handling decimal based numbering systems.  There isn't
> nearly as much care given to this topic.
 
Bengali and Oriya are decimal-based. (Whether they should be used
together in a single number is another matter.) The first paragraph of
Section 5.5 specifically discusses interpreting Devanagari digits as one
would interpret Basic Latin digits. I don't know what needs to be added
here.
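
To make the point about interpreting other decimal digits like Basic
Latin digits concrete, here is a small Python sketch. It relies on the
standard library's unicodedata.decimal(); parse_decimal is a
hypothetical helper written for this example only, not the libuninum
API mentioned earlier in the thread.

```python
import unicodedata

def parse_decimal(text: str) -> int:
    """Hypothetical helper: read a string of Unicode decimal digits as an integer."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for non-digit characters
        value = value * 10 + unicodedata.decimal(ch)
    return value

devanagari = "\u096a\u0968"   # DEVANAGARI DIGIT FOUR, DEVANAGARI DIGIT TWO
bengali = "\u09ea\u09e8"      # BENGALI DIGIT FOUR, BENGALI DIGIT TWO
print(parse_decimal(devanagari), parse_decimal(bengali), parse_decimal("42"))
```

Python's built-in int() already accepts such digits as well, which
supports the view that this is a matter of implementation support
rather than of the character encoding.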
 
> There is a standard annex on mathematics, but that is in PDF form and
> is largely concerned with parsing and display of mathematical
> formulas.
 
UTR #25 (a Technical Report, not a Standard Annex) does focus on Basic
Latin digits, at one point (2.2) claiming that Basic Latin digits are
essentially the only digits used in math, but it's true that the UTR is
about math notation and that isn't really in scope here. The fact that
the UTR is a PDF document doesn't seem pertinent.
 
> However, as is the answer to most questions, it is a matter of time
> and money. If someone is willing to spend the time expanding 5.5
> or writing a new annex, I am sure the Unicode committee would be happy to
> review it.  Would you be interested in doing that legwork?
 
Again, I don't see what is lacking in Section 5.5, especially
considering its Devanagari example. The legwork that needs to be done is
to make implementations more internationalized and more Unicode-aware.
 
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
 

From asmusf at ix.netcom.com  Sun Dec 20 17:13:01 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 20 Dec 2020 15:13:01 -0800
Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?=
 =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?=
In-Reply-To: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com>
References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com>
Message-ID: <417fce02-95b0-dd8f-49aa-8e056c033e93@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201220/6341aee7/attachment.htm>

From duerst at it.aoyama.ac.jp  Mon Dec 21 03:08:08 2020
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=)
Date: Mon, 21 Dec 2020 18:08:08 +0900
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
Message-ID: <e68d936e-8158-0d47-fbb2-c713b3264563@it.aoyama.ac.jp>

Hello David, others,

On 20/12/2020 16:23, David Starner via Unicode wrote:
> On Sat, Dec 19, 2020 at 4:49 AM Otto Stolz via Unicode
> <unicode at unicode.org> wrote:
>> A notorious German example:
>>     Er hat in Moskau liebe Genossen. (= He’s got dear comrades at Moskow)
>>     Er hat in Moskau Liebe genossen. (= He has enjoyed love at Moskow)
>>     (And I assure you, the prosody varies accordingly, hence the
>>     difference is quite clear in speech, and must be preserved
>>     in writing.)
> 
> She _loves_ him !?! (= I can't believe her emotion towards him is love.)
> She loves _him_ !?! (= I can't believe that he is the one she loves,
> and not someone else.)
> 
> And the prosody varies accordingly, and any accurate preservation in
> writing would need to record the difference.

I think the above "and must be preserved in writing" is easy to 
misunderstand, as it is a bit too strong. It wouldn't have been 
preserved on very early computers (or earlier, in telegrams) that only 
used upper case. But there was a very strong expectation that it would 
be preserved on things as simple as a typewriter, and definitely also in 
handwriting.

On the other hand, there is no such expectation for your example. If 
prosody has to be reconstructed, that might happen e.g. from context 
(e.g. in a playscript), or the sentences might have been rewritten for 
clarity in the first place.

I don't think there is a single writing system that is able to denote 
every aspect of spoken language. When compared with spoken language, 
most writing systems leave something out. (Some may also add something, 
e.g. distinction of some homonyms.)


>> As only the author (and no other stage, be it human or automatic) can
>> know the intended meaning, Unicode is quite right when encoding the case
>> distinction.
> 
> Meh. I could come up with similar examples, though probably a bit more
> contrived, for just about every bit of markup. Italics/emphasis has a
> bunch of pretty clear meaning changes, like the example above,
> possibly more than casing in English. Fraktur/Antiqua mixing allows
> for any number of examples; "<fraktur>Er was</fraktur> clever." is
> different from "<fraktur>Er was clever</fraktur>".* Casing certainly
> had more of an argument to be encoded in the character set than
> italics, historically,

Exactly.


> but I can imagine an alternate history, maybe
> one in which the leaders in computing history used a non-casing script, where
> casing was relegated to markup, and a lot of issues would be
> easier--no more problems with case-insensitive matching, and the
> Turkish i would be a font difference under markup.

An alternate history indeed. The history we followed gave us italics 
relegated to markup, and avoided the problems with italic-insensitive 
matching. And please note that your alternate history does NOT lead to 
technology that encodes italics separately. [And that I was perfectly 
able to put stress on a word in the previous sentence without italics, 
even if the main purpose of that was just to make a point.] Also, it's 
not clear that encoders starting with a non-casing script would have 
decided to relegate casing to markup. It's pretty annoying to mark up 
single letters, and to change the markup when a word moves to the start 
of a sentence, and these are the main uses for upper case.


> * Italics marking in English could serve the same role in making a
> bunch of examples; e.g. "The French man said to stop at the coin" and
> "The French man said to stop at the <i>coin</i>." mean different
> things.

The important thing here is "could". Unicode doesn't invent writing 
systems. And I have to admit that I don't understand the difference 
between these two sentences even with your italic markup. But that may 
be only me.

Regards,   Martin.

From prosfilaes at gmail.com  Mon Dec 21 03:48:02 2020
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 21 Dec 2020 01:48:02 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <e68d936e-8158-0d47-fbb2-c713b3264563@it.aoyama.ac.jp>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <e68d936e-8158-0d47-fbb2-c713b3264563@it.aoyama.ac.jp>
Message-ID: <CAMZ=zj4+=+zEAU7VyzAYvMKCCuLR7KwtVoFG80jpYm9+G+GaNA@mail.gmail.com>

On Mon, Dec 21, 2020 at 1:10 AM Martin J. Dürst via Unicode
<unicode at unicode.org> wrote:
> > She _loves_ him !?! (= I can't believe her emotion towards him is love.)
> > She loves _him_ !?! (= I can't believe that he is the one she loves,
> > and not someone else.)
> >
> > And the prosody varies accordingly, and any accurate preservation in
> > writing would need to record the difference.
>
> I think the above "and must be preserved in writing" is easy to
> misunderstand, as it is a bit too strong. It wouldn't have been
> preserved on very early computers (or earlier, in telegrams) that only
> used upper case. But there was a very strong expectation that it would
> be preserved on things as simple as a typewriter, and definitely also in
> handwriting.

Er, but that's a different argument altogether. An expectation that it
be preserved is entirely different from "any accurate preservation in
writing would need to record the difference."

> On the other hand, there is no such expectation for your example. If
> prosody has to be reconstructed, that might happen e.g. from context
> (e.g. in a playscript), or the sentences might have been rewritten for
> clarity in the first place.

I'd say there's certainly an expectation that emphasis be preserved in
those statements in some way. If those were real statements, one can
not simply rewrite them, and if they were used in fiction, rewriting
would change the colloquial effect.

>  And please note that your alternate history does NOT lead to
> technology that encodes italics separately.

Sure. The response was about the silliness of the argument, not for
italics being encoded.

> [And that I was perfectly
> able to put stress on a word in the previous sentence without italics,
> even if the main purpose of that was just to make a point.]

You also could have written the sentence in all caps.

> > * Italics marking in English could serve the same role in making a
> > bunch of examples; e.g. "The French man said to stop at the coin" and
> > "The French man said to stop at the <i>coin</i>." mean different
> > things.
>
> The important thing here is "could". Unicode doesn't invent writing
> systems. And I have to admit that I don't understand the difference
> between these two sentences even with your italic markup. But that may
> be only me.

I could create many examples where the italics distinguishes the
meaning, because, like the Fraktur/Antiqua example, one use of italics
in English is to denote foreign words. English "coin" and French
"coin" are false friends; the first sentence says to stop at the coin,
and the second says to stop at the corner.

-- 
The standard is written in English . If you have trouble understanding
a particular section, read it again and again and again . . . Sit up
straight. Eat your vegetables. Do not mumble. -- _Pascal_, ISO 7185
(1991)


From frederic.grosshans at gmail.com  Mon Dec 21 04:10:09 2020
From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=)
Date: Mon, 21 Dec 2020 11:10:09 +0100
Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?=
 =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?=
In-Reply-To: <CABWuLVc7yzeYDgjgGbQQF5apNChxsSVygG8Q6f-zijzeevNm-g@mail.gmail.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
 <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>
 <CABWuLVc7yzeYDgjgGbQQF5apNChxsSVygG8Q6f-zijzeevNm-g@mail.gmail.com>
Message-ID: <a7723d04-0fbb-7b6d-7c1b-df3f2cc30676@gmail.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201221/3f80e450/attachment.htm>

From asmusf at ix.netcom.com  Mon Dec 21 04:40:44 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 21 Dec 2020 02:40:44 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <e68d936e-8158-0d47-fbb2-c713b3264563@it.aoyama.ac.jp>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <e68d936e-8158-0d47-fbb2-c713b3264563@it.aoyama.ac.jp>
Message-ID: <7349b420-3a2a-2c80-9f78-bba839d9ec63@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201221/a7423b78/attachment.htm>

From lyratelle at gmx.de  Mon Dec 21 05:21:27 2020
From: lyratelle at gmx.de (Dominikus Dittes Scherkl)
Date: Mon, 21 Dec 2020 12:21:27 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
Message-ID: <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>

Am 20.12.20 um 08:23 schrieb David Starner via Unicode:
> On Sat, Dec 19, 2020 at 4:49 AM Otto Stolz via Unicode
> <unicode at unicode.org> wrote:
>> A notorious German example:
>>     Er hat in Moskau liebe Genossen. (= He’s got dear comrades at Moskow)
>>     Er hat in Moskau Liebe genossen. (= He has enjoyed love at Moskow)
>>     (And I assure you, the prosody varies accordingly, hence the
>>     difference is quite clear in speech, and must be preserved
>>     in writing.)
>
> She _loves_ him !?! (= I can't believe her emotion towards him is love.)
> She loves _him_ !?! (= I can't believe that he is the one she loves,
> and not someone else.)
>
> And the prosody varies accordingly, and any accurate preservation in
> writing would need to record the difference.
Prosody is a wholly different thing, as others already mentioned.
But in fact, you DID preserve it - in plain text - by adding an
underscore before and after the word with emphasis. You could also have
used ' or " or even * for the same effect, but nevertheless it is
already possible to preserve the special intent of the author _without_
any further additions.

Also even with italics allowed (and maybe bold or other style features)
this does not indicate _what_ was special about the highlighted words.
Was it emphasis? Or indicated a thought? Or a special meaning of an
ambiguous word? Or whatever else? - all this would need further
agreement or conventions, which are not standardized so far.

--
                                          Dominikus Dittes Scherkl


From richard.wordingham at ntlworld.com  Mon Dec 21 05:27:44 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Dec 2020 11:27:44 +0000
Subject: Unicode is universal, so how come that universality
 =?UTF-8?B?ZG9lc27igJl0?= apply to digits?
In-Reply-To: <417fce02-95b0-dd8f-49aa-8e056c033e93@ix.netcom.com>
References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com>
 <417fce02-95b0-dd8f-49aa-8e056c033e93@ix.netcom.com>
Message-ID: <20201221112744.2c88dace@JRWUBU2>

On Sun, 20 Dec 2020 15:13:01 -0800
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> Those data may not support parsing or formatting arbitrary
> mixed-script digit combinations. That is also OK, because the data is
> geared towards getting the ordinary use of numbers correct for as
> many locales and languages, not to deal with fanciful stuff that
> doesn't have a real-life user community using it in daily life.

I can imagine a few situations where mixed sequences may occur.
Firstly, the early non-Indian Unicode usage of Tamil script place
notation would have required that the 'digit zero' come from another
script, as Unicode initially only supported Indian Tamil script usage,
which lacks a zero.

Secondly, but not strictly an example, it seems that the Lao-style of
the Tai Tham script will mix the use of the two digit sets.

I wouldn't be surprised at the use of eclectic mixes of Arabic digits
at the eastern end of the Arabic script domain.  The glyph shapes of
the EXTENDED ARABIC-INDIC digits are language-dependent, and
language-dependence has only recently hit mainstream rendering for the
masses.

I wouldn't be surprised to find mixed selections in use in the Union of
Burma.  That could be a big nuisance, because the three series of
digits provide some opportunity for digits to spoof digits!
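
A rough Python sketch of how an implementation might flag such mixed
sequences follows. It leans on the assumption that the decimal digits
of each set are encoded contiguously from zero to nine, so that a
digit's code point minus its value identifies the set; digit_sets and
is_mixed are hypothetical names introduced for this example.

```python
import unicodedata

def digit_sets(text: str) -> set:
    """Hypothetical helper: code points of the zero digit of every digit set used."""
    zeros = set()
    for ch in text:
        value = unicodedata.decimal(ch, None)   # None for non-digits
        if value is not None:
            # Digits of one set are assumed contiguous, so subtracting the
            # digit's value from its code point gives that set's zero.
            zeros.add(ord(ch) - value)
    return zeros

def is_mixed(text: str) -> bool:
    """True when digits from more than one digit set occur in the text."""
    return len(digit_sets(text)) > 1

myanmar_only = "\u1041\u1042"   # MYANMAR DIGIT ONE, MYANMAR DIGIT TWO
mixed = "\u1041\u1092"          # MYANMAR DIGIT ONE, MYANMAR SHAN DIGIT TWO
print(is_mixed(myanmar_only), is_mixed(mixed))
```

A check of this kind only reports that sets are mixed; whether a given
mixture is legitimate usage or a spoofing attempt remains a policy
decision for the application.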

Richard.

From wjgo_10009 at btinternet.com  Mon Dec 21 08:17:23 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 21 Dec 2020 14:17:23 +0000 (GMT)
Subject: Expressing thoughts in plain text (from Re: Italics get used to
 express important semantic meaning, so unicode should support them)
In-Reply-To: <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
Message-ID: <6960bdc0.5ad.17685a97278.Webtop.220@btinternet.com>


Dominikus Dittes Scherkl wrote as follows.
> Also even with italics allowed (and maybe bold or other style 
> features) this does not indicate _what_ was special about the 
> highlighted words. Was it emphasis? Or indicated a thought? Or a 
> special meaning of an ambiguous word? Or whatever else? - all this 
> would need further agreement or conventions, which are not 
> standardized so far.
In my novels I express a character's thoughts, as contrasted with his or 
her spoken words, by using single quotes for thoughts and double quotes 
for spoken words. In fact, the desktop publishing software that I use 
automatically substitutes smart quotes, provided that the font has them: 
The font that I usually use for text has smart quotes.
As far as I am aware this is just my own way of writing, though it is 
possible that I saw it somewhere years ago and it was in my memory 
somewhere and that memory influenced me.
It may perhaps be non-standard, but it seems to work fine. I publish my 
novels myself in pure electronic format.
William Overington
Monday 21 December 2020



From abrahamgross at disroot.org  Mon Dec 21 13:05:55 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Mon, 21 Dec 2020 19:05:55 +0000 (UTC)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
Message-ID: <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>

It's also possible to come up with a system to use double letters or, let's say, exclamation point+letter instead of any uppercase letters. I can write !belgium or bbelgium instead of Belgium and get ppl to agree to do it and then I wouldn't need italics.
The only reason why things like _italics_ or *italics* are around is because of the lack of real italics. I would go as far as to say that the very existence of *italics* in plain text shows that there's a real need for italics when writing plain text.
This is a workaround around a real problem of the lack of italics if I've ever seen one?

Dec 21, 2020 6:22:26 AM Dominikus Dittes Scherkl via Unicode <unicode at unicode.org>:

> Prosody is a wholly different thing, as others already mentioned.
> But in fact, you DID preserve it - in plain text - by adding an
> underscore before and after the word with emphasis. You could also have
> used ' or " or even * for the same effect, but nevertheless it is
> already possible to preserve the special intent of the author _without_
> any further additions.
> 


From sosipiuk at gmail.com  Mon Dec 21 13:20:25 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Mon, 21 Dec 2020 14:20:25 -0500
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
Message-ID: <CAM+ijLgfJQxnjDVKP=8XnJ8damfe+WK2AEV7WacSx0irGCOgkg@mail.gmail.com>

On Mon, Dec 21, 2020 at 6:23 AM Dominikus Dittes Scherkl via Unicode
<unicode at unicode.org> wrote:
>
> But in fact, you DID preserve it - in plain text - by adding an
> underscore before and after the word with emphasis. You could also have
> used ' or " or even * for the same effect, but nevertheless it is
> already possible to preserve the special intent of the author _without_
> any further additions.

This doesn't hold water. People can cobble together methods of
conveying meaning. It doesn't mean they're ideal, good, or even
acceptable. The use of underscores, asterisks, and whatnot to indicate
emphasis is a hack to fit with the limitations imposed by technology.
By the same logic, one could argue that ¿ and ¡ didn't need to be
encoded for the benefit of Spanish users, because they COULD just use
ordinary ? and ! and they would still be understood.

I can use an axe to bang nails into a wall, but it's silly to say I
don't REALLY need a hammer.

As a mildly interesting aside, technical limitations of print have
driven changes to language before. It's partly the reason why þ
(thorn) is no longer part of the English alphabet. It's still not an
excuse for doing similar things today. In a few more decades
underscores and asterisks may become fully accepted punctuation,
resulting from the limits we currently have in plain text. Technology
should adapt to us, not the other way around.

Indeed, I would argue that the use of such "human-readable markup" is
evidence FOR the inclusion of basic formatting in plain text. There is
such demand for it that people are willing to settle for inelegant
hacks to get their meaning across.

> Also even with italics allowed (and maybe bold or other style features)
> this does not indicate _what_ was special about the highlighted words.
> Was it emphasis? Or indicated a thought? Or a special meaning of an
> ambiguous word? Or whatever else? - all this would need further
> agreement or conventions, which are not standardized so far.

Newspapers often italicize words and they're clearly following some
(possibly internal) standard. The example of novels has already been
given. There are conventions for such things, often varying by medium
and language. The precise meaning of formatting does NOT need to be
standardised by Unicode to make it available as a tool.

Sławomir Osipiuk


From kenwhistler at sonic.net  Mon Dec 21 13:42:54 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 21 Dec 2020 11:42:54 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>
Message-ID: <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>


On 12/21/2020 11:05 AM, abrahamgross--- via Unicode wrote:
> The only reason why things like _italics_ or *italics* are around is because of the lack of real italics. I would go as far as to say that the very existence of *italics* in plain text shows that there's a real need for italics when writing plain text.
> This is a workaround around a real problem of the lack of italics if I've ever seen one?

Actually, simple markup conventions like that mostly date from early 
days of email, when plain text (and usually just ASCII at that) was all 
you got. (By the way, the most usual interpretation of those is 
_underscore_, /italic/, and *bold*, but whatever.)
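
The convention described above can be sketched as a tiny converter
from plain-text email markers to HTML. This is only an illustration
under stated assumptions: to_html and its regular expression are
hypothetical, handle only simple runs of words, and make no attempt to
cope with URLs or nested markers.

```python
import html
import re

# Marker character -> HTML element, following the interpretation given above
_MARKERS = {"_": "u", "/": "i", "*": "b"}

# A marker, one or more words, then the same marker, not glued to other word characters
_PATTERN = re.compile(r"(?<!\w)([_/*])(\w+(?: \w+)*)\1(?!\w)")

def to_html(text: str) -> str:
    """Hypothetical helper: turn _x_, /x/ and *x* into <u>, <i> and <b> elements."""
    escaped = html.escape(text)   # escape <, > and & before adding tags

    def replace(match):
        tag = _MARKERS[match.group(1)]
        return "<{0}>{1}</{0}>".format(tag, match.group(2))

    return _PATTERN.sub(replace, escaped)

print(to_html("_underscore_, /italic/, and *bold*"))
```

The styling here lives in a markup layer produced from the text,
while the underlying characters stay ordinary plain text.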

Nowadays, presto chango, most email clients support rich text (in HTML, 
usually), and you get to _underscore_, /italicize/, and *bold* your text 
correctly whenever you want to, and even change the font size to SHOUT, 
if you want.

Some folks here seem to be viewing the "problem" here the wrong way 
round. The issue isn't that plain text cannot preserve all the "meaning" 
conveyed in writing systems. When dealing with meaning conveyed with 
conventions that involve styling, font change, color and such, you 
simply depend on properly tiered text architecture and build support for 
that in rich text and markup. It is ass-backwards to try to continue to 
clot up plain text as the backbone of text interchange by trying to 
import all the complications of styling directly into it as if that 
representation were a plain text issue -- it isn't.

Instead the *real* problem here is that in some communication contexts 
that should be supporting rich text, implementations are still 
restricting people to plain text when what they really want is easily 
accessible and dependable rich text to convey more nuances accurately 
(or just to be more expressive). If Twitter is half-assed about 
supporting text styling, then direct your concerns in the proper 
direction. You don't fix Twitter's or texting apps' use of text by 
trying to force styling into the Unicode encoding of plain text.

--Ken


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201221/68476ac8/attachment.htm>

From asmusf at ix.netcom.com  Mon Dec 21 14:58:04 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 21 Dec 2020 12:58:04 -0800
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <CAM+ijLgfJQxnjDVKP=8XnJ8damfe+WK2AEV7WacSx0irGCOgkg@mail.gmail.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <CAM+ijLgfJQxnjDVKP=8XnJ8damfe+WK2AEV7WacSx0irGCOgkg@mail.gmail.com>
Message-ID: <2408ce9d-aa27-af56-09cf-3a0a5fc80e24@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201221/5da3df23/attachment.htm>

From jameskass at code2001.com  Mon Dec 21 16:08:50 2020
From: jameskass at code2001.com (James Kass)
Date: Mon, 21 Dec 2020 22:08:50 +0000
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <2408ce9d-aa27-af56-09cf-3a0a5fc80e24@ix.netcom.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <CAM+ijLgfJQxnjDVKP=8XnJ8damfe+WK2AEV7WacSx0irGCOgkg@mail.gmail.com>
 <2408ce9d-aa27-af56-09cf-3a0a5fc80e24@ix.netcom.com>
Message-ID: <3a3f8743-efdd-6d66-d4ce-788d53017bad@code2001.com>


On 2020-12-21 8:58 PM, Asmus Freytag via Unicode wrote:
> On 12/21/2020 11:20 AM, S?awomir Osipiuk via Unicode wrote:
> > I can use an axe to bang nails into a wall, but it's silly to say I
> > don't REALLY need a hammer.
>
> To paraphrase Ken: if you need rich text, you really need rich text, so go out
> and tackle those that force you to use plain text instead.
>
> A./
>

Some of us choose to use plain-text rather than to be forced into using 
rich-text.

Written communication is of, by, and for human beings. Regardless of 
the media used to exchange that communication or the tools used to 
produce it. As human beings (the inventors and owners of the graphic 
symbols used in written communication), it is our birthright to insert 
any graphic character whatsoever in our written communication for any 
purpose we deem fit. Earnestly or whimsically.

It's fair use – never abuse.

We don't need anyone's permission to exercise that birthright.

People have been using and repurposing each other's graphic symbols 
since day one. That's how writing evolves.

Twitter users are already using the Latin italic letters (which had been 
repurposed as math symbols) encoded in Unicode to convey the notion that 
their authorial intention was to deploy Latin italic letters. Their 
(the Twitter users') needs are being well served by the existing Unicode 
repertoire. If Twitter thought that this was some kind of problem, or 
if Twitter users were /really/ clamoring for rich-text, then Twitter 
would have acted long ago.
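
As a rough illustration of the practice described above, here is a minimal sketch of mapping ASCII letters onto the mathematical italic letters. The helper name to_math_italic is hypothetical; the code points are those of the Mathematical Alphanumeric Symbols block, where the slot for italic small h is reserved and U+210E PLANCK CONSTANT is used instead. The last line shows that compatibility normalization (NFKC) folds the styled letters back to plain ASCII, so the "italics" do not survive every text process.

```python
import unicodedata

# Code points from the Mathematical Alphanumeric Symbols block.
ITALIC_CAPITAL_A = 0x1D434  # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E    # MATHEMATICAL ITALIC SMALL A
PLANCK_CONSTANT = "\u210E"  # used for italic small h; its slot in the block is reserved


def to_math_italic(text: str) -> str:
    """Map ASCII letters to mathematical italic letters (hypothetical helper)."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif ch == "h":
            out.append(PLANCK_CONSTANT)
        elif "a" <= ch <= "z":
            out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, spaces and punctuation pass through unchanged
    return "".join(out)


styled = to_math_italic("really")
# NFKC applies the compatibility decompositions, which lose the styling.
plain_again = unicodedata.normalize("NFKC", styled)
```

Whether a given platform normalizes, searches, or reads such text aloud sensibly is a separate question that this sketch does not address.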


From indolering at gmail.com  Mon Dec 21 19:00:12 2020
From: indolering at gmail.com (Zach Lym)
Date: Mon, 21 Dec 2020 17:00:12 -0800
Subject: =?UTF-8?Q?Re=3A_Unicode_is_universal=2C_so_how_come_that_universal?=
 =?UTF-8?Q?ity_doesn=E2=80=99t_apply_to_digits=3F?=
In-Reply-To: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com>
References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com>
Message-ID: <CABWuLVed=WUYk5PGf-r4A93CSygWBtOVvWJ98NYUsFxoJcJtCg@mail.gmail.com>

> I don't recall Roger saying anything about non-Latin variable names.

We agree that non-latin variable names are not the issue, I just
worded my response clumsily ¯\_(ツ)_/¯

So ... why isn't the treatment of parsing numbers as good as variable
names?  Well, to cite Conway's Law,  "Any organization that designs a
system (defined broadly) will produce a design whose structure is a
copy of the organization's communication structure."

The identifier standard annex is ~30 pages of polished hand holding
for a language implementor: it provides examples, gets into parsing,
gives advice on customization, and explains tricky issues such as
handling zero-width-joiners.
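
To make the contrast concrete, here is a small sketch using Python as one example language (an assumption for illustration; the thread does not name a language). Python's language reference describes its identifier rules as based on UAX #31, while numeric literals in source code are limited to ASCII digits. The helper compiles() is hypothetical.

```python
# str.isidentifier() applies Python's own identifier rules.
candidates = ["total", "été", "変数", "1abc", "a-b"]
results = {name: name.isidentifier() for name in candidates}


def compiles(source: str) -> bool:
    """Return True if Python accepts the source text (hypothetical helper)."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# A Devanagari variable name is accepted by the compiler ...
identifier_ok = compiles("संख्या = 123")
# ... but a Devanagari numeric literal (U+0967 U+0968 U+0969) is rejected.
literal_ok = compiles("x = \u0967\u0968\u0969")
```

Other languages draw the line differently, so this shows one implementation's choice rather than anything the standard requires.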

I assume UAX 31 has received a disproportionate level of attention
thanks to hammering out DNS and URL standards, but maybe that's just
because I have a background in DNS.

>
> > The section on numbering (5.5) is only a page long and essentially
> > recommends handling decimal based numbering systems.  There isn't
> > nearly as much care given to this topic.
>
> Bengali and Oriya are decimal-based. (Whether they should be used
> together in a single number is another matter.) The first paragraph of
> Section 5.5 specifically discusses interpreting Devanagari digits as one
> would interpret Basic Latin digits. I don't know what needs to be added
> here.

As Frédéric points out in his reply, section 22.3 has a lengthier
treatment (which I totally missed).  At a minimum, 5.5 should
cross-reference 22.3.
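
For readers who want to see what "interpreting Devanagari digits as one would interpret Basic Latin digits" can look like in practice, here is a minimal sketch. It relies on the decimal digit values in the Unicode Character Database as exposed by Python's unicodedata module; parse_decimal is a hypothetical helper, and rejecting mixed digit sets is a policy choice made here for illustration, not something the thread or the standard mandates.

```python
import unicodedata


def parse_decimal(text: str) -> int:
    """Parse decimal digits from a single digit set (hypothetical helper)."""
    if not text:
        raise ValueError("empty string")
    zeros = set()
    value = 0
    for ch in text:
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            raise ValueError(f"not a decimal digit: {ch!r}")
        # Each set of decimal digits is encoded contiguously from 0 to 9,
        # so subtracting the value yields the code point of that set's zero.
        zeros.add(ord(ch) - digit)
        value = value * 10 + digit
    if len(zeros) > 1:
        raise ValueError("digits from more than one digit set")
    return value


devanagari = "\u0967\u0968\u0969"  # DEVANAGARI DIGIT ONE, TWO, THREE
mixed = "\u09E7\u09E8\u0B69"       # BENGALI DIGIT ONE, TWO + ORIYA DIGIT THREE

same_script = parse_decimal(devanagari)
builtin_mixed = int(mixed)  # the built-in int() accepts the mixture
try:
    parse_decimal(mixed)
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

The built-in int() and the stricter helper disagree on the Bengali/Oriya mixture, which is exactly the "another matter" raised in the quoted text.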

> > There is a standard annex on mathematics, but that is in PDF form and
> > is largely concerned with parsing and display of mathematical
> > formulas.
>
> UTR #25 (a Technical Report, not a Standard Annex) does focus on Basic
> Latin digits, at one point (2.2) claiming that Basic Latin digits are
> essentially the only digits used in math, but it's true that the UTR is
> about math notation and that isn't really in scope here.

I think it's significant to answering Roger's question.  How much
demand is there for using native numeric literals when most
control-flow logic is going to be in English?

> The fact that the UTR is a PDF document doesn't seem pertinent.

PDFs do not rank well on Google, you can't deeply link to specific
sections, and they are generally a PITA to work with.  The Unicode
standard publishes PDFs *not* because it is a good idea, but because
it's inconvenient to change a 30-year-old publishing workflow.

> > However, as is the answer to most questions, it is a matter of time
> > and money. If someone is willing to spend the time expanding 5.5
> > writing a new annex, I am sure the Unicode committee would be happy to
> > review it.  Would you be interested in doing that legwork?
>
> Again, I don't see what is lacking in Section 5.5, especially
> considering its Devanagari example. The legwork that needs to be done is
> to make implementations more internationalized and more Unicode-aware.

Yes: it's ultimately on implementers and Unicode != i18n.

And: couldn't we do a better job at transitioning people to resources
on how to handle i18n in a more comprehensive fashion?

But also: Unicode is hella confusing, even to world-class programmers.
Shouldn't we try to recruit suckers like Roger and me into making it
better?

-Zach Lym


From kenwhistler at sonic.net  Mon Dec 21 20:08:32 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Mon, 21 Dec 2020 18:08:32 -0800
Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?=
 =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?=
In-Reply-To: <CABWuLVed=WUYk5PGf-r4A93CSygWBtOVvWJ98NYUsFxoJcJtCg@mail.gmail.com>
References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com>
 <CABWuLVed=WUYk5PGf-r4A93CSygWBtOVvWJ98NYUsFxoJcJtCg@mail.gmail.com>
Message-ID: <9e70ebf1-7f81-5730-1964-ad86d425a82e@sonic.net>


On 12/21/2020 5:00 PM, Zach Lym via Unicode wrote:
>> The fact that the UTR is a PDF document doesn't seem pertinent.
> PDFs do not rank well on Google, you can't deeply link to specific
> sections,

Actually, you can, if you set them up correctly:

https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf#G12146

That links right to Table 22-3, Script-Specific Decimal Digits on p. 
829, in Section 22.3 of the latest version of the core specification.
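
A small sketch of how such a link breaks down, using Python's urllib.parse. The fragment after "#" is not sent to the server; it is left to the client, here the PDF viewer, to resolve. That the fragment names a destination inside the PDF, and that a given viewer honours it, are assumptions; deep_link is a hypothetical helper.

```python
from urllib.parse import urldefrag

link = "https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf#G12146"

# urldefrag splits a URL into the part that identifies the document and
# the fragment that the client resolves on its own.
document_url, destination = urldefrag(link)


def deep_link(pdf_url: str, destination_name: str) -> str:
    """Build a link to a destination inside a PDF (hypothetical helper)."""
    base, _ = urldefrag(pdf_url)  # drop any fragment already present
    return f"{base}#{destination_name}"
```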

>   and they are generally a PITA to work with.
Well, your mileage may vary. HTML has its own PITA aspects.
>    The Unicode
> standard publishes PDFs *not* because it is a good idea, but because
> it's inconvenient to change a 30-year-old publishing workflow.

20, not 30, actually. Prior to Unicode 3.0, the Unicode Standard was 
done with a different family of editorial tooling. But yeah, it is 
inconvenient to change, especially since the document is riddled with 
hand-tweaked figures and hacked up fonts. And it's a thousand pages 
long, and it has internal indexing and the sections, figures, and tables 
are all cross-referenced in the document. And oh, did I mention? It's a 
thousand pages long.

Various folks have wanted to reformat it to something more web-friendly 
and searchable over the years, but they have tended to discover other 
things that they needed to do when faced with the actual amount of work 
involved. ;-)

--Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201221/c4d6e3dd/attachment.htm>

From junicode at jcbradfield.org  Tue Dec 22 04:15:09 2020
From: junicode at jcbradfield.org (Julian Bradfield)
Date: Tue, 22 Dec 2020 10:15:09 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>
 <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>
Message-ID: <slrnru3hpd.4mp.jcb@home.stevens-bradfield.com>

Veering further off-topic...but why not :-?

On 2020-12-21, Ken Whistler via Unicode <unicode at unicode.org> wrote:
> Actually, simple markup conventions like that mostly date from early 
> days of email, when plain text (and usually just ASCII at that) were all 
> you got. (By the way, the most usual interpretation of those is 
> _underscore_, /italic/, and *bold*, but whatever.)

But underscore is just the manuscript equivalent of italic
print...both naively (when you underline a word in a letter, you would
now italicize it in a typeset letter) and as formalized in
copy-editing markup.
So for many of us, _italic_ has always been natural, and /italic/
always looks a bit weird. 
It's curious that nobody (as far as I know) adapts copy-editing markup
and writes ~bold~ .


From wjgo_10009 at btinternet.com  Tue Dec 22 03:52:25 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 22 Dec 2020 09:52:25 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <3a3f8743-efdd-6d66-d4ce-788d53017bad@code2001.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <CAM+ijLgfJQxnjDVKP=8XnJ8damfe+WK2AEV7WacSx0irGCOgkg@mail.gmail.com>
 <2408ce9d-aa27-af56-09cf-3a0a5fc80e24@ix.netcom.com>
 <3a3f8743-efdd-6d66-d4ce-788d53017bad@code2001.com>
Message-ID: <386343f0.bc8.17689dd3b10.Webtop.51@btinternet.com>


Hi

James Kass wrote as follows.

> Written communication is of, by, and for human beings.
Wow, nominative, genitive, ablative and dative all in one short 
sentence.
Can you express that sentence with emoji?
Maybe if these abstract emoji were encoded in regular Unicode one could 
encode that sentence and lots of other sentences too.
http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm
Yet are abstract emoji acceptable to Unicode Inc. for encoding into The 
Unicode Standard?
Best regards,
William Overington
Tuesday 22 December 2020

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201222/915dcf39/attachment.htm>

From wjgo_10009 at btinternet.com  Tue Dec 22 04:31:31 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 22 Dec 2020 10:31:31 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <slrnru3hpd.4mp.jcb@home.stevens-bradfield.com>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>
 <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>
 <slrnru3hpd.4mp.jcb@home.stevens-bradfield.com>
Message-ID: <2ab4a4c9.c23.1768a010515.Webtop.51@btinternet.com>


Hi

Julian Bradfield wrote as follows.

> It's curious that nobody (as far as I know) adapts copy-editing markup
> and writes ~bold~ .

Well, that seems like a very good idea of a stylish way to express bold 
type. That can be put into practice now if people choose to use it.

Best regards,

William Overington

Tuesday 22 December 2020


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201222/a08cb3dc/attachment.htm>

From wjgo_10009 at btinternet.com  Tue Dec 22 04:50:25 2020
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 22 Dec 2020 10:50:25 +0000 (GMT)
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>
 <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>
Message-ID: <31808f37.c6b.1768a125280.Webtop.51@btinternet.com>


Hi
Ken Whistler wrote as follows.
> Actually, simple markup conventions like that mostly date from early 
> days of email, when plain text (and usually just ASCII at that) were 
> all you got.
Back in the early 1990s when only ASCII was available, the circumflex 
accented characters needed for Esperanto were often expressed by using a 
lowercase letter x after the ASCII base of the character.
Namely as follows,
Cx cx Gx gx Hx hx Jx jx Sx sx
I do not remember whether the U breve and the u breve were expressed as 
Ux or ux at all.
This seemed to me at first to be quite strange, but I got used to 
reading it.
The method was suitable for unambiguous use because the Esperanto 
language does not use the letter x in its alphabet.
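
The convention above can be undone mechanically, precisely because x is not an Esperanto letter. Here is a minimal sketch; from_x_system is a hypothetical helper, and the inclusion of Ux and ux for the breve letters is an assumption, since the message above does not confirm they were written that way.

```python
import re

# Base letter -> accented letter. The u/U entries are an assumption.
X_SYSTEM = {
    "C": "Ĉ", "c": "ĉ",
    "G": "Ĝ", "g": "ĝ",
    "H": "Ĥ", "h": "ĥ",
    "J": "Ĵ", "j": "ĵ",
    "S": "Ŝ", "s": "ŝ",
    "U": "Ŭ", "u": "ŭ",
}

# A base letter followed by x (or X in all-caps text).
_X_PAIR = re.compile(r"([CGHJSUcghjsu])[xX]")


def from_x_system(text: str) -> str:
    """Replace x-system digraphs with the accented letters (hypothetical helper)."""
    return _X_PAIR.sub(lambda m: X_SYSTEM[m.group(1)], text)


sample = from_x_system("ehxosxangxo cxiujxauxde")
```

The sketch assumes the input is Esperanto text throughout; foreign names containing such letter pairs would be converted too.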
Best regards,
William Overington
Tuesday 22 December 2020

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201222/a591d9c5/attachment.htm>

From indolering at gmail.com  Tue Dec 22 16:52:32 2020
From: indolering at gmail.com (Zach Lym)
Date: Tue, 22 Dec 2020 14:52:32 -0800
Subject: =?UTF-8?Q?Re=3A_Unicode_is_universal=2C_so_how_come_that_universal?=
 =?UTF-8?Q?ity_doesn=E2=80=99t_apply_to_digits=3F?=
In-Reply-To: <9e70ebf1-7f81-5730-1964-ad86d425a82e@sonic.net>
References: <20201220144014.665a7a7059d7ee80bb4d670165c8327d.10caafc26a.wbe@email15.godaddy.com>
 <CABWuLVed=WUYk5PGf-r4A93CSygWBtOVvWJ98NYUsFxoJcJtCg@mail.gmail.com>
 <9e70ebf1-7f81-5730-1964-ad86d425a82e@sonic.net>
Message-ID: <CABWuLVdRGvvE1KAGQKmRKoEpJRNcGE8FpFaF6wqjdy92wYPFcw@mail.gmail.com>

On Mon, Dec 21, 2020 at 6:08 PM Ken Whistler <kenwhistler at sonic.net> wrote:
> On 12/21/2020 5:00 PM, Zach Lym via Unicode wrote:
> PDFs do not rank well on Google, you can't deeply link to specific
> sections,
>
> Actually, you can, if you set them up correctly:
>
> https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf#G12146
>
> That links right to Table 22-3, Script-Specific Decimal Digits on p. 829, in Section 22.3 of the latest version of the core specification.

I don't want to be rude ... but protips just enable user abuse.
Being an expert in something insulates you from the harsh realities
faced by your users.

During my review of filename normalization decisions ... experts were
confused and made poor choices at virtually every step.  Usability
engineering views confused end-users as a signal that important
information isn't being surfaced in an appropriate manner.  When you
are failing to meet your target demographic of
smart-people-in-a-hurry, what hope is there for us idiots?

Not much, **because no one reads manuals.**  That is one of the most
reliable findings of technical documentation research stretching back
to 1987 [1].

>  and they are generally a PITA to work with.
>
> Well, your mileage may vary. HTML has its own PITA aspects.

Everything involves trade-offs, but Unicode's PDFs are even worse than
the IETF's typewriter emulator.

>   The Unicode
> standard publishes PDFs *not* because it is a good idea, but because
> it's inconvenient to change a 30-year-old publishing workflow.
>
> 20, not 30, actually. Prior to Unicode 3.0, the Unicode Standard was done with a different family of editorial tooling.

So what are you using, DocBook?

> But yeah, it is inconvenient to change, especially since the document is riddled with hand-tweaked figures and hacked up fonts. And it's a thousand pages long, and it has internal indexing and the sections, figures, and tables are all cross-referenced in the document. And oh, did I mention? It's a thousand pages long.

Oh, no.  That sounds terrible ... for someone who isn't a print ?!

My father taught graphic design, so I grew up messing around with
PageMaker and doing table based HTML layouts.  My post high-school job
involved Vietnamese and Amharic print work.  The Ethiopian history
books weren't a thousand pages long, but they had reference indexes,
figures, and codepage mixing ... the whole nine ?.

> Various folks have wanted to reformat it to something more web-friendly and searchable over the years,

I only suggested additional cross-references within the standard and
possibly a new technical report, which would be more *user* friendly.

The UX rabbit hole started based on my assertion that the
disproportionate amount of effort put into the identifier
documentation and widespread support for i18n variable names is more
than *just* a correlation.

> but they have tended to discover other things that they needed to do when faced with the actual amount of work involved. ;-)

That is not something an outsider could do, as the primary audience
for any product are the people who make it.  And if the "insiders"
don't see a problem....

Hence my invocation of Conway's law: the standard reflects the
particular bureaucratic mould in which it is formed.

- Zach Lym
[1]: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=minimal+manual&btnG=


From duerst at it.aoyama.ac.jp  Tue Dec 22 18:14:59 2020
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2e_D=c3=bcrst?=)
Date: Wed, 23 Dec 2020 09:14:59 +0900
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>
 <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>
Message-ID: <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp>

Hello everybody,

Just for what it's worth, here are a few details on how at least some 
email clients handle ASCII email styling conventions (/ for italics, _ 
for underscore, and * for boldface).

On 22/12/2020 04:42, Ken Whistler via Unicode wrote:
> 
> On 12/21/2020 11:05 AM, abrahamgross--- via Unicode wrote:
>> The only reason why things like _italics_ or *italics* are around is 
>> because of the lack of real italics. I would go as far as to say that 
>> the very existence of *italics* in plain text shows that there's a real 
>> need for italics when writing plain text.
>> This is a workaround around a real problem of the lack of italics if 
>> I've ever seen one?

The two places above are displayed without styling in plaintext, 
probably because they are quoted. They show up styled in HTML because 
that contains additional tags (<b>,...).

> Actually, simple markup conventions like that mostly date from early 
> days of email, when plain text (and usually just ASCII at that) were all 
> you got. (By the way, the most usual interpretation of those is 
> _underscore_, /italic/, and *bold*, but whatever.)

These show up styled in plaintext display, but not in HTML, presumably 
because Ken entered the styling characters by hand (in the HTML version, 
there is no markup).

> Nowadays, presto chango, most email clients support rich text (in HTML, 
> usually), and you get to _underscore_, /italicize/, and *bold* your text 
> correctly whenever you want to, and even change the font size to SHOUT, 
> if you want.

These show up styled in HTML, most probably because Ken used the text 
editor to style them that way. The plaintext version contains ASCII 
email styling characters (but the HTML version doesn't), and my guess is 
that they were added when the mailer produced the plaintext version.
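
To illustrate the kind of interpretation a mail client might apply when displaying plain text, here is a minimal sketch that turns the three ASCII conventions into HTML tags. It is not the algorithm of any particular mailer; style_to_html and the exact boundary rules (a marker must not touch a word character on its outer side) are assumptions made for the example.

```python
import html
import re

STYLE_TAGS = {"*": "b", "/": "i", "_": "u"}

# An opening marker not preceded by a word character or another marker,
# a span that starts and ends with a letter or digit, then the same
# marker again, not followed by a word character or another marker.
_STYLED = re.compile(
    r"(?<![\w*/])([*/_])([^\W_](?:[^*/_\n]*[^\W_])?)\1(?![\w*/])"
)


def style_to_html(text: str) -> str:
    """Convert *bold*, /italic/ and _underscore_ to HTML (hypothetical helper)."""
    escaped = html.escape(text)  # escape first so the added tags stay intact

    def wrap(match: re.Match) -> str:
        tag = STYLE_TAGS[match.group(1)]
        return f"<{tag}>{match.group(2)}</{tag}>"

    # A single pass, so tags inserted for one style are never rescanned.
    return _STYLED.sub(wrap, escaped)


converted = style_to_html("_underscore_, /italicize/, and *bold* your text")
```

With these boundary rules, markers glued to the preceding word or a path such as /usr/bin/ are left alone; a different client could reasonably decide otherwise.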

Your mailer and your mileage may vary.

Regards,   Martin.

> Some folks here seem to be viewing the "problem" here the wrong way 
> round. The issue isn't that plain text cannot preserve all the "meaning" 
> conveyed in writing systems. When dealing with meaning conveyed with 
> conventions that involve styling, font change, color and such, you 
> simply depend on properly tiered text architecture and build support for 
> that in rich text and markup. It is ass-backwards to try to continue to 
> clot up plain text as the backbone of text interchange by trying to 
> import all the complications of styling directly into it as if that 
> representation were a plain text issue -- it isn't.
> 
> Instead the *real* problem here is that in some communication contexts 
> that should be supporting rich text, implementations are still 
> restricting people to plain text when what they really want is easily 
> accessible and dependable rich text to convey more nuances accurately 
> (or just to be more expressive). If Twitter is half-assed about 
> supporting text styling, then direct your concerns in the proper 
> direction. You don't fix Twitter's or texting apps' use of text by 
> trying to force styling into the Unicode encoding of plain text.
> 
> --Ken


From kent.b.karlsson at bahnhof.se  Tue Dec 22 18:55:15 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Wed, 23 Dec 2020 01:55:15 +0100
Subject: Italics get used to express important semantic meaning, so
 unicode should support them
In-Reply-To: <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>
 <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>
 <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp>
Message-ID: <BE1DA165-B5CD-4CAE-BA7A-536E3E66CC19@bahnhof.se>


> 23 dec. 2020 kl. 01:14 skrev Martin J. Dürst via Unicode <unicode at unicode.org>:
> 
> ...
> 
>> Nowadays, presto chango, most email clients support rich text (in HTML, usually), and you get to _underscore_, /italicize/, and *bold* your text correctly whenever you want to, and even change the font size to SHOUT, if you want.
> 
> These show up styled in HTML, most probably because Ken used the text editor to style them that way. The plaintext version contains ASCII email styling characters (but the HTML version doesn't), and my guess is that they were added when the mailer produced the plaintext version.
> 
> Your mailer and your mileage may vary.

That kind of markup now goes by the name "markdown" (apparently; I don't like that name, the pun only(?) works in English), and each system has its own variant. Wikipedia has one, various chat platforms have theirs (all likely slightly different), Trac has its variant, Jira has its variant, etc. Some have bullet lists, some do not; some have headings, perhaps allowing different levels; some allow for strike-through. The point at which the "markdown" is converted to (e.g.) HTML (or other more robust markup) may vary. It's the Wild Wild West. And it is not at all robust: mishaps easily happen and may be hard to get out of. But I agree it is handy (easy to type on the keyboard); most often it works as intended, but not always…
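
A short sketch of the kind of mishap meant here, using deliberately naive rules that are assumptions for illustration and not the rules of any named system: any text between two asterisks becomes bold, and any text between two slashes becomes italic.

```python
import re


def naive_bold(text: str) -> str:
    """Treat anything between two asterisks as bold (deliberately naive)."""
    return re.sub(r"\*(.+?)\*", r"<b>\1</b>", text)


def naive_italic(text: str) -> str:
    """Treat anything between two slashes as italic (deliberately naive)."""
    return re.sub(r"/(.+?)/", r"<i>\1</i>", text)


# Intended use works as expected.
intended = naive_bold("this is *important*")
# Arithmetic and file paths are silently mangled.
arithmetic = naive_bold("2*3*4 = 24")
path = naive_italic("see /usr/local/bin")
```

Each system patches such cases with its own boundary and escaping rules, which is one reason the variants diverge.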

/Kent K


> Regards,   Martin.


From sosipiuk at gmail.com  Tue Dec 22 19:01:59 2020
From: sosipiuk at gmail.com (=?UTF-8?Q?S=C5=82awomir_Osipiuk?=)
Date: Tue, 22 Dec 2020 20:01:59 -0500
Subject: Italics get used to express important semantic meaning,
 so unicode should support them
In-Reply-To: <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp>
References: <CA+jfYh=V-KMz5jzmz7ezMC7CNV7Pwpk3NQ7gcYX=UNXO7j-Ksw@mail.gmail.com>
 <000301d6d00e$4d244330$e76cc990$@ewellic.org>
 <CA+jfYhm77rFhfQ1o-uPZznVaig4RB0-cB5h_Gkx+_U9eCrddsQ@mail.gmail.com>
 <8c25bdef-89d9-76ba-2b66-eb1a12990b20@ix.netcom.com>
 <7868386b-d705-4139-c106-bb2eab58fe39@it.aoyama.ac.jp>
 <3afeaab8-3871-4062-bf32-478d0bcfb98c@disroot.org>
 <001501d6d3d6$18ee97c0$4acbc740$@ewellic.org>
 <08cafd92-45b8-9d2e-3ed2-be187175639f@it.aoyama.ac.jp>
 <48e0e7bb-d041-41f8-8a34-0be3035bb75e@disroot.org>
 <17c33dcd-ccec-2e69-4911-ee0c471d21af@uni-konstanz.de>
 <CAMZ=zj7BAfYHF=gbb4pKtDJDVcbEdGQbw9RRPrCjT1ik9Ba7xA@mail.gmail.com>
 <b2398a5f-8138-bfb3-b7fb-950fa70ee549@gmx.de>
 <cdedd7e8-7a64-4b34-8e86-1a02491d84c0@disroot.org>
 <cfdcbef3-59dd-670e-fc70-d481057762fe@sonic.net>
 <3bce8fcb-cd55-aa09-fe24-a27b21101816@it.aoyama.ac.jp>
Message-ID: <000701d6d8c7$37f71700$a7e54500$@gmail.com>

I think you forgot the most important part: Which email clients?
None of the markup has the intended effect for me, and in pure plaintext none of it (currently) can. Whatever client you're using is interpreting the markup and applying the formatting.

-----Original Message-----
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Martin J. Dürst via Unicode
Sent: Tuesday, December 22, 2020 7:15 PM
To: unicode at unicode.org
Subject: Re: Italics get used to express important semantic meaning, so unicode should support them

Hello everybody,

Just for what it's worth, here are a few details on how at least some email clients handle ASCII email styling conventions (/ for italics, _ for underscore, and * for boldface).

On 22/12/2020 04:42, Ken Whistler via Unicode wrote:
> 
> On 12/21/2020 11:05 AM, abrahamgross--- via Unicode wrote:
>> The only reason why things like _italics_ or *italics* are around is 
>> because of the lack of real italics. I would go as far as to say that 
>> the very existence of *italics* in plain text shows that there's a 
>> real need for italics when writing plain text.
>> This is a workaround around a real problem of the lack of italics if 
>> I've ever seen one?

The two places above are displayed without styling in plaintext, probably because they are quoted. They show up styled in HTML because that contains additional tags (<b>,...).

> Actually, simple markup conventions like that mostly date from early 
> days of email, when plain text (and usually just ASCII at that) were 
> all you got. (By the way, the most usual interpretation of those is 
> _underscore_, /italic/, and *bold*, but whatever.)

These show up styled in plaintext display, but not in HTML, presumably because Ken entered the styling characters by hand (in the HTML version, there is no markup).

> Nowadays, presto chango, most email clients support rich text (in 
> HTML, usually), and you get to _underscore_, /italicize/, and *bold* 
> your text correctly whenever you want to, and even change the font 
> size to SHOUT, if you want.

These show up styled in HTML, most probably because Ken used the text editor to style them that way. The plaintext version contains ASCII email styling characters (but the HTML version doesn't), and my guess is that they were added when the mailer produced the plaintext version.

Your mailer and your mileage may vary.
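
As a rough illustration of what a client that interprets these conventions might be doing, here is a minimal Python sketch. It assumes the mapping Ken describes (_underscore_, /italic/, *bold*) and a simple word-boundary heuristic; the function name render_ascii_styling and the choice of <u>, <i> and <b> tags are hypothetical, not the behaviour of any particular mailer.

```python
import html
import re


def _rule(marker: str, tag: str):
    """Build (pattern, replacement) for one ASCII styling marker."""
    m = re.escape(marker)
    # The marker must not touch a word character (or another marker) on its
    # outer side, and the styled run must start and end with non-whitespace.
    pattern = re.compile(rf"(?<![\w{m}]){m}(\S(?:.*?\S)?){m}(?![\w{m}])")
    return pattern, rf"<{tag}>\1</{tag}>"


# Italic goes first so that the "/" in generated closing tags is never
# re-examined by the italic rule.
_RULES = [_rule("/", "i"), _rule("_", "u"), _rule("*", "b")]


def render_ascii_styling(text: str) -> str:
    """Escape the text for HTML, then turn /x/, _x_ and *x* into tags."""
    out = html.escape(text, quote=False)
    for pattern, replacement in _RULES:
        out = pattern.sub(replacement, out)
    return out


print(render_ascii_styling("_underscore_, /italic/, and *bold*"))
```

A heuristic like this will misfire on some input (file paths, for instance), which is one reason different clients can be expected to disagree.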

Regards,   Martin.

> Some folks here seem to be viewing the "problem" here the wrong way 
> round. The issue isn't that plain text cannot preserve all the "meaning"
> conveyed in writing systems. When dealing with meaning conveyed with 
> conventions that involve styling, font change, color and such, you 
> simply depend on properly tiered text architecture and build support 
> for that in rich text and markup. It is ass-backwards to try to 
> continue to clot up plain text as the backbone of text interchange by 
> trying to import all the complications of styling directly into it as 
> if that representation were a plain text issue -- it isn't.
> 
> Instead the *real* problem here is that in some communication contexts 
> that should be supporting rich text, implementations are still 
> restricting people to plain text when what they really want is easily 
> accessible and dependable rich text to convey more nuances accurately 
> (or just to be more expressive). If Twitter is half-assed about 
> supporting text styling, then direct your concerns in the proper 
> direction. You don't fix Twitter's or texting apps' use of text by 
> trying to force styling into the Unicode encoding of plain text.
> 
> --Ken


From doug at ewellic.org  Wed Dec 23 16:40:43 2020
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 23 Dec 2020 15:40:43 -0700
Subject: Italics get used to express important semantic meaning, so unicode
 should support them
Message-ID: <20201223154043.665a7a7059d7ee80bb4d670165c8327d.9832ff6ba9.wbe@email15.godaddy.com>

Replying to a bunch of messages at once; the impending holidays and that
have limited my available time for extended posts. Some of these topics
may be “resolved” by now, so enjoy the nostalgia.
 
Sławomir Osipiuk wrote:
 
>> All TAG symbols placed between a U+E003C TAG LESS-THAN SIGN and a
>> U+E003E TAG GREATER-THAN SIGN, inclusive, are to be treated as if
>> they were the corresponding ASCII characters, and run that through
>> an HTML renderer. I guess if you wanted you could stipulate some
>> reduced or restricted subset of HTML
>
> I've been informed off-list that BabelPad uses this as a formatting
> option. So, it's been done.
 
I do use this feature in BabelPad at times -- in fact, just today while
copying the Unicode subsection on “plain text” from Section 2.2 of
the PDF, and not feeling inclined at the moment to open Word. But it’s
a bit like using PUA characters, or even SCSU: I know this usage is not
part of the standard and unlikely to be supported by anything else, so
absent an explicit agreement, I’d better keep it to myself.
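
For readers who want to see the mechanics, the tag-to-ASCII step of the quoted suggestion can be sketched in Python as below. This is only a sketch under assumptions: the function names are hypothetical, it relies on the fact that each tag character U+E0020..U+E007E sits exactly 0xE0000 above its ASCII counterpart, and it stops at recovering the ASCII markup (handing the result to an HTML renderer is left out). It does not claim to reproduce what BabelPad actually does.

```python
TAG_OFFSET = 0xE0000   # tag character = ASCII code point + 0xE0000
TAG_LT = 0xE003C       # TAG LESS-THAN SIGN
TAG_GT = 0xE003E       # TAG GREATER-THAN SIGN


def encode_tag(ascii_markup: str) -> str:
    """Hypothetical helper: shift printable ASCII markup into tag characters."""
    return "".join(chr(ord(c) + TAG_OFFSET) for c in ascii_markup)


def decode_tag_markup(text: str) -> str:
    """Map tag characters between TAG '<' and TAG '>' (inclusive) to ASCII."""
    out = []
    inside = False
    for ch in text:
        cp = ord(ch)
        is_tag_char = 0xE0020 <= cp <= 0xE007E
        if is_tag_char and (inside or cp == TAG_LT):
            out.append(chr(cp - TAG_OFFSET))
            inside = cp != TAG_GT   # the closing TAG '>' ends the run
        else:
            out.append(ch)          # ordinary text and stray tag characters pass through
    return "".join(out)


sample = "plain " + encode_tag("<i>") + "text" + encode_tag("</i>")
print(decode_tag_markup(sample))
```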
 
> My guiding example is, "record fully the story text of a paperback
> novel".
 
So here is the salient part I gathered from the TUS definition, with
BabelPad formatting (hee hee) removed. Apologies if this passage is too
lengthy to qualify as fair use:
 
<quote>
Plain text represents character content only, not its appearance. It can
be displayed in a variety of ways and requires a rendering process to
make it visible with a particular appearance. If the same plain text
sequence is given to disparate rendering processes, there is no
expectation that rendered text in each instance should have the same
appearance. Instead, the disparate rendering processes are simply
required to make the text legible according to the intended reading.
This legibility criterion constrains the range of possible appearances.
The relationship between appearance and content of plain text may be
summarized as follows:
 
Plain text must contain enough information to permit the text to be
rendered legibly, and nothing more.
</quote>
 
The emphasis on “legibility” seems important here. Despite the focus
on “semantic meaning” in this thread, neither of those words appears
anywhere in the TUS definition of plain text.
 
Kent Karlsson wrote:
 
> Now, where did I see something very much like [Sławomir’s original
> suggestion with U+E0002 FORMAT TAG]??? 
>
> Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very
> close (especially the “invisible by default” (“default ignorable”) IF
> parsed correctly). And… ECMA-48 is already a standard.
 
Perhaps surprisingly, or perhaps not, ECMA-48 is actually my favorite
mechanism for low-level styling of plain text, mostly for the reasons
Kent cites here and elsewhere: it’s lightweight, it’s been a
standard for a long time, and it’s already in extensive use by at
least one sector of text processing.
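
To make "lightweight" concrete, here is a small Python sketch of ECMA-48 styling using SGR (Select Graphic Rendition) control sequences in their 7-bit form, ESC [ ... m. The helper names are hypothetical; the parameter values used (1 bold, 3 italic, 4 underline, and 22/23/24 to switch each off again) are the ones defined for SGR in ECMA-48. Whether a given terminal honours them, italic in particular, is a separate matter.

```python
CSI = "\x1b["   # 7-bit form of the Control Sequence Introducer: ESC [


def sgr(*params: int) -> str:
    """Build an SGR control sequence such as ESC [ 3 m."""
    return CSI + ";".join(str(p) for p in params) + "m"


def bold(text: str) -> str:
    return sgr(1) + text + sgr(22)    # 22 = normal intensity


def italic(text: str) -> str:
    return sgr(3) + text + sgr(23)    # 23 = not italicized


def underline(text: str) -> str:
    return sgr(4) + text + sgr(24)    # 24 = not underlined


print("This is " + italic("really") + " " + bold("important") + ".")
```

Each styled word costs eight or nine extra characters, and a consumer that does not understand the sequences can strip them with a simple scan.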
 
Kent might be one of the surprised ones, because I haven’t been a fan
of some of the “updates” to ECMA-48 that he has recommended, in
particular those that I feel extend, restrict, or invent too much. But I
like the standard in general, and some modest amount of updating is
probably inevitable to keep it current.
 
Sławomir:
 
> I didn't mention it because it's a bit outdated
 
“Outdated” is just generally a big red flag for me. If a standard
doesn’t meet modern needs, and can’t reasonably be made to do so,
that’s one thing, but the fact that it was developed some arbitrary
number of years ago is not something I care about. Unicode itself is
about 30 years old and I hope nobody sees that as evidence it needs
imminent replacing.
 
> But that "if parsed correctly" is quite the nit, isn't it?
 
This is true for any such mechanism. I remember early HTML authors being
upset when browsers stopped accepting <b>text <i>like</b> this</i>. Some
of the emoji mechanisms involving combinations of ZWJ, variation
selectors, Fitzpatrick swatches, and toupees might boggle some
implementers’ minds, but to play the game, you’ve got to learn the
rules.
 
David Starner wrote:
 
> ECMA-48 is not plain text.
 
Exactly so, but it’s a VERY thin layer above plain text, which is part
of what I like about it.
 
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
 

From doug at ewellic.org  Wed Dec 23 17:59:59 2020
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 23 Dec 2020 16:59:59 -0700
Subject: Is there a difference between converting a string of ASCII digits
 to an
 integer versus a string of non-ASCII digits to an =?UTF-8?Q?integer=3F?=
Message-ID: <20201223165959.665a7a7059d7ee80bb4d670165c8327d.c68e32ad5b.wbe@email15.godaddy.com>

Richard Wordingham wrote:
 
>> I suggest you double-check about the RTL digits (N'Ko & Adlam);
>> please take a look at the relevant Unicode book chapters.
>
> It looks as though the N'ko section documents the significance by
> accident!  I thought a policy was going to be documented (2012 or
> slightly later) that decimal digits are stored most significant
> digit first, but that doesn't seem to have happened.
 
It happened for N’Ko anyway:
 
“N’Ko uses decimal digits specific to the script. These digits have
strong right-to-left directionality. Numbers are stored in text in
logical order with most significant digit first; when displayed,
numerals are then laid out in right-to-left order, with the most
significant digit at the rightmost side, as illustrated for the numeral
144 in Figure 19-3. This situation differs from how numerals are handled
in Hebrew and Arabic, where numerals are laid out in left-to-right
order, even though the overall text direction is right to left.”
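
The storage rule in that passage can be illustrated with a short Python sketch. It assumes only the standard unicodedata module; the function name value_from_logical_order is hypothetical. The numeral 144 is stored as NKO DIGIT ONE, FOUR, FOUR (U+07C1 U+07C4 U+07C4), most significant digit first, and it is the display layer that puts the ONE at the right-hand side.

```python
import unicodedata


def value_from_logical_order(digits: str) -> int:
    """Accumulate decimal digits in storage order, most significant first."""
    value = 0
    for ch in digits:
        value = value * 10 + unicodedata.decimal(ch)
    return value


nko_144 = "\u07C1\u07C4\u07C4"            # stored order: 1, 4, 4
arabic_indic_144 = "\u0661\u0664\u0664"   # same stored order

print(value_from_logical_order(nko_144))
print(value_from_logical_order(arabic_indic_144))

# The difference is in the bidirectional class, which affects layout only:
# N'Ko digits are strong right-to-left, Arabic-Indic digits are "Arabic Number".
print(unicodedata.bidirectional("\u07C1"), unicodedata.bidirectional("\u0661"))
```

The conversion code is identical for both scripts; nothing in it needs to know which way the digits will be laid out.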
 
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
 

From doug at ewellic.org  Wed Dec 23 18:16:44 2020
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 23 Dec 2020 17:16:44 -0700
Subject: Is there a difference between converting a string of ASCII digits
 to an
 integer versus a string of non-ASCII digits to an =?UTF-8?Q?integer=3F?=
Message-ID: <20201223171644.665a7a7059d7ee80bb4d670165c8327d.ee5c25e810.wbe@email15.godaddy.com>

>> I thought a policy was going to be documented (2012 or
>> slightly later) that decimal digits are stored most significant
>> digit first, but that doesn't seem to have happened.
>
> It happened for N’Ko anyway:

Ohh, you mean a formal policy of the kind found on
https://www.unicode.org/policies/policies.html .
 
No, there doesn’t appear to be such a policy, although there also
don’t appear to be any sets of decimal digits that deviate from it.

--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
 

From doug at ewellic.org  Wed Dec 23 18:42:10 2020
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 23 Dec 2020 17:42:10 -0700
Subject: =?UTF-8?Q?=31=CB=A2=E1=B5=97=2C=20=32=E2=81=BF=E1=B5=88=2C=20=33?=
 =?UTF-8?Q?=CA=B3=E1=B5=88=2C=20=34=E1=B5=97=CA=B0=20=E2=80=A6=20=39?=
 =?UTF-8?Q?=E1=B5=97=CA=B0?=
Message-ID: <20201223174210.665a7a7059d7ee80bb4d670165c8327d.b4f303433f.wbe@email15.godaddy.com>

Fredrick Brennan wrote:
 
> With Unicode superscript lowercase letters, dates with superscript
> ordinal indicators in English can be written in plaintext, e.g.:
>
> 1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so
> on.
>
> [...]
>
> However, I have a feeling that this use is an abuse of the standard,
> but that brings up an interesting comparison with the ordinal
> indicators for Spanish, Portuguese (& other languages?), the masculine
> º and the feminine ª.
>
> If anyone has time to answer, why is one an abuse and the other not,
> if indeed 1ˢᵗ is an abuse as I think?
 
I suppose it is, and the best answer to “why” is definitional:
because º and ª were encoded (in legacy standards, and consequently
brought into Unicode) for the purpose of being ordinal indicators,
whereas ˢ and ᵗ and ⁿ and ᵈ and ʳ and ʰ were encoded for the
purpose of being phonetic modifiers.  (Even ⁿ, encoded alongside the
superscript digits, “functions as a modifier letter” according to
the note in the code chart.)
 
I know that 1st and 2nd and 3rd and 4th (no superscripts) are generally
considered legible in English (back to the “plain text is for
legibility” definition). I don’t know if 1o and 2a are considered
equally legible in Spanish and Portuguese; if they are not, that might
help explain why dedicated characters for º and ª were prioritized in
earlier character sets.
 
There are two types of people: those who are bothered by “Unicode
abuse” and those who are not.
 
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
 

From kent.b.karlsson at bahnhof.se  Wed Dec 23 19:20:42 2020
From: kent.b.karlsson at bahnhof.se (Kent Karlsson)
Date: Thu, 24 Dec 2020 02:20:42 +0100
Subject: =?utf-8?B?UmU6IDHLouG1lywgMuKBv+G1iCwgM8qz4bWILCA04bWXyrAg4oCm?=
 =?utf-8?B?IDnhtZfKsA==?=
In-Reply-To: <20201223174210.665a7a7059d7ee80bb4d670165c8327d.b4f303433f.wbe@email15.godaddy.com>
References: <20201223174210.665a7a7059d7ee80bb4d670165c8327d.b4f303433f.wbe@email15.godaddy.com>
Message-ID: <C53B36AB-18A2-4C25-97A7-5653C216F25B@bahnhof.se>


> 24 dec. 2020 kl. 01:42 skrev Doug Ewell via Unicode <unicode at unicode.org>:
> 
> Fredrick Brennan wrote:
> 
>> With Unicode superscript lowercase letters, dates with superscript
>> ordinal indicators in English can be written in plaintext, e.g.:
>> 
>> 1ˢᵗ of January, 2ⁿᵈ of February, 3ʳᵈ of March, 4ᵗʰ of April, and so
>> on.
>> 
>> [...]
>> 
>> However, I have a feeling that this use is an abuse of the standard,
>> but that brings up an interesting comparison with the ordinal
>> indicators for Spanish, Portuguese (& other languages?), the masculine
>> º and the feminine ª.
>> 
>> If anyone has time to answer, why is one an abuse and the other not,
>> if indeed 1ˢᵗ is an abuse as I think?
> 
> I suppose it is, and the best answer to “why” is definitional:
> because º and ª were encoded (in legacy standards, and consequently
> brought into Unicode) for the purpose of being ordinal indicators,
> whereas ˢ and ᵗ and ⁿ and ᵈ and ʳ and ʰ were encoded for the
> purpose of being phonetic modifiers.  (Even ⁿ, encoded alongside the
> superscript digits, “functions as a modifier letter” according to
> the note in the code chart.)
> 
> I know that 1st and 2nd and 3rd and 4th (no superscripts) are generally
> considered legible in English (back to the “plain text is for
> legibility” definition). I don’t know if 1o and 2a are considered
> equally legible in Spanish and Portuguese;

I think they are. At least it is not uncommon to write them without
superscripting them, and I don’t think that causes any confusion.

> if they are not, that might
> help explain why dedicated characters for º and ª were prioritized in
> earlier character sets.

There may be some stronger preference to superscript them, but
not more than that. (And that France did not insist on œ/Œ in Latin-1?)

Note that superscript o and superscript a are doubly encoded.
AFAICT, I think the explanation for that is the following:
The ordinal indicators are optionally underlined (varies by font) at the
superscript level, whereas the modifier letters are not underlined.
(And I know of no current styling mechanism, or font feature, to underline
them at the superscript level; underlining would underline them at the
normal letter baseline level.)
 
> There are two types of people: those who are bothered by “Unicode
> abuse” and those who are not.


Nit: I submitted to CLDR RBNF rules for numeric ordinals in several languages
using the superscript letters several years ago. After a year or two the CLDR
committee replaced the superscript letters by ordinary letters, citing lack of
(consistent) font support for the superscript letters. Even now, looking at this
email, I see superscript letters of inconsistent sizes and positions, and some
superscript letters (even if only looking for a-z) might not be supported in “all” fonts.

/Kent K

> --
> Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
> 
> 
> 


From richard.wordingham at ntlworld.com  Thu Dec 24 09:50:29 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 24 Dec 2020 15:50:29 +0000
Subject: Is there a difference between converting a string of ASCII
 digits to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To: <20201223165959.665a7a7059d7ee80bb4d670165c8327d.c68e32ad5b.wbe@email15.godaddy.com>
References: <20201223165959.665a7a7059d7ee80bb4d670165c8327d.c68e32ad5b.wbe@email15.godaddy.com>
Message-ID: <20201224155029.39e3f212@JRWUBU2>

On Wed, 23 Dec 2020 16:59:59 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham wrote:
>  
> >> I suggest you double-check about the RTL digits (N'Ko & Adlam);
> >> please take a look at the relevant Unicode book chapters.  
> >
> > It looks as though the N'ko section documents the significance by
> > accident!  I thought a policy was going to be documented (2012 or
> > slightly later) that decimal digits are stored most significant
> > digit first, but that doesn't seem to have happened.  
>  
> It happened for N’Ko anyway:
>  
> “N’Ko uses decimal digits specific to the script. These digits have
> strong right-to-left directionality. Numbers are stored in text in
> logical order with most significant digit first; when displayed,
> numerals are then laid out in right-to-left order, with the most
> significant digit at the rightmost side, as illustrated for the
> numeral 144 in Figure 19-3. This situation differs from how numerals
> are handled in Hebrew and Arabic, where numerals are laid out in
> left-to-right order, even though the overall text direction is right
> to left.”

As you later noted, the third sentence expresses not a policy, but a rule for
N'ko 'decimal digits'.

The last sentence is simply appalling:

1. Hebrew numerals are written with the most significant element on the
right.  For Unicode, what is significant is that as the elements
are letters, they follow the normal presentation rule for sequences of
Hebrew letters.

2. I would expect the components of Arabic letter numerals to follow
the same rules as when the elements are being used as letters.  I can
find examples of both biggest first and smallest first.

3. The 'decimal digits' for Arabic 'five and twenty' are laid out in the
order sounded, i.e. the digit 5 is on the right and the digit 2 is on
the left.  As with N'ko, the most significant digit is stored first.

Richard.


From mark at kli.org  Thu Dec 24 10:41:36 2020
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 24 Dec 2020 11:41:36 -0500
Subject: Is there a difference between converting a string of ASCII digits
 to an integer versus a string of non-ASCII digits to an integer?
In-Reply-To: <20201224155029.39e3f212@JRWUBU2>
References: <20201223165959.665a7a7059d7ee80bb4d670165c8327d.c68e32ad5b.wbe@email15.godaddy.com>
 <20201224155029.39e3f212@JRWUBU2>
Message-ID: <3552ad0b-60d1-11a9-0293-12dd07e50eab@kli.org>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201224/05cc15a1/attachment.htm>

From indolering at gmail.com  Tue Dec 29 12:37:41 2020
From: indolering at gmail.com (Zach Lym)
Date: Tue, 29 Dec 2020 10:37:41 -0800
Subject: =?UTF-8?Q?Re=3A_Unicode_is_universal=2C_so_how_come_that_universal?=
 =?UTF-8?Q?ity_doesn=E2=80=99t_apply_to_digits=3F?=
In-Reply-To: <CAFYQx+BY9_NhyLrsi6k1x+W0tu_fe-BRkXF68A4DKjxcVDoCzQ@mail.gmail.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
 <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>
 <CABWuLVc7yzeYDgjgGbQQF5apNChxsSVygG8Q6f-zijzeevNm-g@mail.gmail.com>
 <CAFYQx+BY9_NhyLrsi6k1x+W0tu_fe-BRkXF68A4DKjxcVDoCzQ@mail.gmail.com>
Message-ID: <CABWuLVcx6cxcx8GaLGkDcBMTCXLbdfZP3AY_GzNMhQr6cMX25Q@mail.gmail.com>

Trying to reboot this conversation, what *demand* is there for supporting
non-Latin digits?

AFAICT [1], most literate adults use Latin digits (0-9) for basic math in
North America, South America, Europe, Australia, and countries using a CJK
script.  Online stores in the UAE and India list prices in Latin digits,
and license plates there use them too.  By process of elimination [2], the
population of language groups that don't use Latin numerals drops to
<100 million.

If those numbers are accurate, then there isn't enough of a critical mass
to justify the implementation effort.  Not if you aren't also going to
translate keywords....

Thank you,
-Zach Lym

[1]:
https://linguistics.stackexchange.com/questions/37899/how-prevalent-are-western-hindu-arabic-numerals-digits-0-9-in-cultures-with-si?noredirect=1#comment87069_37899
[2]:
https://en.wikipedia.org/wiki/List_of_writing_systems#List_of_writing_scripts_by_adoption

On Wed, Dec 23, 2020 at 7:30 AM Steven R. Loomis <srl295 at gmail.com> wrote:

> For much more on localized numbers, see CLDR,
> https://www.unicode.org/reports/tr35/tr35-numbers.html#Contents
>
> -s
>
> On Sun, Dec 20, 2020 at 1:57 PM Zach Lym via Unicode <unicode at unicode.org>
> wrote:
>
>> I don't think it's fair to dismiss this as "not a unicode problem."  As
>> the OP pointed out, support for non-latin variable names is largely due to
>> Unicode's identity standard and extensive implementation advice.
>>
>> The section on numbering (5.5) is only a page long and
>> essentially recommends handling decimal based numbering systems.  There
>> isn't nearly as much care given to this topic.  There is a standard annex
>> on mathematics, but that is in PDF form and is largely concerned with
>> parsing and display of mathematical formulas.
>>
>> However, as is the answer to most questions, it is a matter of time and
>> money. If someone is willing to spend the time expanding 5.5 or writing a new
>> annex, I am sure the Unicode committee would be happy to review it.  Would
>> you be interested in doing that legwork?
>>
>> I'm actually pretty new here, what's the best way Roger could contribute
>> to make Unicode better in this regard?
>>
>> Thanks,
>> -Zach Lym
>>
>> On Wed, Dec 16, 2020 at 5:23 PM Mark E. Shoulson via Unicode <
>> unicode at unicode.org> wrote:
>>
>>> On 12/16/20 10:40 AM, Doug Ewell via Unicode wrote:
>>>
>>> What I don't understand here is why this is being framed implicitly as a Unicode problem, or an XML problem, or a general law of nature ("why can’t a Bengali-speaking person use the Bengali digits"), instead of an inherent limitation of that particular library function used for that particular language.
>>>
>>> Yes, exactly.  This is "a characteristic of the code libraries, not a
>>> Unicode problem."
>>>
>>>
>>> There are probably reasonable reasons not to update the actual
>>> atol/strtol calls, but one could certainly write a library to do what
>>> you're talking about... and apparently someone has, by Bill Poser's report
>>> of his libuninum.  There ya go.
>>>
>>>
>>> ~mark
>>>
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201229/df06be52/attachment.htm>

From markus.icu at gmail.com  Tue Dec 29 13:28:23 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 29 Dec 2020 11:28:23 -0800
Subject: =?UTF-8?Q?Re=3A_Unicode_is_universal=2C_so_how_come_that_universal?=
 =?UTF-8?Q?ity_doesn=E2=80=99t_apply_to_digits=3F?=
In-Reply-To: <CABWuLVcx6cxcx8GaLGkDcBMTCXLbdfZP3AY_GzNMhQr6cMX25Q@mail.gmail.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
 <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>
 <CABWuLVc7yzeYDgjgGbQQF5apNChxsSVygG8Q6f-zijzeevNm-g@mail.gmail.com>
 <CAFYQx+BY9_NhyLrsi6k1x+W0tu_fe-BRkXF68A4DKjxcVDoCzQ@mail.gmail.com>
 <CABWuLVcx6cxcx8GaLGkDcBMTCXLbdfZP3AY_GzNMhQr6cMX25Q@mail.gmail.com>
Message-ID: <CAN49p6pewXKWnKS2ZhEA_Gm7ydV6H=144X9t2xuJPP_4-3GO=w@mail.gmail.com>

On Tue, Dec 29, 2020 at 10:41 AM Zach Lym via Unicode <unicode at unicode.org>
wrote:

> Trying to reboot this conversation, what *demand* is there for supporting
> non-latin digits?
>

There are hundreds of millions of people in the Arabic-speaking world, in &
near Iran, in parts of India, ... that routinely use and prefer their
native digits.

If those numbers are accurate, then there isn't enough of a critical mass
> to justify the implementation effort.
>

What effort? Given basic Unicode support in many programming languages and
libraries, it takes minutes to go from parsing ASCII digits to parsing any
& all decimal digits.
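
As a minimal sketch of how small that step can be, here is one way to do it in Python, assuming only the standard unicodedata module (the function name parse_any_decimal is hypothetical). Each character is mapped to its decimal digit value via the Unicode Character Database, and anything that is not a decimal digit is rejected.

```python
import unicodedata


def parse_any_decimal(text: str) -> int:
    """Parse a non-negative integer written with any Unicode decimal digits."""
    values = [unicodedata.decimal(ch, None) for ch in text]
    if not values or None in values:
        raise ValueError(f"not a string of decimal digits: {text!r}")
    # Normalize to ASCII digits, then reuse the ordinary ASCII parser.
    return int("".join(str(v) for v in values))


print(parse_any_decimal("2020"))                       # ASCII
print(parse_any_decimal("\u0662\u0660\u0662\u0660"))   # Arabic-Indic
print(parse_any_decimal("\u09E8\u09E6\u09E8\u09E6"))   # Bengali
```

Python's built-in int() already accepts such strings directly, so in that language the change can be even smaller. Note that this sketch happily accepts digits from several scripts mixed in one string; whether that should be allowed is a policy decision for the application.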

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201229/3cd5020e/attachment-0001.htm>

From jameskass at code2001.com  Tue Dec 29 13:30:00 2020
From: jameskass at code2001.com (James Kass)
Date: Tue, 29 Dec 2020 19:30:00 +0000
Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?=
 =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?=
In-Reply-To: <CABWuLVcx6cxcx8GaLGkDcBMTCXLbdfZP3AY_GzNMhQr6cMX25Q@mail.gmail.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
 <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>
 <CABWuLVc7yzeYDgjgGbQQF5apNChxsSVygG8Q6f-zijzeevNm-g@mail.gmail.com>
 <CAFYQx+BY9_NhyLrsi6k1x+W0tu_fe-BRkXF68A4DKjxcVDoCzQ@mail.gmail.com>
 <CABWuLVcx6cxcx8GaLGkDcBMTCXLbdfZP3AY_GzNMhQr6cMX25Q@mail.gmail.com>
Message-ID: <ca07b8e3-9660-5c06-5c4b-57cc29b996fe@code2001.com>


On 2020-12-29 6:37 PM, Zach Lym via Unicode wrote:
> If those numbers are accurate, then there isn't enough of a critical mass
> to justify the implementation effort.  Not if you aren't also going to
> translate keywords....

Figure 4 from N5076 shows an Adlam calculator app.

https://unicode.org/wg2/docs/n5076-19119r-adlam-font-repl.pdf

Third party developers exist and members of less used writing systems 
learn to code. Even in the absence of critical mass, gaps get filled 
and needs get met.

Unicode’s rôle is to provide a standard means of exchanging and storing 
non-Western digits as well as assigning properties and so forth to 
them. Libraries such as CLDR exist to help implementers, and members of 
the actual user communities seem happy to help with keyword translation.

From richard.wordingham at ntlworld.com  Tue Dec 29 13:58:05 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 29 Dec 2020 19:58:05 +0000
Subject: Unicode is universal, so how come that universality
 =?UTF-8?B?ZG9lc27igJl0?= apply to digits?
In-Reply-To: <CAN49p6pewXKWnKS2ZhEA_Gm7ydV6H=144X9t2xuJPP_4-3GO=w@mail.gmail.com>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
 <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>
 <CABWuLVc7yzeYDgjgGbQQF5apNChxsSVygG8Q6f-zijzeevNm-g@mail.gmail.com>
 <CAFYQx+BY9_NhyLrsi6k1x+W0tu_fe-BRkXF68A4DKjxcVDoCzQ@mail.gmail.com>
 <CABWuLVcx6cxcx8GaLGkDcBMTCXLbdfZP3AY_GzNMhQr6cMX25Q@mail.gmail.com>
 <CAN49p6pewXKWnKS2ZhEA_Gm7ydV6H=144X9t2xuJPP_4-3GO=w@mail.gmail.com>
Message-ID: <20201229195805.4c45425c@JRWUBU2>

On Tue, 29 Dec 2020 11:28:23 -0800
Markus Scherer via Unicode <unicode at unicode.org> wrote:

> What effort? Given basic Unicode support in many programming
> languages and libraries, it takes minutes to go from parsing ASCII
> digits to parsing any & all decimal digits.

I think you've overlooked the paperwork.

There's probably code that relies on non-ASCII digits not being treated
the same way as ASCII digits.

Richard.

From jameskass at code2001.com  Tue Dec 29 14:01:25 2020
From: jameskass at code2001.com (James Kass)
Date: Tue, 29 Dec 2020 20:01:25 +0000
Subject: Adlam
In-Reply-To: <176791270b7.10fb4e49546040.1392071428347243955@kittens.ph>
References: <176791270b7.10fb4e49546040.1392071428347243955@kittens.ph>
Message-ID: <3a4f6836-f18f-d495-f410-25750a608712@code2001.com>


On 2020-12-19 3:37 AM, Fredrick Brennan via Unicode wrote:
> Often when other scripts are discussed, Adlam is used comparatively, as in, "well we want to avoid what happened with Adlam", or "we have had painful experience with this with Adlam".
> I know some of the issues in Adlam, but if anyone has the time, I (and hopefully others!) would benefit from a retelling of the "Adlam in Unicode" story. I know in the end it's a very happy story, but I'm especially curious about the bumps along the road.
>
> Best,
> Fred Brennan

(I would also welcome some insider insight on this history.)

Perusing this document might give a rough sketch:

https://unicode.org/wg2/docs/n5076-19119r-adlam-font-repl.pdf

Briefly, Adlam is a dynamic and developing writing system. After 
publication, shapes of many of the letter forms were changed. Which 
means that revisions were needed not only in on-line charts, but also 
the fonts used to produce those charts, and software bundles containing 
those fonts. As any programmer knows, revision can be costly.

This should be regarded as “par for the course” when striving to support 
a developing writing system.


From asmusf at ix.netcom.com  Tue Dec 29 15:08:33 2020
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 29 Dec 2020 13:08:33 -0800
Subject: =?UTF-8?Q?Re=3a_Unicode_is_universal=2c_so_how_come_that_universali?=
 =?UTF-8?Q?ty_doesn=e2=80=99t_apply_to_digits=3f?=
In-Reply-To: <20201229195805.4c45425c@JRWUBU2>
References: <SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com>
 <000201d6d3c1$c13ede40$43bc9ac0$@ewellic.org>
 <94e3f83b-ef13-b442-c5ff-c827211194fa@kli.org>
 <CABWuLVc7yzeYDgjgGbQQF5apNChxsSVygG8Q6f-zijzeevNm-g@mail.gmail.com>
 <CAFYQx+BY9_NhyLrsi6k1x+W0tu_fe-BRkXF68A4DKjxcVDoCzQ@mail.gmail.com>
 <CABWuLVcx6cxcx8GaLGkDcBMTCXLbdfZP3AY_GzNMhQr6cMX25Q@mail.gmail.com>
 <CAN49p6pewXKWnKS2ZhEA_Gm7ydV6H=144X9t2xuJPP_4-3GO=w@mail.gmail.com>
 <20201229195805.4c45425c@JRWUBU2>
Message-ID: <47370a52-bdc7-fab2-7458-8334f8fd8bee@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201229/aa22bf51/attachment.htm>

From marcelpauluk at ufpr.br  Wed Dec 30 14:37:29 2020
From: marcelpauluk at ufpr.br (Prof. Pauluk)
Date: Wed, 30 Dec 2020 17:37:29 -0300
Subject: =?UTF-8?Q?Origins_of_=E2=8C=9A_U=2B231A_WATCH_and_=E2=8C=9B_U=2B231B_HOURGLASS?=
Message-ID: <CANzej34uBKVCLSh0DrUNLS_Vb3nu+59YoT-GjoRjSH-oz0s=aw@mail.gmail.com>

Hello everyone,

I am trying to do some kind of "provenance history" of a series of
proleptic emoji, and I am stuck with these two here: ⌚ U+231A WATCH and ⌛
U+231B HOURGLASS.

While, for example, ⌫ U+232B ERASE TO THE LEFT and ⌨ U+2328 KEYBOARD could
be easily seen as motivated by the symbols 2023
<https://www.iso.org/obp/ui#iso:grs:7000:2023> BACKWARD ERASE and 5991
<https://www.iso.org/obp/ui#iec:grs:60417:5991> KEYBOARD from ISO7000/
IEC60417 Graphical Symbols for Use on Equipment, it is difficult to
see ⌚ U+231A WATCH and ⌛ U+231B HOURGLASS being originated from, let's say,
5184 <https://www.iso.org/obp/ui#iec:grs:60417:5184> CLOCK and 1366
<https://www.iso.org/obp/ui#iso:grs:7000:1366> ELAPSED OPERATING HOURS from
the same standard. I am well acquainted with ISO/TC145 Graphical Symbols
collection of signs, and I am almost sure that those two symbols didn't
come from any technical standard from ISO or IEC.

Does anyone remember why these two Miscellaneous Technical Symbols were
added, back then in the 1990s? Could it be because of Xerox Star/ Apple
Lisa's HOURGLASS and Susan Kare's WRISTWATCH icon for the 1984 Macintosh?

Regards,
Marcel Pauluk
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201230/fa2d315a/attachment.htm>

From abrahamgross at disroot.org  Wed Dec 30 18:18:51 2020
From: abrahamgross at disroot.org (abrahamgross at disroot.org)
Date: Thu, 31 Dec 2020 00:18:51 +0000 (UTC)
Subject: =?UTF-8?Q?Origins_of_=E2=8C=9A_U+231A_WATC?=
 =?UTF-8?Q?H_and_=E2=8C=9B_U+231B_HOURGLASS?=
In-Reply-To: <CANzej34uBKVCLSh0DrUNLS_Vb3nu+59YoT-GjoRjSH-oz0s=aw@mail.gmail.com>
References: <CANzej34uBKVCLSh0DrUNLS_Vb3nu+59YoT-GjoRjSH-oz0s=aw@mail.gmail.com>
Message-ID: <ab8bb157-0303-451a-b76f-c6e79a8005c0@disroot.org>

I'd assume these emoji are from the original Japanese set
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20201231/9a43f089/attachment.htm>

From kenwhistler at sonic.net  Wed Dec 30 19:50:23 2020
From: kenwhistler at sonic.net (Ken Whistler)
Date: Wed, 30 Dec 2020 17:50:23 -0800
Subject: =?UTF-8?Q?Re=3a_Origins_of_=e2=8c=9a_U+231A_WATCH_and_=e2=8c=9b_U+2?=
 =?UTF-8?Q?31B_HOURGLASS?=
In-Reply-To: <ab8bb157-0303-451a-b76f-c6e79a8005c0@disroot.org>
References: <CANzej34uBKVCLSh0DrUNLS_Vb3nu+59YoT-GjoRjSH-oz0s=aw@mail.gmail.com>
 <ab8bb157-0303-451a-b76f-c6e79a8005c0@disroot.org>
Message-ID: <e5eb715a-5800-f0b9-ab82-e4d58830315e@sonic.net>

Nope. Check their Age (see DerivedAge.txt in the UCD). Their Age is 1.1. 
And in fact, they go back even further -- they were published in Unicode 
1.0 in 1991. They predate the Japanese telcom vendor sets that were 
incorporated in Unicode 6.0 in 2010.
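
The Age check mentioned above can be sketched roughly as follows. This is only an illustration: the function names `parse_derived_age` and `age_of` are hypothetical, and `SAMPLE` is a small hand-written excerpt in the DerivedAge.txt line format rather than the real file, which should be consulted for authoritative ranges.

```python
# Hand-written sample in the DerivedAge.txt line format (an assumption for
# illustration; the real file in the UCD uses wider ranges and more entries).
SAMPLE = """\
# code point or range ; age # comment
231A..231B    ; 1.1 #   [2] WATCH..HOURGLASS
1F600         ; 6.1 #       GRINNING FACE
"""


def parse_derived_age(text):
    """Return a list of (first, last, age) tuples from DerivedAge.txt-style text."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue  # skip blank and comment-only lines
        cp_field, age = (part.strip() for part in data.split(";"))
        first, _, last = cp_field.partition("..")
        # A single code point has no "..", so it is its own range end.
        ranges.append((int(first, 16), int(last or first, 16), age))
    return ranges


def age_of(code_point, ranges):
    """Return the age string for a code point, or None if it is not listed."""
    for first, last, age in ranges:
        if first <= code_point <= last:
            return age
    return None


ranges = parse_derived_age(SAMPLE)
print(age_of(0x231A, ranges))  # WATCH
print(age_of(0x231B, ranges))  # HOURGLASS
```

With the real DerivedAge.txt read from disk in place of `SAMPLE`, the same lookup would report the Age of any assigned code point.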

They were later mapped to KDDI and DoCoMo emoji in 2007 (see L2/07-257), 
so WATCH and HOURGLASS did exist in those sets, but that wasn't their 
original source for encoding in Unicode.

I don't think they were in XCCS (the Xerox character set) or in IBM 
sets. They might have been picked up as well-known computer interface 
symbols from the 80's.

--Ken

On 12/30/2020 4:18 PM, abrahamgross--- via Unicode wrote:
> I'd assume these emoji are from the original Japanese set.

From marcelpauluk at ufpr.br  Wed Dec 30 20:45:29 2020
From: marcelpauluk at ufpr.br (M. Pauluk)
Date: Wed, 30 Dec 2020 23:45:29 -0300
Subject: =?UTF-8?Q?Re=3A_Origins_of_=E2=8C=9A_U=2B231A_WATCH_and_=E2=8C=9B_U=2B231B_HOURG?=
 =?UTF-8?Q?LASS?=
In-Reply-To: <e5eb715a-5800-f0b9-ab82-e4d58830315e@sonic.net>
References: <CANzej34uBKVCLSh0DrUNLS_Vb3nu+59YoT-GjoRjSH-oz0s=aw@mail.gmail.com>
 <ab8bb157-0303-451a-b76f-c6e79a8005c0@disroot.org>
 <e5eb715a-5800-f0b9-ab82-e4d58830315e@sonic.net>
Message-ID: <CANzej37UaR+T9j3rh0sAk27mGVp9tax4GV6ABrZU8iKf-pNVow@mail.gmail.com>

Thanks, Ken! I had already checked XCCS and the IBM code pages too; ⌚ U+231A
WATCH and ⌛ U+231B HOURGLASS really couldn't have originated there. Is
there any documentation of this selection process? I would also very much
like to know why some symbols like U+262E PEACE SYMBOL or U+2668 HOT
SPRINGS were added right from the beginning, before there was even any
kind of pressure to encode pictographs! Those initial blocks of symbols
remain the most obscure for me...

On Wed, Dec 30, 2020 at 10:56 PM Ken Whistler via Unicode <
unicode at unicode.org> wrote:

> Nope. Check their Age (see DerivedAge.txt in the UCD). Their Age is 1.1.
> And in fact, they go back even further -- they were published in Unicode
> 1.0 in 1991. They predate the Japanese telcom vendor sets that were
> incorporated in Unicode 6.0 in 2010.
>
> They were later mapped to KDDI and DoCoMo emoji in 2007 (see L2/07-257),
> so WATCH and HOURGLASS did exist in those sets, but that wasn't their
> original source for encoding in Unicode.
>
> I don't think they were in XCCS (the Xerox character set) or in IBM sets.
> They might have been picked up as well-known computer interface symbols
> from the 80's.
>
> --Ken
> On 12/30/2020 4:18 PM, abrahamgross--- via Unicode wrote:
>
> I'd assume these emoji are from the original Japanese set.
>
>