From unicode at unicode.org Mon Jan 7 02:13:37 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 7 Jan 2019 09:13:37 +0100 Subject: A last missing link for interoperable representation Message-ID: Previous discussions have already brought up how Unicode is supporting those languages that despite being old in Unicode still require special attention for their peculiar way of spacing punctuation or indicating abbreviations. Now I wonder whether s?t?r?e?s?s? can likewise be noted in plain text without non-traditional markup such as *?* or ?'? when a language does not accept extra acute accents for that purpose. One character we can think of is the combining underline. Like everything else?new letters, narrow no-break space, superscripts? the quality of the rendering depends on the fonts used on the computer. Strings containing U+0332 COMBINING LOW LINE to denote stress, as a replacement of italic, may be postprocessed to apply formatting, or used as-is if interoperability matters along with semantic accuracy. Best wishes, Marcel From unicode at unicode.org Mon Jan 7 21:46:57 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 8 Jan 2019 03:46:57 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: Message-ID: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> Living languages and writing systems evolve. Using the combining low line to show stress seems reasonable to me, perhaps because it was a typewriting convention I'm old enough to remember.? People unfamiliar with that convention should be able to figure out what's up from the c?o?n?t?e?x?t?.? Drawing a line under a word or a phrase certainly draws attention to it! (Apparently there's a recently evolved practice to use periods between words. To. Add. Emphasis.? Almost as if one is speaking v-e-r-y s-l-o-w-l-y in order to make a point.) End users probably consider the entire Unicode set to be their tool kit.? I've seen plain text screen names in both cursive and fraktur, thanks to the math alphanumerics.? The carefree user community seems unconcerned with the technical insistence that *those* characters should only be used in formulae. If, for example, ?????????????????? ?????????????????? can input her screen name in cursive, there's nothing stopping me from using ??????????????, if I'm so inclined. Making recommendations for the post processing of strings containing the combining low line strikes me as being outside the scope of Unicode, though.? Some users might prefer that such strings be rendered in *bold* and other users might prefer /italics/.? This user would prefer that combining low line always be rendered as combining low line. From unicode at unicode.org Mon Jan 7 23:32:41 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 7 Jan 2019 21:32:41 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> Message-ID: <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Jan 8 00:40:51 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 8 Jan 2019 07:40:51 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> Message-ID: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> On 08/01/2019 06:32, Asmus Freytag via Unicode wrote: > On 1/7/2019 7:46 PM, James Kass via Unicode wrote: >> Making recommendations for the post processing of strings containing the combining low line strikes me as being outside the scope of Unicode, though. > > Agreed. > > Those kinds of things are effectively "mark down" languages, a name chosen to define them as lighter weight alternatives to formal, especially SGML derived mark-up languages. > > Neither mark-up nor mark down languages are in scope. > My hinting about post processing was only a door open to those tagging my suggestion as a dirty hack. I was so anxious about angry feedback that I inverted the order of the two possible usages despite my preference for keeping the combining underline while using proper fonts, fully agreeing with James Kass. I was pointing that unlike rich text, enhanced capabilities of plain text do not hold the user captive. With rich text we need to stay in rich text, whereas the goal of this thread is to point ways of ensuring interoperability. The pitch is that if some languages are still considered ?needing? rich text where others are correctly represented in plain text (stress, abbreviations), the Standard needs to be updated in a way that it fully supports actually all languages. Having said that, still unsupported minority languages are top priority. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 8 01:18:10 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 7 Jan 2019 23:18:10 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 8 04:00:38 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 8 Jan 2019 10:00:38 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: Marcel Schneider wrote, > With rich text we need to stay in rich text, whereas the goal of > this thread is to point ways of ensuring interoperability. Both interoperability and legibility are factors.? The question might be:? How legible should Unicode be for Latin?barely legible, moderately legible, or extremely legible? The boundaries of plain text have advanced since the concept originated and will probably continue to do so.? Stress can currently be represented in plain text with conventions used in lieu of existing typographic practice.? 
Unicode can preserve texts created using the plain text kludges/conventions for marking stress, but cannot preserve printed texts which use standard publishing conventions for marking stress, such as italics. If Latin were a dead script being proposed for encoding now, it?s possible that certain script features currently considered to be merely stylistic variants best reserved for mark-up would be encoded atomically. Scripts added more recently to Unicode appear to have been encoded with the idea of preserving the standard writing and publishing conventions of the users.? It's only natural if some Latin script users want to push back the boundaries of Latin computer plain text accordingly. From unicode at unicode.org Tue Jan 8 15:11:07 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 8 Jan 2019 21:11:07 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: Asmus Freytag wrote, > ... > (for an extreme example there's an orthography > out there that uses @ as a letter -- we know that > won't work well with email addresses and duplicate > encoding of the @ shape is a complete non-starter). Everything's a non-starter.? Until it begins. Is this a casing orthography?? (Please see attached image.) We've seen where typewriter kludges enabled users to represent the glottal stop with a question mark (or a digit seven).? Unicode makes those kludges unnecessary. But we're still using typewriter kludges to represent stress in Latin script because there is no Unicode plain text solution. -------------- next part -------------- A non-text attachment was scrubbed... Name: NaturalCase.png Type: image/png Size: 3325 bytes Desc: not available URL: From unicode at unicode.org Tue Jan 8 15:28:46 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 8 Jan 2019 13:28:46 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: <51e22fd8-8478-d901-39e3-36f43d757eeb@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 8 15:43:08 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 8 Jan 2019 13:43:08 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> James, On 1/8/2019 1:11 PM, James Kass via Unicode wrote: > But we're still using typewriter kludges to represent stress in Latin > script because there is no Unicode plain text solution. O.k., that one needs a response. We are still using kludges to represent stress in the Latin script because *orthographies* for most languages customarily written with the Latin script don't have clear conventions for indicating stress as a part of the orthography. When an orthography has a well-developed convention for indicating stress, then we can look at how that convention is represented in the plain text representation of that orthography. 
An obvious case is notational systems for the representation of pronunciation of English words in dictionaries. Those conventions *do* then have plain text representations in Unicode, because, well, they just have various additional characters and/or combining marks to clearly indicate lexical stress. But standard written English orthography does *not*. (BTW, that is in part because marking stress in written English would usually *decrease* legibility and the usefulness of the writing, rather than improving it.) Furthermore, there is nothing inherent about *stress* per se in the Latin script (or any other script, for that matter). Lexical stress is a phonological system, not shared or structured the same way in all languages. And there are *thousands* of languages written with the Latin script -- with all kinds of phonological systems associated with them. Some have lexical tones, some do not. Some have other kinds of phonological accentuation systems that don't count as lexical stress, per se. And there are differences between lexical stress (and its indication), and other kinds of "stress". Contrastive stress, which is way more interesting to consider as a part of writing, IMO, than lexical stress, is a *prosodic* phenomenon, not a lexical one. (And I have been using the email convention of asterisks here to indicate contrastive stress in multiple instances.) And contrastive stress is far from the only kind of communicatively significant pitch phenomenon in speech that typically isn't formally represented in standard orthographies. There are numerous complex scoring systems for linguistic prosody that have been developed by linguists interested in those phenomenon -- which include issues of pace and rhythm, and not merely pitch contours and loudness. It isn't the job of the Unicode Consortium or the Unicode Standard to sort that stuff out or to standardize characters to represent it. When somebody brings to the UTC written examples of established orthographies using character conventions that cannot be clearly conveyed in plain text with the Unicode characters we already have, *then* perhaps we will have something to talk about. --Ken From unicode at unicode.org Tue Jan 8 23:33:21 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 8 Jan 2019 21:33:21 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: On Tue, Jan 8, 2019 at 2:03 AM James Kass via Unicode wrote: > The boundaries of plain text have advanced since the concept originated > and will probably continue to do so. Stress can currently be > represented in plain text with conventions used in lieu of existing > typographic practice. Unicode can preserve texts created using the > plain text kludges/conventions for marking stress, but cannot preserve > printed texts which use standard publishing conventions for marking > stress, such as italics. > Is there any way to preserve The Art of Computer Programming except as a PDF or its TeX sources? Grabbing a different book near me, I don't see any way to preserve them except as full-color paged reproductions. Looking at one data format, it uses bold, italics, and inversion (white on black), in sans-serif, serif and script fonts; certainly in lines like "Treasure standard (+1 starknife)", offering "Treasure standard (+1 starknife)" is completely insufficient. 
Can some books be mostly handled with Unicode plain text and italics? Sure. HTML can handle them quite nicely. I'd say even them will have headers that are typographically distinguished and should optimally be marked in a transcription. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 00:58:51 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 9 Jan 2019 06:58:51 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> Message-ID: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> Ken Whistler wrote, > It isn't the job of the Unicode Consortium or the Unicode Standard > to sort that stuff out or to standardize characters to represent it. Agreed, it isn?t. > When somebody brings to the UTC written examples of established > orthographies using character conventions that cannot be clearly > conveyed in plain text with the Unicode characters we already have, > *then* perhaps we will have something to talk about. If a text is published in all italics, that?s style/font choice.? If a text is published using italics and roman contrastively and consistently, and everybody else is doing it pretty much the same way, that?s a convention. Typewriting is mechanical writing.? Computer keyboards, input methods, and Unicode are technological advances in mechanical writing.? Typesetting for publishing is mechanical writing for the purpose of mass production and distribution of texts. From a printed Webster?s, lexicon (lek? si k?n) [ < Gr. ??????????, word. ]? 1.? a dictionary? 2.? a special vocabulary There?s a convention in English writing to express foreign words using italics.? Not just in published dictionaries, but also in running text where foreign words and phrases are deployed. Other italics conventions include ship names such as the SS ????????? ????????, or titles such as ???? ?????????? ???????? ??????, which is properly spelled with a ??? in ??a?.? (Math kludge fail.)? Of course, since that song title is in a foreign language, it should be italicized anyway. Quoting from, http://navalmarinearchive.com/research/ship_names.html ?Names of specific ships and other vessels are both capitalized and italicized (or capitalized entirely - "all caps" - in text documents denying italics such as email, use of a mechanical typewriter.)? There were technological constraints denying italics in mechanical typewriters.? There?s a technical consortium denying italics in Latin computer plain text, for better or worse.? (Trying to state the obvious here without being judgmental.) The use of italics in English writing to mark stress is another existing convention.? Italics don?t interfere with legibility in English fiction when used to indicate stress in dialogue between the characters.? Rather, the italics add information enabling the reader to approximate how the author intended the dialogue to be *spoken*. And ??????? information cannot be preserved in Unicode plain text without the math kludge or using asterisks and slashes as ???? ?????????? mark-up. ????????????? is important? vs. ?Stress ???? important?. I look forward to the continuing evolution of plain text and would welcome the ability to use italics in plain text without kludges. 
But I?m not holding my breath. Anybody making a formal proposal for italics encoding can be assured that the proposal would be received with something less than enthusiasm.? But stranger things have happened. Many of us here are old enough to remember when something like was a non-starter because in-line pictures were out of scope for a computer plain text standard.? But now I could plop a picture of a cow (or worse) right into this plain text e-mail, if I were so inclined.? That?s progress for you. It?s too bad they called it ????? ????????????? ???????????? ???? ?????????? instead of ?The Chicago Manual of Correct American English Orthographic Conventions for Text Publishing?, eh?? Maybe ?Style? sounded more classy.? But it *does* tend to make it simpler for people to dismiss such distinctions as being merely stylistic. But if the distinction is merely stylistic, we wouldn?t have needed to develop typewriter or computer plain text kludges for them in order to express ourselves properly. (Apologies for length and Happy New Year!) From unicode at unicode.org Wed Jan 9 01:30:26 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 8 Jan 2019 23:30:26 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> Message-ID: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 01:56:23 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 9 Jan 2019 07:56:23 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: <43ac21d8-2e2e-2223-2698-09c7663481e8@gmail.com> David Starner wrote, > Can some books be mostly handled with Unicode plain text > and italics? Sure. HTML can handle them quite nicely. ... Yes, many books can be handled very well with HTML using simple mark-up.? If I were producing a computer file to reproduce an old fiction novel, that's how I'd do it.? Not because it's better or simpler than plain text, but because it can't really be done in plain text at this time.? But if a section of the text is copy/pasted from the screen into an editor, some of the original information may be lost. As you point out, there's a lot of published material best viewed digitally as full color page scans.? As it should be.? That seems unlikely to change. From unicode at unicode.org Wed Jan 9 03:06:26 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 9 Jan 2019 09:06:26 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: Asmus Freytag wrote, > Still, not supported in plain text (unless you abuse the > math alphabets for things they were not intended for). 
The unintended usage of math alphanumerics in the real world is fairly widespread, at least in screen names. (I still get a kick out of this:) http://www.ewellic.org/mathtext.html I wonder how many times Doug's program has been downloaded. Whether it's "abuse" or not might depend on whether one considers the user community of the machines which process the texts to be more important than the user community of human beings who author, exchange, and read the texts. Real humans are the user community of the UCS.? It's up to the user community to determine how its letters and symbols get used.? That's the general rule-of-thumb Unicode applies to the subset user communities, and it should apply to the complete superset as well. From unicode at unicode.org Wed Jan 9 03:25:54 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 9 Jan 2019 01:25:54 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <43ac21d8-2e2e-2223-2698-09c7663481e8@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <43ac21d8-2e2e-2223-2698-09c7663481e8@gmail.com> Message-ID: On Tue, Jan 8, 2019 at 11:58 PM James Kass via Unicode wrote: > > David Starner wrote, > > > Can some books be mostly handled with Unicode plain text > > and italics? Sure. HTML can handle them quite nicely. ... > > Yes, many books can be handled very well with HTML using simple > mark-up. If I were producing a computer file to reproduce an old > fiction novel, that's how I'd do it. Not because it's better or simpler > than plain text, but because it can't really be done in plain text at > this time. But if a section of the text is copy/pasted from the screen > into an editor, some of the original information may be lost. > Looking at the Encyclopedia Brown book at hand, you'd lose any marking that "The Case of the Headless Ghost" is the chapter header. While the picture of the treasure chest may be gratuitous, but "he hung his sign outside the garage:" is followed by an image of said sign that says "BROWN DETECTIVE AGENCY...". If you copy/paste that without carrying the original image along, some of the original information will be lost. In the Gmail editor, I see buttons to make the text bold, italic, or underlined, and to change the color, text size and font. English users tend to see italics as part and parcel of the text formatting. One can argue that's part of history, that italics is somehow different from bold and underline and font and text size changes, but when the standard perception conveniently matches how Unicode encodes the script, there doesn't seem much point in changing things, especially with terabytes of text that encodes italics separately from the plain text matter. Frequently, copy/pasting material does preserve non-plain text features; if I paste a title from Wikipedia into here, it will show up much larger then the rest of the text. It's a pain, because I want the underlying text, not how it was displayed in the context. Honestly, I could argue that case should not be encoded. It would simplify so much processing of Latin script text, and most of the time case-sensitive operations are just wrong. Case is clearly a headache that has to be dealt with in plain text, but it certainly doesn't encourage me to add another set of characters that are basically the same but not. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Jan 9 03:37:53 2019 From: unicode at unicode.org (Tex via Unicode) Date: Wed, 9 Jan 2019 01:37:53 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: <000901d4a7fe$ff8f9a40$feaecec0$@xencraft.com> James Kass wrote: If a text is published in all italics, that?s style/font choice. If a text is published using italics and roman contrastively and consistently, and everybody else is doing it pretty much the same way, that?s a convention. Asmus Freytag responded: But not all conventions are deemed worth of plaintext encoding. What are the criteria for ?worth?? Way back when, when plain text was very very plain, arguments about not including text styling seemed reasonable. But with the inclusion of numerous emoji as James mentioned, it seems odd to be protesting a few characters that would enhance ?plain text? considerably. Plain text editors today support bold, italic, and other styles as a fundamental requirement for usability. More text editors support styling than support bidi or interlinear annotation. If there was support for the handful of text features used by most plain text editors (bold, italic, strikethrough, underline, superscript, subscript, et al) (perhaps using more generalized names such as emphasis, stress, deleted?) then many of the redundant (bold, italic, ?) characters in Unicode would not have been needed. HTML seemed to do very well with a very few styling elements. HTML is of course rich text, but I am just demonstrating that a very small number of control characters would bring plain text into the modern state of text editing. Editors that don?t have the capability for bolding, underlining, etc. could ignore these controls or convert them to another convention. As James requested, it would also provide interoperability. Arguments about all of the conventions that Unicode does not support doesn?t seem compelling to me, as it seems increasingly random as what is accepted and what isn?t, or at least the rationales seem inconsistent. A case in point is the addition of the ?SS? character which made implementation complex with little benefit. Interlinear annotation is perhaps another example. I don?t want to enter into a debate about why these deserved inclusion. I am only saying they seem less useful than some other cases which seem deserving. **And right now, Dr. Strangelove style, my right hand is restraining my other hand from typing on the keyboard, to avoid saying anything about emoji.** Ken distinguished numerous variations of stress, which of course have their place, representations and uses. But perhaps for plain text we only need a way to indicate ?stress here?, leave it to the text editor to have some form of rendering. For more distinctions the user needs to use rich text. Surely there is an 80/20 rule that motivates a solution rather than letting the one percent prevent a capability that 99% would enjoy. (Yes I mixed metaphors. I feel an Occupy Unicode movement coming on. J ) I don?t see how adding a few text style controls would be a burden to most implementers. 
Given ideographic variation sequences, skin tones, hair styles, and the other requirements for proper Unicode support, arguing against a few text styling capabilities seems very last century. (Or at least 1990s?) And it might save having to add a few more bold, italic, superscript, et al compatibility characters? tex -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 05:03:52 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 9 Jan 2019 03:03:52 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: <14543537-d957-64fc-a221-9172c5e22035@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 05:04:18 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 9 Jan 2019 03:04:18 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <000901d4a7fe$ff8f9a40$feaecec0$@xencraft.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <000901d4a7fe$ff8f9a40$feaecec0$@xencraft.com> Message-ID: <5d7a2a1b-e508-dc06-f9a0-5b5996f5610e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 04:29:36 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Wed, 9 Jan 2019 10:29:36 +0000 (GMT) Subject: A last missing link for interoperable representation In-Reply-To: <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> I suggest that a solution to the problem would be to encode a COMBINING ITALICIZER character, such that it only applies to the character that it immediately follows. So, for example, to make the word apricot become displayed in italics one would use seven COMBINING ITALICIZER characters, one after each letter of the word apricot. The display could be sorted out using an OpenType font by treating each pair of a letter and a COMBINING ITALICIZER as a ligature. If, say, the glyph name of COMBINING ITALICIZER were italic then the glyph for c italic could be c_italic and so plain text might well be copyable from a PDF (Portable Document Format) document and pasted to WordPad as plain text retaining the COMBINING ITALICIZER character, depending upon which application program is used to produce the PDF document and which PDF reader is in use. This would seem a workable solution. 
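For what it is worth, the mechanism is the same letter-by-letter interleaving already available today with U+0332 COMBINING LOW LINE, the character Marcel raised at the start of this thread; only the mark differs. A minimal sketch in Python, using an arbitrary Private Use Area code point as a stand-in for the proposed (not encoded) italicizer:

COMBINING_LOW_LINE = "\u0332"        # encoded; the stress-marking convention discussed earlier
HYPOTHETICAL_ITALICIZER = "\uE013"   # arbitrary PUA value, purely for this sketch

def mark_word(word, mark):
    """Insert `mark` after every letter of `word`, as in 'seven italicizers for apricot'."""
    return "".join(ch + mark if ch.isalpha() else ch for ch in word)

def unmark(text, mark):
    """Strip the mark again, recovering the unadorned text."""
    return text.replace(mark, "")

stressed = mark_word("apricot", COMBINING_LOW_LINE)       # renders with low lines, font permitting
italic = mark_word("apricot", HYPOTHETICAL_ITALICIZER)    # would render via letter-plus-mark ligatures
assert unmark(italic, HYPOTHETICAL_ITALICIZER) == "apricot"
print(len("apricot"), len(italic))                        # 7 base letters, 14 code points

The doubling of the code point count is the cost of the scheme; a font that ligates each letter-plus-mark pair, as described above, is what makes it display.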
Many years ago I suggested having characters that would have been comparable in use in plain text as to how italics is switched on and off in HTML (Hypertext Markup Language) yet was advised that such an encoding would make plain text stateful and thus would not be agreed for encoding. That objection might well still be the case today. So using a COMBINING ITALICIZER character would avoid that objection and would also provide a solution that could be straightforwardly implemented using existing OpenType technology. William Overington Wednesday 9 January 2019 From unicode at unicode.org Wed Jan 9 13:58:35 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 9 Jan 2019 19:58:35 +0000 Subject: Where is my character @? In-Reply-To: <51e22fd8-8478-d901-39e3-36f43d757eeb@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <51e22fd8-8478-d901-39e3-36f43d757eeb@ix.netcom.com> Message-ID: There was a post in an unrelated thread remarking that an unnamed writing system used the "at" sign (@) as a letter, and that optimal encoding for that orthography was a non-starter. A question as to whether that writing system was casing went unanswered, but a kind list member offered some pointers privately. The language in question is Koalib, which is spoken in the Sudan. It is a casing script and the upper case form uses an upper case "A" with a wrap around as in the lower case "@". The current "solution" is for the users to use the P.U.A. for both upper and lower case letters, and fonts such as Doulos SIL support that P.U.A. encoding. A Google search for "Koalib Unicode" finds the following: Wikipedia: https://en.wikipedia.org/wiki/Koalib_language 2004-08-25 Lorna A. Priest, Public Review Issue # 40 Revised Proposal to Encode... http://www.unicode.org/review/pr-40-atsigns.pdf 2004-10-20 Doug Ewell, L2/04-365 The case against encoding the Koalib @-letters http://unicode.org/L2/L2004/04365-pr40-ewell.pdf 2012-04-17 Karl Pentzlin, L2/12-116 "Capitalized Commercial At" proposal http://unicode.org/L2/L2012/12116-capital-at.pdf 2018-12-26 Eduardo Mar?n Silva, L2/19-006 Proposal to encode... http://www.unicode.org/L2/L2019/19006-capital-at.pdf It's probably old-fashioned to say that technology should be forced to accomodate people rather than the other way around.? But it's good to note that efforts are still being made on behalf of the users to make progress towards U.C.S. inclusion. From unicode at unicode.org Wed Jan 9 15:33:02 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 09 Jan 2019 14:33:02 -0700 Subject: A last missing link for interoperable representation Message-ID: <20190109143302.665a7a7059d7ee80bb4d670165c8327d.791f9e387d.wbe@email03.godaddy.com> James Kass wrote: > (I still get a kick out of this:) > http://www.ewellic.org/mathtext.html > > I wonder how many times Doug's program has been downloaded. I?ll never know, since I never attached a web counter of any sort to it. Andrew West?s online ?Unicode Text Styler? 
includes non-math characters (like circled and fullwidth) as well, and is probably better, although it doesn't include the ransom-note option: http://www.babelstone.co.uk/Unicode/text.html -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jan 9 16:03:25 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 09 Jan 2019 15:03:25 -0700 Subject: [OT] Digest supports only ASCII (was: Re: A last missing link...) Message-ID: <20190109150325.665a7a7059d7ee80bb4d670165c8327d.0b0aa3fdf6.wbe@email03.godaddy.com> As reported in Unicode Digest, Vol 61, Issue 3, James Kass wrote: > And ??????? > information cannot be preserved in Unicode plain text without the math > kludge or using asterisks and slashes as ???? ?????????? mark-up. > > ????????????? is important? vs. ?Stress ???? important?. I know this is an old argument and this will probably never be fixed, but I wish the Unicode email digest could be updated to support, you know, Unicode. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jan 9 16:15:08 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 09 Jan 2019 15:15:08 -0700 Subject: Where is my character =?UTF-8?Q?=40=3F?= Message-ID: <20190109151508.665a7a7059d7ee80bb4d670165c8327d.452f32d7a9.wbe@email03.godaddy.com> James Kass wrote: > It's probably old-fashioned to say that technology should be forced to > accomodate people rather than the other way around. But it's good to > note that efforts are still being made on behalf of the users to make > progress towards U.C.S. inclusion. I'm as opposed to this proposal as I was in 2004, if not more so, and I'm working on a brief response document for next week's UTC. Among other things, it's not at all clear that the orthography using @, cited in three works from a single publisher in 1998, has been adopted or become particularly widespread within the Koalib community. (And no, this does not constitute "disdain for the small community.") -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jan 9 18:41:05 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 9 Jan 2019 19:41:05 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: On 1/9/19 2:30 AM, Asmus Freytag via Unicode wrote: > > English use of italics on isolated words to disambiguate the reading > of some sentences is a convention. Everybody who does it, does it the > same way. Not supported in plain text. > > German books from the Fraktur age used Antiqua for Latin and other > foreign terms. Definitely a convention that was rather universally > applied (in books at least). Not supported in plain text. > Aren't there printing conventions that indicate this type of "contrastive stress" using letterspacing instead of font style?? I'm s?u?r?e I've seen it in German and other Latin-written languages, and also even occasionally in Hebrew, whose experiments with italics tend not to be encouraging. ~mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Jan 9 18:45:31 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 9 Jan 2019 19:45:31 -0500 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> Message-ID: On 1/9/19 12:33 AM, David Starner via Unicode wrote: > > > Is there any way to preserve The Art of Computer Programming except as > a PDF or its TeX sources? Grabbing a different book near me, I don't > see any way to preserve them except as full-color paged reproductions. > Looking at one data format, it uses bold, italics, and inversion > (white on black), in sans-serif, serif and script fonts; certainly in > lines like "Treasure standard (+1 starknife)", offering > "Treasure standard (+1 starknife)" is completely insufficient. > > Can some books be mostly handled with Unicode plain text and italics? > Sure. HTML can handle them quite nicely. I'd say even them will have > headers that are typographically distinguished and should optimally be > marked in a transcription. The line I used to say about this is ?there?s no such thing as plain text on paper.?? The concept of ?plain text? vs markup or styling is purely in the digital domain.? On physical artifacts, it?s just ink on wood-pulp, and the only ?real? description of the page is a graphic image. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 9 18:49:38 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 9 Jan 2019 19:49:38 -0500 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <43ac21d8-2e2e-2223-2698-09c7663481e8@gmail.com> Message-ID: On 1/9/19 4:25 AM, David Starner via Unicode wrote: > > > Honestly, I could argue that case should not be encoded. It would > simplify so much processing of Latin script text, and most of the time > case-sensitive operations are just wrong. Case is clearly a headache > that has to be dealt with in plain text, but it certainly doesn't > encourage me to add another set of characters that are basically the > same but not. I completely agree.? Casing of letters (in general, I mean) was a horrible mistake and is way more trouble than it?s worth.? Too late to fix it, and given how entrenched it is it did kind of have to be encoded, but it?s such a bad idea.? And then other alphabets see it and think ?hey, we need capitals too!? and you get capitals for all the IPA extensions and Cherokee and so on... Ugh. ~mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Jan 9 20:31:10 2019 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Thu, 10 Jan 2019 03:31:10 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <20190109143302.665a7a7059d7ee80bb4d670165c8327d.791f9e387d.wbe@email03.godaddy.com> References: <20190109143302.665a7a7059d7ee80bb4d670165c8327d.791f9e387d.wbe@email03.godaddy.com> Message-ID: <20190110023110.jgglofux535kvqzn@angband.pl> On Wed, Jan 09, 2019 at 02:33:02PM -0700, Doug Ewell via Unicode wrote: > James Kass wrote: > > (I still get a kick out of this:) > > http://www.ewellic.org/mathtext.html > Andrew West?s online ?Unicode Text Styler? includes non-math > characters (like circled and fullwidth) as well, and is probably better, > although it doesn't include the ransom-note option: > > http://www.babelstone.co.uk/Unicode/text.html And for the command line, there's my https://github.com/kilobyte/tran No ransom-note as I pretend the tool's primary use is tran{scrib,literat}ing between actual human scripts -- but it's remarkably easier to automate a command line tool... Meow! -- ??????? Hans 1 was born and raised in Johannesburg, then moved to Boston, ??????? and has just became a naturalized citizen. Hans 2's grandparents ??????? came from Melanesia to D?sseldorf, and he hasn't ever been outside ??????? Germany until yesterday. Which one is an African-American? From unicode at unicode.org Wed Jan 9 21:00:43 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 9 Jan 2019 19:00:43 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: <36610912-dcdd-917b-7ddc-ced595be76b8@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 10 09:16:42 2019 From: unicode at unicode.org (Arthur Reutenauer via Unicode) Date: Thu, 10 Jan 2019 16:16:42 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> Message-ID: <20190110151642.dz7r6pvhhqh2nay6@phare.normalesup.org> On Wed, Jan 09, 2019 at 09:06:26AM +0000, James Kass via Unicode wrote: > The unintended usage of math alphanumerics in the real world is fairly > widespread, at least in screen names. On this topic, I was just pointed to https://twitter.com/kentcdodds/status/1083073242330361856 ?You ?????????? it's ??????? to ?????????? your tweets and usernames ???????? ??????. But have you ???????????????? to what it ???????????? ???????? with assistive technologies like ???????????????????? 
Best, Arthur From unicode at unicode.org Thu Jan 10 10:24:59 2019 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Thu, 10 Jan 2019 21:54:59 +0530 Subject: Excessive emoji usage and TTS (was Re: A last missing link) Message-ID: On Thu 10 Jan, 2019, 20:49 Arthur Reutenauer via Unicode < unicode at unicode.org wrote: > > On this topic, I was just pointed to > > https://twitter.com/kentcdodds/status/1083073242330361856 > > ?You ?????????? it's ??????? to ?????????? your tweets and usernames > ???????? ??????. But > have you ???????????????? to what it ???????????? ???????? with assistive > technologies > like ???????????????????? Something similar: https://twitter.com/aaronreynolds/status/1083098920132071424?s=20 "This is what it?s like to get texts from my fourteen year old while driving." https://t.co/s8949bmgZI -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 10 10:41:29 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Thu, 10 Jan 2019 18:41:29 +0200 Subject: Excessive emoji usage and TTS (was Re: A last missing link) In-Reply-To: References: Message-ID: <20190110164129.GC28761@macbook.localdomain> On Thu, Jan 10, 2019 at 09:54:59PM +0530, Shriramana Sharma via Unicode wrote: > On Thu 10 Jan, 2019, 20:49 Arthur Reutenauer via Unicode < > unicode at unicode.org wrote: > > > > > On this topic, I was just pointed to > > > > https://twitter.com/kentcdodds/status/1083073242330361856 > > > > ?You ?????????? it's ??????? to ?????????? your tweets and usernames > > ???????? ??????. But > > have you ???????????????? to what it ???????????? ???????? with assistive > > technologies > > like ???????????????????? > > > Something similar: > > https://twitter.com/aaronreynolds/status/1083098920132071424?s=20 > > "This is what it?s like to get texts from my fourteen year old while > driving." > > https://t.co/s8949bmgZI That is pretty good actually and even a positive point for emoji (if these were mere images you would get nothing out of it without extra tagging, and it would still lack the standardization). Nothing like what one gets from the math symbols abuse. Regards, Khaled From unicode at unicode.org Thu Jan 10 13:35:40 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 10 Jan 2019 19:35:40 +0000 Subject: Excessive emoji usage and TTS (was Re: A last missing link) In-Reply-To: <20190110164129.GC28761@macbook.localdomain> References: <20190110164129.GC28761@macbook.localdomain> Message-ID: On 2019-01-10 4:41 PM, Khaled Hosny wrote: > That is pretty good actually and even a positive > point for emoji (if these were mere images you > would get nothing out of it without extra tagging, > and it would still lack the standardization). > Nothing like what one gets from the math symbols > abuse. Yes, it's quite a difference.? I can read the text with math character use and can skip the texts with emoji. Mathematicians borrowed these letters from writers.? Now writers are borrowing them back.? Seems fair. 
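For anyone who has not seen the trick spelled out, here is a minimal sketch of the kind of mapping such tools perform. Doug Ewell's mathtext and Andrew West's Unicode Text Styler do this far more completely; the code below only illustrates the principle and is not theirs.

ITALIC_CAPITAL_A = 0x1D434   # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E     # MATHEMATICAL ITALIC SMALL A

def math_italic(text):
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif ch == "h":
            out.append("\u210E")   # U+1D455 is unassigned; italic h is U+210E PLANCK CONSTANT
        elif "a" <= ch <= "z":
            out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)         # digits, accented letters, punctuation pass through unstyled
    return "".join(out)

print(math_italic("Stress is important"))

The pass-through branch is also why the trick stops at basic Latin: there are no mathematical-italic counterparts for accented letters.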
From unicode at unicode.org Thu Jan 10 17:43:46 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 10 Jan 2019 23:43:46 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> Message-ID: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> On 2019-01-10 11:27 PM, wjgo_10009 at btinternet.com wrote: > Yesterday I wrote as follows. > >> I suggest that a solution to the problem would be to encode a >> COMBINING ITALICIZER character, such that it only applies to the >> character that it immediately follows. So, for example, to make the >> word apricot become displayed in italics one would use seven >> COMBINING ITALICIZER characters, one after each letter of the word >> apricot. > > I have now made a test font. I used a Private Use Area code point and > a visible glyph for this test. It works well. > > https://forum.high-logic.com/viewtopic.php?f=10&t=7831 > > Would it be a good idea to encode such a character into Unicode? The > first step would be to persuade the "powers that be" that italics are > needed.? That seems presently unlikely.? There's an entrenched mindset > which seems to derive from the fact that pre-existing character sets > were based on mechanical typewriting technology and were limited by > the maximum number of glyphs in primitive computer fonts. The first step would be to persuade the "powers that be" that italics are needed.? That seems presently unlikely.? There's an entrenched mindset which seems to derive from the fact that pre-existing character sets were based on mechanical typewriting technology and were further limited by the maximum number of glyphs in primitive computer fonts. The second step would be to persuade Unicode to encode a new character rather than simply using an existing variation selector character to do the job. From unicode at unicode.org Thu Jan 10 17:46:42 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 10 Jan 2019 23:46:42 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> Message-ID: Oops.? Sorry for the inadvertent copy/paste duplication. 
From unicode at unicode.org Thu Jan 10 17:27:10 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Thu, 10 Jan 2019 23:27:10 +0000 (GMT) Subject: A last missing link for interoperable representation In-Reply-To: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> Message-ID: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> Yesterday I wrote as follows. > I suggest that a solution to the problem would be to encode a > COMBINING ITALICIZER character, such that it only applies to the > character that it immediately follows. So, for example, to make the > word apricot become displayed in italics one would use seven COMBINING > ITALICIZER characters, one after each letter of the word apricot. I have now made a test font. I used a Private Use Area code point and a visible glyph for this test. It works well. https://forum.high-logic.com/viewtopic.php?f=10&t=7831 Would it be a good idea to encode such a character into Unicode? William Overington Thursday 10 January 2019 -------------- next part -------------- A non-text attachment was scrubbed... Name: italicizer_maquette_example.png Type: image/png Size: 19268 bytes Desc: not available URL: From unicode at unicode.org Thu Jan 10 18:28:08 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Thu, 10 Jan 2019 19:28:08 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> Message-ID: <965db628-a917-9083-7bb0-910c571d4441@kli.org> On 1/10/19 6:43 PM, James Kass via Unicode wrote: > > The first step would be to persuade the "powers that be" that italics > are needed.? That seems presently unlikely.? There's an entrenched > mindset which seems to derive from the fact that pre-existing > character sets were based on mechanical typewriting technology and > were further limited by the maximum number of glyphs in primitive > computer fonts. > > The second step would be to persuade Unicode to encode a new character > rather than simply using an existing variation selector character to > do the job. A perhaps more affirmative step, not necessarily first but maybe, would be to write up a proposal and submit it through channels so the "powers that be" can respond officially. 
~mark From unicode at unicode.org Thu Jan 10 18:37:11 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 00:37:11 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <965db628-a917-9083-7bb0-910c571d4441@kli.org> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <965db628-a917-9083-7bb0-910c571d4441@kli.org> Message-ID: <081f819e-ca23-4008-55c3-10bc33dfefac@gmail.com> Mark E. Shoulson wrote, > A perhaps more affirmative step, not necessarily first > but maybe, would be to write up a proposal and submit > it through channels so the "powers that be" can > respond officially. Indeed.? And a preliminary step might be to float the concept on the public list and see how well it is received.? Such discussion can often lead to more robust proposals, or an alternative use for one's time.? (smiles) From unicode at unicode.org Thu Jan 10 19:14:45 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 11 Jan 2019 01:14:45 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> Message-ID: <20190111011445.1773182d@JRWUBU2> On Thu, 10 Jan 2019 23:43:46 +0000 James Kass via Unicode wrote: > The second step would be to persuade Unicode to encode a new > character rather than simply using an existing variation selector > character to do the job. Actually, this might be a superior option. Richard. 
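To make the variation-selector idea concrete: mechanically it is the same letter-plus-mark interleaving, only with a selector instead of a combining character. No italic variation sequences are defined in Unicode, so VS14 in this sketch is only a stand-in for whatever selector such a proposal might register.

VS14 = "\uFE0D"   # VARIATION SELECTOR-14; here a placeholder for a hypothetical "italic" selector

def with_selector(word, selector=VS14):
    return "".join(ch + selector for ch in word)

sequence = with_selector("apricot")
print(" ".join(f"U+{ord(ch):04X}" for ch in sequence))
# U+0061 U+FE0D U+0070 U+FE0D U+0072 U+FE0D U+0069 U+FE0D U+0063 U+FE0D U+006F U+FE0D U+0074 U+FE0D

Since variation selectors are default-ignorable, a renderer that does not support such a sequence would at least fall back to displaying the plain letters.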
From unicode at unicode.org Thu Jan 10 19:48:23 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 01:48:23 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <20190111011445.1773182d@JRWUBU2> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> Message-ID: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Richard Wordingham responded, >> ... simply using an existing variation >> selector character to do the job. > > Actually, this might be a superior option. For the V.S. option there should be a provision for consistency and open-endedness to keep it simple.? Start with VS14 and work backwards for italic, fraktur, antiqua...? (whatever the preferred order works out to be).? Or (better yet) start at VS17 and move forward (and change the rule that seventeen and up is only for CJK). Is it true that many of the CJK variants now covered were previously considered by the Consortium to be merely stylistic variants? From unicode at unicode.org Fri Jan 11 01:13:18 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 07:13:18 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Message-ID: I've been advised off-list that my attempt to make an analogy with CJK doesn't sit well. It's fair to say that ideographic variation sequences are for plain-text representation of material which isn't suitable for atomic encoding.? An analogy can be drawn from that situation to the situation of other scripts, such as Latin (or Khmer). The ideographic variation sequences also represent an anomaly:? if it's not suitable for plain-text encoding, it doesn't *need* plain-text representation.? Except that it does. It's the demands of the CJK user community which drive the plain-text representation, which is proper.? This method should apply to non-CJK scripts as well. Styled Latin text is being simulated with math alphanumerics now, which means that data is being interchanged and archived.? That's the user demand illustrated. Whether the users are doing it Chicago style or just plain willy-nilly doesn't matter; it's being done.? 
User communities drive their own script development and advancement using the tools available. From unicode at unicode.org Fri Jan 11 01:29:42 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Fri, 11 Jan 2019 07:29:42 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Message-ID: <61e13a9b-1ade-c89b-8cae-07c965194c01@it.aoyama.ac.jp> On 2019/01/11 10:48, James Kass via Unicode wrote: > Is it true that many of the CJK variants now covered were previously > considered by the Consortium to be merely stylistic variants? What is a stylistic variant or not is quite a bit more complicated for CJK than for scripts such as Latin. In some contexts, something may be just a stylistic variant, whereas in other contexts (e.g. person registries,...), it may be more than a stylistic distinction. Also, in contrast to the issue discussed in the current thread, there's no consistent or widely deployed solution for such CJK variants in rich text scenarios such as HTML. Regards, Martin. From unicode at unicode.org Fri Jan 11 02:13:44 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Fri, 11 Jan 2019 08:13:44 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Message-ID: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> On 2019/01/11 16:13, James Kass via Unicode wrote: > Styled Latin text is being simulated with math alphanumerics now, which > means that data is being interchanged and archived.? That's the user > demand illustrated. Almost by definition, styled text isn't plain text, even if it's simulated by something else. And the simulation is highly limited, as the voicing examples and the fact that the math alphanumerics only cover basic Latin have shown. Regards, Martin. 
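Martin's coverage point is easy to check. A minimal sketch, assuming Python 3 and nothing beyond the standard library (the word chosen is only an example): the math alphanumerics supply italic forms for the basic Latin letters alone, so anything else, such as an accented letter, has to be faked with a combining mark or left plain.

    import unicodedata

    # "café" simulated with Mathematical Italic letters: there is no italic é,
    # so the acute accent must be supplied as the combining mark U+0301.
    styled_cafe = "\U0001D450\U0001D44E\U0001D453\U0001D452\u0301"

    # NFKC folds the math letters back to ASCII and recomposes e + U+0301 to é.
    print(unicodedata.normalize("NFKC", styled_cafe) == "caf\u00e9")   # True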
From unicode at unicode.org Fri Jan 11 04:43:55 2019 From: unicode at unicode.org (Tex via Unicode) Date: Fri, 11 Jan 2019 02:43:55 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> Message-ID: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> Martin, James is making the case there is demand or a user need and that the proof is that users are using inconsistent tactics to simulate a solution to their problem. The response that: "Almost by definition, styled text isn't plain text, even if it's simulated by something else." is a bit like Humpty Dumpty saying words mean what I want them to mean. Most of the emoji aren't plain text and Unicode has them in abundance. Ruby text is also not plain text. Their inclusion was the user need for consistency and interoperability. The original emoji had inconsistent encodings and were a problem for interchange as well as search and rendering. Their existence and popularity became their own problem requiring further styling (e.g. coloring) and greatly expanded enumeration (foods, animals, et al.) Let's be honest and admit the actual demand for some of these latter objects in plain text is marginal and certainly is less than the prevalence of italics. The response that: "the simulation is highly limited, as the voicing examples and the fact that the math alphanumerics only cover basic Latin have shown." unless I misunderstand your meaning, is the argument that we encoded only these therefore the use case is limited to these. In a different message you say: "Also, in contrast to the issue discussed in the current thread, there's no consistent or widely deployed solution for such CJK variants in rich text scenarios such as HTML." I don't see how a rich text solution has any bearing on plain text. We could take the point that if there was no need in HTML to solve the problem than there wasn't demand justifying the need in Unicode. :-) I understand your actual intent to say there was a need for CJK variants and there was no other solution. However, the fact that there is a rich text solution for italics isn't helpful to plain text users. HTML had bidirectional isolates and after the fact Unicode encoded them as well. The fact that there isn't a consistent way to represent stress or the other uses for italics (or obliques, and bold, etc.) does make certain searches across large numbers of plain texts problematic. In the same way it is sometimes important to distinguish capitalized text when searching (polish vs Polish) it would be helpful to do the same for italicized text. For example, if I am searching for the movie title "Contact" vs. 
all the places where texts reference a personal "Contact", distinguishing italicized titles would help. And to the extent that users are inserting non-standardized punctuation or other characters for "styling" it makes reliable searching difficult. As James mentioned it helps with interoperability as well. In the '90s it made sense to resist styling plain text. In the 2020's, with more than 100k characters, numerous pictures and character adornments, it seems anachronistic to be arguing against a handful of control characters that would standardize a common text requirement. Most rendering systems will handle it easily and any plain text editor or other software that supports a combining strikethrough character would easily adapt a combining italicize or a combining bold character. tex From unicode at unicode.org Fri Jan 11 16:28:40 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Fri, 11 Jan 2019 14:28:40 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> Message-ID: Emoji were being encoded as characters, as codepoints in private use areas. That inherently called for a Unicode response. Bidirectional support is a headache; the amount of confusion and outright exploits from them is way higher then we like.The HTML support probably doesn't help that. However, properly mixing Hebrew and English (e.g.) is pretty clearly a plain text problem. There are terabytes of Latin text out there, most of it encoded in formats that already support italics. Whereas emoji, encoded as characters in a then limited number of systems, could be subsumed into Unicode easily, much of that text will never be edited and those formats will never exclude the existing means of marking italics out of bounds, offering multiple ways to do italics in perpetuity. -- Kie ekzistas vivo, ekzistas espero. 
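Tex's "Contact" example can be made concrete. A minimal sketch, assuming Python 3 and only the standard library (the styled string below is illustrative, not taken from any real text): a plain substring search misses a title written with the math alphanumerics, while a compatibility folding such as NFKC finds it again, which is the kind of preprocessing a search tool would have to bolt on.

    import unicodedata

    # "Contact" written with Mathematical Italic letters (U+1D400 block),
    # the way some users simulate an italicized title in plain text.
    styled = "\U0001D436\U0001D45C\U0001D45B\U0001D461\U0001D44E\U0001D450\U0001D461"

    print("Contact" in styled)                                  # False: no code-point match
    print("Contact" in unicodedata.normalize("NFKC", styled))   # True: the math letters fold to ASCII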
From unicode at unicode.org Fri Jan 11 16:10:25 2019 From: unicode at unicode.org (via Unicode) Date: Fri, 11 Jan 2019 23:10:25 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> Message-ID: On 11.01.2019 11:43, Tex via Unicode wrote: > Martin, > > James is making the case there is demand or a user need and that the > proof is that users are using inconsistent tactics to simulate a > solution to their problem. The use of math characters is mostly to get around limitations of Twitter (and some other platforms). There are plenty of rich text formats like Markdown and Html existing already. I am rather doubtful that it should be Unicode's responsibility to get around lack of rich text support via special characters and fonts, especially since many platforms do not allow users to freely change the fonts (and if these platforms installed such fonts, they could just as easily support markup/rich text instead). Even if they do, the programs/platforms involved would not necessarily enable these fonts by default: if the wanted rich text, they would be supporting it already. Also, any Unicode-based rich text standard would not really be standard compared to the vast amount of HTML out there already. David Faulks From unicode at unicode.org Fri Jan 11 17:17:24 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 23:17:24 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> References: <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> Message-ID: Martin J. D?rst wrote, > Almost by definition, styled text isn't plain text, even if it's > simulated by something else. By an earlier definition, in-line pictures weren't plain text, until people started exchanging them as though they were.? In this case, people are exchanging plain text as plain text. > And the simulation is highly limited, as > the voicing examples and the fact that the math alphanumerics > only cover basic Latin have shown. 
The voicing examples are software shortcomings which could be overcome.? The software people might seize the opportunity to accommodate their users and vocalize bold *loudly*, italics with /stress/, and fraktur with a Boris Karloff (or Bela Lugosi) voice. That would be up to them.? But the voicing examples aren't really about reading and writing and how they relate to the character encoding.? (Not saying that the voicing examples aren't interesting and relevant to the overall topic.) The fact that the math alphanumerics are incomplete may have been part of what prompted Marcel Schneider to start this thread. If stringing encoded italic Latin letters into words is an abuse of Unicode, then stringing punctuation characters to simulate a "smiley" (?) is an abuse of ASCII - because that's not what those punctuation characters are *for*.? If my brain parses such italic strings into recognizable words, then I guess my brain is non-compliant. From unicode at unicode.org Fri Jan 11 17:54:17 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 11 Jan 2019 23:54:17 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> References: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> Message-ID: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Tex Texin wrote, > ... However, the fact that there is a rich text solution for italics > isn't helpful to plain text users. Truer words were never spoken. > In the '90s it made sense to resist styling plain text. In the 2020's, > with more than 100k characters, numerous pictures and character > adornments, it seems anachronistic to be arguing against a handful > of control characters that would standardize a common text > requirement. Most rendering systems will handle it easily and any > plain text editor or other software that supports a combining > strikethrough character would easily adapt a combining italicize or > a combining bold character. Exactly.? William Overington has already posted a proof-of-concept here: https://forum.high-logic.com/viewtopic.php?f=10&t=7831 ... using a P.U.A. character /in lieu/ of a combining formatting or VS character.? The concept is straightforward and works properly with existing technology. 
From unicode at unicode.org Sat Jan 12 04:57:26 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sat, 12 Jan 2019 10:57:26 +0000 (GMT) Subject: A last missing link for interoperable representation References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Message-ID: On 2019-01-11, James Kass via Unicode wrote: > Exactly.? William Overington has already posted a proof-of-concept here: > https://forum.high-logic.com/viewtopic.php?f=10&t=7831 > ... using a P.U.A. character /in lieu/ of a combining formatting or VS > character.? The concept is straightforward and works properly with > existing technology. It does not work with much existing technology. Interspersing extra codepoints into what is otherwise plain text breaks all the existing software that has not been, and never will be updated to deal with arbitrarily complex algorithms required to do Unicode searching. Somebody who need to search exotic East Asian text will know that they need software that understands VS, but a plain ordinary language user is unlikely to have any idea that VS exist, or that their searches will mysteriously fail if they use this snazzy new pseudo-plain-text italicization technique It's also fundamentally misguided. When I _italicize_ a word, I am writing a word composed of (plain old) letters, and then styling the word; I am not composing a new and different word ("_italicize_") that is distinct from the old word ("italicize") by virtue of being made up of different letters. I think the VS or combining format character approach *would* have been a better way to deal with the mess of mathematical alphabets, because for mathematicians, *b* is a distinct symbol from b, and while there may be correlated use of alphabets, there need be no connection whatever between something notated b and something notated *b*. But for plain text, it's crazy. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
From unicode at unicode.org Sat Jan 12 06:29:39 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 12 Jan 2019 12:29:39 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Message-ID: <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Julian Bradford wrote, "It does not work with much existing technology. Interspersing extra codepoints into what is otherwise plain text breaks all the existing software that has not been, and never will be updated to deal with arbitrarily complex algorithms required to do Unicode searching. Somebody who need to search exotic East Asian text will know that they need software that understands VS, but a plain ordinary language user is unlikely to have any idea that VS exist, or that their searches will mysteriously fail if they use this snazzy new pseudo-plain-text italicization technique" Sounds like you didn't try it.? VS characters are default ignorable. First one is straight, the second one has VS2 characters interspersed and after the "t": apricot a?p?r?i?c?o?t? Notepad finds them both if you type the word "apricot" into the search box. "..." Regardless of how you input italics in rich-text, you are putting italic forms into the display. "I think the VS or combining format character approach *would* have been a better way to deal with the mess of mathematical alphabets, ..." I think so, too, but since I'm not a member of *that* user community, my opinion hasn't much value.? Plus VS characters were encoded after the math stuff. "But for plain text, it's crazy." Are you a member of the plain-text user community? From unicode at unicode.org Sat Jan 12 07:21:29 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 12 Jan 2019 13:21:29 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: <91e983a0-a257-6bdb-66af-0622e9a85233@gmail.com> > Julian Bradford wrote, * Bradfield, sorry. 
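James's "apricot" demonstration can be reproduced outside Notepad with a small sketch, again assuming Python 3; the helper functions are hypothetical, written only for this illustration. The point is that a comparison which drops the default-ignorable variation selectors (U+FE00..U+FE0F and U+E0100..U+E01EF) matches both spellings, while a raw code-point comparison does not.

    def strip_variation_selectors(s: str) -> str:
        # Remove VS1-VS16 and the ideographic variation selectors before matching.
        return "".join(ch for ch in s
                       if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF))

    def contains_ignoring_vs(haystack: str, needle: str) -> bool:
        return strip_variation_selectors(needle) in strip_variation_selectors(haystack)

    plain = "apricot"
    tagged = "".join(ch + "\uFE01" for ch in plain)   # VS2 after every letter

    print(tagged == plain)                      # False: code point for code point they differ
    print(contains_ignoring_vs(tagged, plain))  # True: a VS-aware search still finds the word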
From unicode at unicode.org Sat Jan 12 07:22:21 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Jan 2019 13:22:21 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Message-ID: <20190112132221.7497fdea@JRWUBU2> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) Julian Bradfield via Unicode wrote: > It's also fundamentally misguided. When I _italicize_ a word, I am > writing a word composed of (plain old) letters, and then styling the > word; I am not composing a new and different word ("_italicize_") that > is distinct from the old word ("italicize") by virtue of being made up > of different letters. And what happens when you capitalise a word for emphasis or to begin a sentence? Is it no longer the same word? > I think the VS or combining format character approach *would* have > been a better way to deal with the mess of mathematical alphabets, > because for mathematicians, *b* is a distinct symbol from b, and while > there may be correlated use of alphabets, there need be no connection > whatever between something notated b and something notated *b*. Perhaps the influence of school has lingered too well, but I would be very uncomfortable with such a lack of connection. The idea that *b* is a vector and _b_ is its magnitude has stuck well. Italicisation on the other hand, is a confirmation that something is a symbol, and naturally disappears in handwriting. Richard. From unicode at unicode.org Sat Jan 12 08:21:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 12 Jan 2019 14:21:19 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <20190112132221.7497fdea@JRWUBU2> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> Message-ID: <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Reading & writing & 'rithmatick... This is a math formula: a + b = b + a ... where the estimable "mathematician" used Latin letters from ASCII as though they were math alphanumerics variables. This is an italicized word: ???????????????????????? ... 
where the "geek" hacker used Latin italics letters from the math alphanumeric range as though they were Latin italics letters. Where's the harm? FWIW, the math formula: a + b # ?? + ?? ... becomes invalid if normalized NFKD/NFKC.? (Or if copy/pasted from an HTML page using marked-up ASCII into a plain-text editor.) Yet the italicized word "kakistocracy" is still legible if normalized.? If copy/pasted from an HTML page using the math alphanumeric characters, it survives intact.? If copy/pasted from markupped ASCII, it's still legible. From unicode at unicode.org Sat Jan 12 10:21:43 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 12 Jan 2019 16:21:43 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: <20190112162143.06d36c69@JRWUBU2> On Sat, 12 Jan 2019 14:21:19 +0000 James Kass via Unicode wrote: > FWIW, the math formula: > a + b # ?? + ?? > ... becomes invalid if normalized NFKD/NFKC.? (Or if copy/pasted from > an HTML page using marked-up ASCII into a plain-text editor.) (a) Italic versus plain is not significant in the mathematics I've encountered. It's worse than distinguishing capital em and capital mu, which is allowed if you're the head of department. (b) a + b # b + a is a general, but not universally true, statement for ordinal numbers, the simplest example being ? = 1 + ? ? ? + 1 (c) You're talking about a folding, not a normalisation. The example you want would use emboldening, e.g. "In general, ?? + ?? ??? ?? + ??" which is true for vectors ???? and ?? if one is treating the quaternions as a direct sum of reals and real 3-vectors. Richard. From unicode at unicode.org Sat Jan 12 10:26:59 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Sat, 12 Jan 2019 16:26:59 +0000 (GMT) Subject: A last missing link for interoperable representation In-Reply-To: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> Message-ID: <7f64c5a1.9721.16842e34f54.Webtop.70@btinternet.com> James Kass wrote: > For the V.S. 
option there should be a provision for consistency and > open-endedness to keep it simple. Start with VS14 and work backwards > for italic, ? I have now made, tested and published a font, VS14 Maquette, that uses VS14 to indicate italic. https://forum.high-logic.com/viewtopic.php?f=10&t=7831&p=37561#p37561 William Overington Saturday 12 January 2019 ------ Original Message ------ From: "James Kass via Unicode" To: unicode at unicode.org Sent: Friday, 2019 Jan 11 At 01:48 Subject: Re: A last missing link for interoperable representation Richard Wordingham responded, >> ... simply using an existing variation >> selector character to do the job. > > Actually, this might be a superior option. For the V.S. option there should be a provision for consistency and open-endedness to keep it simple.? Start with VS14 and work backwards for italic, fraktur, antiqua...? (whatever the preferred order works out to be).? Or (better yet) start at VS17 and move forward (and change the rule that seventeen and up is only for CJK). Is it true that many of the CJK variants now covered were previously considered by the Consortium to be merely stylistic variants? From unicode at unicode.org Sat Jan 12 12:50:00 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 12 Jan 2019 10:50:00 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <20190112132221.7497fdea@JRWUBU2> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> Message-ID: <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 12 13:16:17 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 12 Jan 2019 20:16:17 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> Message-ID: <7670985e-2e49-5b5e-848c-24e00cff7ebd@orange.fr> On 12/01/2019 00:17, James Kass via Unicode wrote: [?] > The fact that the math alphanumerics are incomplete may have been > part of what prompted Marcel Schneider to start this thread. No, really not at all. I didn?t even dream of having italics in Unicode working out of the box. 
That would exactly be the sort of demand that would have completely discredited me advocating the use of preformatted superscripts for the Unicode conformant and interoperable representation of a handful of languages spoken by one third of mankind and using the Latin script, while no other scripts are concerned with that orthographic feature. (No clear borderline between orthography and typography here, but with ordinal indicators in particular and abbreviation indicators in general we?re clearly on the orthographic side. (SC2/WG3 would agree, since they deemed "?" and "?" worth encoding in 8-bit charsets.) It started when I found in the XKB keysymdef.h four dead keysyms added for Karl Pentzlin?s German T3, among which dead_lowline, and remembered that at some point in history, users were deprived of the means of typing a combining underscore. I didn?t think at the extra letterspacing (called ?gesperrt? spaced out in German) that Mark E. Shoulson mentioned upthread, (a) because it isn?t used for that purpose in the locale I?m working for, and (b) because emulating it with interspersed NARROW NO-BREAK SPACEs would make that text unsearchable. > > If stringing encoded italic Latin letters into words is an abuse of > Unicode, then stringing punctuation characters to simulate a "smiley" > (?) is an abuse of ASCII - because that's not what those punctuation > characters are *for*. If my brain parses such italic strings into > recognizable words, then I guess my brain is non-compliant. I think that like Google Search having extensive equivalence classes treating mathematical letters like plain ASCII, text-to-speech software could use a little bit of AI to recognize strings of those letters as ordinary words with emphasis, like James Kass suggested ? the more as we?re actually able to add combining diacritics for correct spelling in some diacriticized alphabets (including a few with non-decomposable diacritics), though with somewhat less-than-optimal diacritic placement in many cases in the actual state of the art ? and also parse ASCII art correspondingly, unlike what happened in another example shared on Twitter downthread of the math letters tweet: https://twitter.com/ourelectra/status/1083367552430989315 Thanks, Marcel From unicode at unicode.org Sat Jan 12 19:22:08 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 Jan 2019 01:22:08 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> Message-ID: Asmus Freytag wrote, > ...What this teaches you is that italicizing (or boldfacing) > text is fundamentally related to picking out parts of your > text in a different font. Typically from the same typeface, though. 
> So those screen readers got it right, except that they could > have used one of the more typical notational conventions that > the mathalphabetics are used to express (e.g. "vector" etc.), > rather than rattling off the Unicode name. WRT text-to-voice applications, such as "VoiceOver", I wonder how well they would do when encountering /any/ exotic text runs or characters.? Like Yi, or Vai, or even an isolated CJK ideograph in otherwise Latin text.? For example:? "The Han radical # 72, which looks like '?', means 'sun'."? Would the application "say" the character as a Japanese reader would expect to hear it?? Or in one of the Chinese dialects?? Or would the application just give the hex code point? In an era where most of the states in my country no longer teach cursive writing in public schools, it seems unlikely that Twitter users (and so forth) will be clamoring for the ability to implement Chicago Style text properly on their cell phone screens.? (Many users would probably prefer to use the cell phone to order a Chicago style pizza.)? But, stranger things have happened. From unicode at unicode.org Sat Jan 12 20:15:35 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 12 Jan 2019 21:15:35 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> References: <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> Message-ID: <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> Just to add some more fuel for this fire, I note also the highly popular (in some places) technique of using Unicode letters that may have nothing whatsoever to do with the symbol or letter you mean to represent, apart from coincidental resemblance and looking "cool" enough.? This happens a lot on Second Life, where you can set your "display name" distinct from your "user name", but the display name appears to be limited to Unicode *letters* and some punctuation, mostly, and certainly can't be outside the BMP.? So for a sampling from stuff I've heard of... ?bi??? S???lS?ul ?P??D ???????? ?? ?? ?ud? ??itm?? ????? B??D???? ?L????? ????? Fashionablez ????? ?ha?g ???? M??????? ??????? ? . ?u ?u ????? ???? ????? ??n ???o ?'M ??????? ??????? ?????MM??? ?????? ????? ??????u?? ??????? ???? ?r?? ?????? ? ?? ? ? ? ? :. ??Z?R? ????? ?J?????? ?cH ???????? ????? ? Amy ? ????? G?????L? ?????t Wu?????? ?h??h? ??c??????? ???? Jarah Sparks???? ?? fleur ?? ????? ?????? ???- Pandora Barbaros???- ? ??????? ?-x- ????? ??u??? ?l?? ???l?? ???? ????? ?? Gatatem ????? ??? I could do more searching... Some of these things are even more common than shown here.? Using ? for a heart ? is extremely widespread, and decorations like ? and ? abound.? Note some decorations involving ? with some Arabic(!) combining characters. 
Note the use of Hebrew and Arabic and CJK and other characters to represent Latin letters to which they bear only a passing resemblance.? There are also a lot of names in all small-caps or all full-width (I didn't include any examples of just that because they seemed so ordinary), or "inverted"? ?uo???s?? ??nsn ??? u?? I don't know what, precisely, this argues for or against.? Would people deny that this is an "abuse" of the character-set, even though people are doing it and it works for them?? The medium is pretty indisputably plain-text.? Should all this kind of thing be somehow made to "work" for these creative, if mystifying, people? These are clearly pretty far-out examples (though not extreme, compared to what's out there, nor uncommon, from what I have been told.) This discussion has been very interesting, really.? I've heard what I thought were very good points and relevant arguments from both/all sides, and I confess to not being sure which I actually prefer.? Just giving you more to think about... ~mark From unicode at unicode.org Sat Jan 12 20:17:34 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 Jan 2019 02:17:34 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <7f64c5a1.9721.16842e34f54.Webtop.70@btinternet.com> References: <3b885166-58c3-f970-829e-e6d521259670@gmail.com> <9aca09ac-d0a6-34eb-9a92-ad27c60e96d2@ix.netcom.com> <0442806e-6f9f-34c0-a644-d1813b2a6fc3@orange.fr> <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <7f64c5a1.9721.16842e34f54.Webtop.70@btinternet.com> Message-ID: On 2019-01-12 4:26 PM, wjgo_10009 at btinternet.com wrote: > I have now made, tested and published a font, VS14 Maquette, that uses > VS14 to indicate italic. > > https://forum.high-logic.com/viewtopic.php?f=10&t=7831&p=37561#p37561 > The italics don't happen in Notepad, but VS14 Maquette works spendidly in LibreOffice!? (Windows 7)? (In a *.TXT file) Since the VS characters are supposed to be used with officially registered/recognized sequences, it's possible that Notepad isn't trying to implement the feature. The official reception of the notion of using variant letter forms, such as italics, in plain-text is typically frosty.? So advancement of plain-text might be left up to third-party developers, enthusiasts, and the actual text users.? And there's nothing wrong with that.? (It's non-conformant, though, unless the VS material is officially recognized/registered.) Non-Latin scripts, such as Khmer, may have their own traditions and conventions WRT special letter forms.? Which is why starting at VS14 and working backwards might be inadequate in the long run. Khmer has letter forms called muul/moul/muol (not sure how to spell that one, but neither is anybody else).? It superficially resembles fraktur for Khmer.? Other non-Latin scripts may have a plethora of such forms/fonts/styles. 
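The convention William's font implements is simple to generate. A minimal sketch, assuming Python 3; the choice of VS14 just follows the posts above, and nothing here is a registered variation sequence: each letter is followed by VS14 so that a font which maps <letter, VS14> pairs to italic glyphs can show the run as italic, while the underlying letters stay unchanged for searching and collation.

    VS14 = "\uFE0D"   # VARIATION SELECTOR-14

    def tag_with_vs14(text: str) -> str:
        # Follow every base character with VS14; VS-ignorant software still sees
        # the same letters, because variation selectors are default-ignorable.
        return "".join(ch + VS14 for ch in text)

    sample = tag_with_vs14("italicize")
    print(sample.replace(VS14, "") == "italicize")   # True: the word itself is intact
    print(len(sample))                               # 18: nine letters plus nine selectors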
From unicode at unicode.org Sat Jan 12 22:24:29 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 Jan 2019 04:24:29 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> Message-ID: <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> Mark E. Shoulson wrote, > This discussion has been very interesting, really.? I've heard what I > thought were very good points and relevant arguments from both/all > sides, and I confess to not being sure which I actually prefer. It's subjective, really.? It depends on how one views plain-text and one's expectations for its future.? Should plain-text be progressive, regressive, or stagnant?? Because those are really the only choices.? And opinions differ. Most of us involved with Unicode probably expect plain-text to be around for quite a while.? The figure bandied about in the past on this list is "a thousand years".? Only a society of mindless drones would cling to the past for a millennium.? So, many of us probably figure that strictures laid down now will be overridden as a matter of course, over time. Unicode will probably be around for awhile, but the barrier between plain- and rich-text has already morphed significantly in the relatively short period of time it's been around. I became attracted to Unicode about twenty years ago.? Because Unicode opened up entire /realms/ of new vistas relating to what could be done with computer plain text.? I hope this trend continues. From unicode at unicode.org Sun Jan 13 02:20:36 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Sun, 13 Jan 2019 08:20:36 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> Message-ID: <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> On 2019/01/13 13:24, James Kass via Unicode wrote: > > Mark E. 
Shoulson wrote, > > This discussion has been very interesting, really. I've heard what I > thought were very good points and relevant arguments from both/all > sides, and I confess to not being sure which I actually prefer. > > It's subjective, really. It depends on how one views plain-text and > one's expectations for its future. Should plain-text be progressive, > regressive, or stagnant? Because those are really the only choices. And > opinions differ. I'd say it should be conservative. As the meaning of that word (similar to others such as progressive and regressive) may be interpreted in various ways, here's what I mean by that. It should not take up and extend every little fad in the blink of an eye. It should wait to see what the real needs are, and what may be just a temporary fad. As the Mathematical style variants show, once characters are encoded, it's difficult to get people off using them, even in ways not intended. Emoji have often been cited in this thread. But there are some important observations: 1) Emoji were added to Unicode only after it turned out that they were widely used in Japanese character encodings, and dripping into Unicode-based systems in large numbers but without any clearly assigned code points. The Unicode Consortium didn't start encoding them because they thought emoji were cute or progressive or anything like that. 2) The Unicode Consortium is continuing to hold down the number of newly encoded emoji by using an approximate limit for each year and a strict process. 3) The Unicode Consortium is somewhat motivated to encode new emoji because of the publicity surrounding them. That publicity might subside sooner or later. It's difficult to imagine the same kind of publicity for italics and friends. > Most of us involved with Unicode probably expect plain-text to be around > for quite a while. The figure bandied about in the past on this list is > "a thousand years". Only a society of mindless drones would cling to > the past for a millennium. So, many of us probably figure that > strictures laid down now will be overridden as a matter of course, over > time. > > Unicode will probably be around for awhile, but the barrier between > plain- and rich-text has already morphed significantly in the relatively > short period of time it's been around. Because whatever is encoded can't be "unencoded", it's clear that we can only move in one direction, and not back. But because we want Unicode to work for a long, long time, it's very important to be conservative. > I became attracted to Unicode about twenty years ago. Because Unicode > opened up entire /realms/ of new vistas relating to what could be done > with computer plain text. I hope this trend continues. I hope this trend only continues very slowly, if at all. Regards, Martin.
From unicode at unicode.org Sun Jan 13 02:22:37 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Sun, 13 Jan 2019 08:22:37 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <93c4be0a-a591-627e-d7b5-58142859dca9@ix.netcom.com> Message-ID: <85c6455c-aae0-510e-63ed-69358ce96a2f@it.aoyama.ac.jp> On 2019/01/13 03:50, Asmus Freytag via Unicode wrote: > To reiterate, if you effectively require a span (even if you could simulate that > differently) you are in the realm or rich text. The one big exception to that is > bidi, because it is utterly impossible to do bidi text without text ranges. > Therefore, Unicode plain text explicitly violates that principle in favor of > achieving a fundamental goal of universality, that is being able to include the > bidi languages. Yes, and in HTML, where higher-level (span-based) mechanisms are available, it is preferred to use these rather than the bidi control characters. Regards, Martin. From unicode at unicode.org Sun Jan 13 10:44:58 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sun, 13 Jan 2019 16:44:58 +0000 (GMT) Subject: A last missing link for interoperable representation References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: On 2019-01-12, James Kass via Unicode wrote: > Sounds like you didn't try it.? VS characters are default ignorable. By software that has a full understanding of Unicode. There is a very large world out there of software that was written before Unicode was dreamed of, let alone popular. > apricot > a?p?r?i?c?o?t? > Notepad finds them both if you type the word "apricot" into the search box. What has Notepad to do with me? > "But for plain text, it's crazy." > > Are you a member of the plain-text user community? Certainly:) -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
From unicode at unicode.org Sun Jan 13 10:46:42 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sun, 13 Jan 2019 16:46:42 +0000 (GMT) Subject: A last missing link for interoperable representation References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> Message-ID: On 2019-01-12, Richard Wordingham via Unicode wrote: > On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) > Julian Bradfield via Unicode wrote: > >> It's also fundamentally misguided. When I _italicize_ a word, I am >> writing a word composed of (plain old) letters, and then styling the >> word; I am not composing a new and different word ("_italicize_") that >> is distinct from the old word ("italicize") by virtue of being made up >> of different letters. > > And what happens when you capitalise a word for emphasis or to begin a > sentence? Is it no longer the same word? Indeed. As has been observed up-thread, the casing idea is a dumb one! We are, however, stuck with it because of legacy encoding transported into Unicode. We aren't stuck with encoding fonts into Unicode. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Sun Jan 13 10:52:25 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sun, 13 Jan 2019 16:52:25 +0000 (GMT) Subject: A last missing link for interoperable representation References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: On 2019-01-12, James Kass via Unicode wrote: > This is a math formula: > a + b = b + a > ... where the estimable "mathematician" used Latin letters from ASCII as > though they were math alphanumerics variables. Yup, and it's immediately understandable by anyone reading on any computer that understands ASCII. That's why mathematicians write like that in plain text. > This is an italicized word: > ???????????????????????? > ... where the "geek" hacker used Latin italics letters from the math > alphanumeric range as though they were Latin italics letters. 
It's a sequence of question marks unless you have an up to date Unicode font set up (which, as it happens, I don't for the terminal in which I read this mailing list). Since actual mathematicians don't use the Unicode math alphabets, there's no strong incentive to get updated fonts. > Where's the harm? You lose your audience for no reasons other than technogeekery. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Sun Jan 13 14:38:45 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 13 Jan 2019 21:38:45 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: On 13/01/2019 17:52, Julian Bradfield via Unicode wrote: > On 2019-01-12, James Kass via Unicode wrote: >> This is a math formula: >> a + b = b + a >> ... where the estimable "mathematician" used Latin letters from ASCII as >> though they were math alphanumerics variables. > > Yup, and it's immediately understandable by anyone reading on any > computer that understands ASCII. That's why mathematicians write like > that in plain text. As far as the information shared on this list so far goes, mathematicians are both using TeX and liking the Unicode math alphabets. > >> This is an italicized word: >> ???????????????????????? >> ... where the "geek" hacker used Latin italics letters from the math >> alphanumeric range as though they were Latin italics letters. > > It's a sequence of question marks unless you have an up to date > Unicode font set up (which, as it happens, I don't for the terminal in > which I read this mailing list). Since actual mathematicians don't use > the Unicode math alphabets, there's no strong incentive to get updated > fonts. These statements make me fear that the font you are using might not support the NARROW NO-BREAK SPACE U+202F > <. If you see a question mark between these pointy brackets, please let us know. Because then, you're unable to read interoperably usable French text, too, as you'll see double punctuation (e.g. "?!") where a single mark is intended, like here !  There is a crazy typeface out there, misleadingly called 'Courier New', as if the foundry didn't anticipate that at some point it would be better called "Courier Obsolete". Or they did, but… (Referring to CLDR ticket #11423.) BTW if anybody knows a version of Courier New updated to a decent level of Unicode support, please be so kind and share the link so I can spread the word. > >> Where's the harm? > > You lose your audience for no reasons other than technogeekery.
Aiming at extending the subset of environments supporting correct typesetting is no geekery but awareness of our cultural heritage that we?re committed to maintain and to develop, taking it over into the digital world while adapting technology to culture, not conversely. Best regards, Marcel From unicode at unicode.org Sun Jan 13 15:43:11 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Sun, 13 Jan 2019 23:43:11 +0200 Subject: A last missing link for interoperable representation In-Reply-To: References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: <20190113214311.GA1281@macbook.localdomain> On Sun, Jan 13, 2019 at 04:52:25PM +0000, Julian Bradfield via Unicode wrote: > On 2019-01-12, James Kass via Unicode wrote: > > This is an italicized word: > > ???????????????????????? > > ... where the "geek" hacker used Latin italics letters from the math > > alphanumeric range as though they were Latin italics letters. > > It's a sequence of question marks unless you have an up to date > Unicode font set up (which, as it happens, I don't for the terminal in > which I read this mailing list). Since actual mathematicians don't use > the Unicode math alphabets, there's no strong incentive to get updated > fonts. They do, but not necessarily by directly inputting them. LaTeX with the ?unicode-math? package will translate ASCII + font switches to the respective Unicode math alphanumeric characters. Word will do the same. Even browsers rendering MathML will do the same (though most likely the MathML source will have the math alphanumeric characters already). Regards, Khaled From unicode at unicode.org Sun Jan 13 17:36:24 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 13 Jan 2019 23:36:24 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: Julian Bradfield replied, >> Sounds like you didn't try it.? VS characters are default ignorable. > > By software that has a full understanding of Unicode. There is a very > large world out there of software that was written before Unicode was > dreamed of, let alone popular. ??? ?? ???? ????? ??? ?? ??? ?? ??? ???, ?? ????? ????? (*) ?????? What happens with Devanagari text?? Should the user community refrain from interchanging data because 1980s era software isn't Unicode aware? 
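Khaled Hosny's note above, that tools such as the LaTeX "unicode-math" package and Word map ASCII letters plus font switches onto the Unicode math alphanumeric code points, can be made concrete with a minimal Python sketch. This is an illustration only, not how unicode-math itself is implemented, and the helper name to_math_italic is invented for the example. The one irregularity worth showing is that lowercase italic h is not at U+1D455 (which is unassigned) but at U+210E PLANCK CONSTANT, and that NFKC normalization folds all of these letters back to plain ASCII.

    import unicodedata

    ITALIC_CAPITAL_A = 0x1D434   # U+1D434 MATHEMATICAL ITALIC CAPITAL A
    ITALIC_SMALL_A   = 0x1D44E   # U+1D44E MATHEMATICAL ITALIC SMALL A
    PLANCK_CONSTANT  = "\u210E"  # italic small h lives here; U+1D455 is unassigned

    def to_math_italic(text: str) -> str:
        """Shift ASCII letters into the Mathematical Italic range; leave everything else alone."""
        out = []
        for ch in text:
            if ch == "h":
                out.append(PLANCK_CONSTANT)
            elif "A" <= ch <= "Z":
                out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
            elif "a" <= ch <= "z":
                out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
            else:
                out.append(ch)
        return "".join(out)

    styled = to_math_italic("italicized")
    print(styled)                                 # the word rendered in math italic letters
    print(unicodedata.normalize("NFKC", styled))  # "italicized": the styling does not survive NFKC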
From unicode at unicode.org Sun Jan 13 21:00:36 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Mon, 14 Jan 2019 03:00:36 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> Message-ID: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> On 2019/01/14 01:46, Julian Bradfield via Unicode wrote: > On 2019-01-12, Richard Wordingham via Unicode wrote: >> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) >> And what happens when you capitalise a word for emphasis or to begin a >> sentence? Is it no longer the same word? > > Indeed. As has been observed up-thread, the casing idea is a dumb one! > We are, however, stuck with it because of legacy encoding transported > into Unicode. We aren't stuck with encoding fonts into Unicode. No, the casing idea isn't actually a dumb one. As Asmus has shown, one of the best ways to understand what Unicode does with respect to text variants is that style works on spans of characters (words,...), and is rich text, but thinks that work on single characters are handled in plain text. Upper-case is definitely for most part a single-character phenomenon (the recent Georgian MTAVRULI additions being the exception). UPPER CASE can be used on whole spans of text, but that's not the main use case. And if UPPER CASE is used for emphasis, one way to do it (and the best way if this is actually a styling issue) is to use rich text and mark it up according to semantics, and then use some styling directive (e.g. CSS text-transform: uppercase) to get the desired look. Another criterion is orthography. Schoolchildren learn when to capitalize a word and when not. Teachers check and correct it all the time. Grammar books and books for second language learners discuss capitalization, because it's part of orthography, the rules differ by language, and not getting it right will make the writer look bad. But even most adults won't know the rules for what to italicize that have been brought up in this thread. Even if they have read books that use italic and bold in ways that have been brought up in this thread, most readers won't be able to tell you what the rules are. That's left to copy editors and similar specialist jobs. There was a time when computers (and printers in particular) were single-case. There was some discussion about having to abolish case distinctions to adapt to computers, but fortunately, that wasn't necessary. Regards, Martin. 
From unicode at unicode.org Sun Jan 13 22:31:35 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 13 Jan 2019 20:31:35 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> Message-ID: On Sat, Jan 12, 2019 at 8:26 PM James Kass via Unicode wrote: > It's subjective, really. It depends on how one views plain-text and > one's expectations for its future. Should plain-text be progressive, > regressive, or stagnant? Because those are really the only choices. > And opinions differ. > > Most of us involved with Unicode probably expect plain-text to be around > for quite a while. The figure bandied about in the past on this list is > "a thousand years". Only a society of mindless drones would cling to > the past for a millennium. So, many of us probably figure that > strictures laid down now will be overridden as a matter of course, over > time. And yet you write this in the Latin script that's been around for a couple millennia. Arabic, Han ideographs, Cyrillic and Devanagari have all been around a millennia. Looking back at the history of computing, a large chunk of the underlying technology has hit stability. ARM chips, x86 chips, Unix, and Windows have all been around since 1985 or before, roughly 35 years ago and 35 years since the first programmed computer. They aren't wildly changing. Unicode is moving towards that position; it does a job and doesn't need disrupt changes to continue to be relevant. > Unicode will probably be around for awhile, but the barrier between > plain- and rich-text has already morphed significantly in the relatively > short period of time it's been around. Fixed pictures have been parts of character sets for decades and were part of Unicode 1.1. U+2704, WHITE SCISSORS, for example. And emoji aren't disruptive in the way that moving something that's been a part of the rich-text layer forever into the plain-text layer. > I became attracted to Unicode about twenty years ago. Because Unicode > opened up entire /realms/ of new vistas relating to what could be done > with computer plain text. I hope this trend continues. The right tool for the job. If you need rich text, you should use rich text. Emoji had to make the case that they were being used as characters and there were no competing tools to handle them. -- Kie ekzistas vivo, ekzistas espero. 
From unicode at unicode.org Sun Jan 13 23:06:04 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 13 Jan 2019 21:06:04 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> Message-ID: On Sun, Jan 13, 2019 at 7:03 PM Martin J. D?rst via Unicode wrote: > No, the casing idea isn't actually a dumb one. As Asmus has shown, one > of the best ways to understand what Unicode does with respect to text > variants is that style works on spans of characters (words,...), and is > rich text, but thinks that work on single characters are handled in > plain text. Upper-case is definitely for most part a single-character > phenomenon (the recent Georgian MTAVRULI additions being the exception). I would disagree; upper case is normally used in all caps or title-case, and the latter is used on a word, not a character. I don't argue that Unicode is wrong for handling casing the way it does, but it does massively complicate the processing of any Latin text; virtually all searches should be case-insensitive, for example. At least in English, computerized casing will always be problematic. > UPPER CASE can be used on whole spans of text, but that's not the main > use case. And if UPPER CASE is used for emphasis, one way to do it (and > the best way if this is actually a styling issue) is to use rich text > and mark it up according to semantics, and then use some styling > directive (e.g. CSS text-transform: uppercase) to get the desired look. That's an example of how having multiple systems makes things more complex and less consistent. If something can be written as all upper case with the caps lock key, it will be. If a generated HTML file can have uppercase added with a Python or SQL function, it probably will be. Using CSS text-transform may be best practice, but simpler plain text solutions will be used in a lot of cases and nothing can be extrapolated clearly from its use or lack of use. -- Kie ekzistas vivo, ekzistas espero. 
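David Starner's point that virtually all searches over cased Latin text have to be case-insensitive is easy to demonstrate. The snippet below is a minimal sketch using only Python's standard string methods, with arbitrary example words; it shows why "case-insensitive" has to mean full case folding rather than a simple lower() comparison once characters like the German sharp s are involved.

    queries = ["STRASSE", "Strasse"]
    stored = "Straße"   # sharp s: uppercasing it yields the two-letter "SS"

    for q in queries:
        naive  = q.lower() == stored.lower()        # "strasse" != "straße" -> False
        folded = q.casefold() == stored.casefold()  # both sides fold to "strasse" -> True
        print(q, naive, folded)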
From unicode at unicode.org Sun Jan 13 23:08:37 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 05:08:37 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> Marcel Schneider wrote, > There is a crazy typeface out there, misleadingly called 'Courier New', > as if the foundry didn?t anticipate that at some point it would be better > called "Courier Obsolete". ... ?????? ?????????????? seems a bit ????????? nowadays, as well. (Had to use mark-up for that ?span? of a single letter in order to indicate the proper letter form.? But the plain-text display looks crazy with that HTML jive in it.) From unicode at unicode.org Mon Jan 14 00:02:21 2019 From: unicode at unicode.org (Tex via Unicode) Date: Sun, 13 Jan 2019 22:02:21 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c 266@it.aoyama.ac.jp> Message-ID: <000901d4abce$b781b250$268516f0$@xencraft.com> > But even most adults won't know the rules for what to italicize that > have been brought up in this thread. Even if they have read books that > use italic and bold in ways that have been brought up in this thread, > most readers won't be able to tell you what the rules are. That's left > to copy editors and similar specialist jobs. Most adults don't know the right places to soft-hyphenate a word, and yet we support that in plain-text. They also don't know the differences between the various dashes and spaces and when to use each. Literacy isn't an appropriate criteria. Even the apostrophe fails that test since so many people fail to distinguish its from it's and there from they're. :-) > There was a time when computers (and printers in particular) were > single-case. There was some discussion about having to abolish case > distinctions to adapt to computers, but fortunately, that wasn't necessary. 
Ironic to mention the example of the failure of technology to support linguistic requirements driving a proposal to limit the attributes of language. As you say it was fortunate it wasn't necessary then... It makes the case for the importance of improving technology to support fundamental language attributes. tex From unicode at unicode.org Mon Jan 14 00:19:29 2019 From: unicode at unicode.org (Tex via Unicode) Date: Sun, 13 Jan 2019 22:19:29 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> Message-ID: <001001d4abd1$1becb800$53c62800$@xencraft.com> "Looking back at the history of computing, a large chunk of the underlying technology has hit stability. ARM chips, x86 chips, Unix, and Windows have all been around since 1985 or before, roughly 35 years ago and 35 years since the first programmed computer. They aren't wildly changing." I would encourage you to return to a system of 35 years ago, if you believe they are the same. Performance, pipeline, memory access, device support, graphical capabilities, underlying instructions, security features... One could argue the wheel is medieval and still works today, but the wheels I drive on are designed for a variety of weather conditions, traction, minimal noise generation, light weight with durability and high performance, and are particular to the front or back axle. And I know from experience the wrong wheels can spin me around and ram me into a median... tex From unicode at unicode.org Mon Jan 14 00:24:46 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 06:24:46 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> References: <4bd54e06-6e26-2687-5282-6eda9621da5d@att.net> <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> Message-ID: <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> Martin J. D?rst wrote, > I'd say it should be conservative. 
As the meaning of that word > (similar to others such as progressive and regressive) may be > interpreted in various way, here's what I mean by that. > > It should not take up and extend every little fad at the blink of an > eye. It should wait to see what the real needs are, and what may be > just a temporary fad. As the Mathematical style variants show, once > characters are encoded, it's difficult to get people off using them, > even in ways not intended. A conservative approach to progress is a sensible position for computer character encoders.? Taking a conservative approach doesn't necessarily mean being anti-progress. Trying to "get people off" using already encoded characters, whether or not the encoded characters are used as intended, might give an impression of being anti-progress. Unicode doesn't enforce any spelling or punctuation rules.? Unicode doesn't tell human beings how to pronounce strings of text or how to interpret them.? Unicode doesn't push any rules about splitting infinitives or conjugating verbs. Unicode should not tell people how any written symbol must be interpreted.? Unicode should not tell people how or where to deploy their own written symbols. Perhaps fraktur is frivolous in English text.? Perhaps its use would result in a new convention for written English which would enhance the literary experience.? Italics conventions which have only been around a hundred years or so may well turn out to be just a passing fad, so we should probably give it a bit more time. Telling people they mustn't use Latin italics letter forms in computer text while we wait to see if the practice catches on seems flawed in concept. From unicode at unicode.org Mon Jan 14 01:26:36 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 14 Jan 2019 07:26:36 +0000 (GMT) Subject: A last missing link for interoperable representation References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: On 2019-01-13, Marcel Schneider via Unicode wrote: > As far as the information goes that was running until now on this List, > Mathematicians are both using TeX and liking the Unicode math alphabets. As Khaled has said, if they use them, it's because some software designer has decided to use them to implement markup. I have never seen a Unicode math alphabet character in email outside this list. > These statements make me fear that the font you are using might unsupport > the NARROW NO-BREAK SPACE U+202F >?<. If you see a question mark between It displays as a space. As one would expect - I use fixed width fonts for plain text. > these pointy brackets, please let us know. Because then, You?re unable to > read interoperably usable French text, too, as you?ll see double punctuation > (eg "?!") where a single mark is intended, like here?! I see "like here !". French text does not need narrow spacing any more than science does. 
When doing typography, fifty centimetres is $50\thinspace\mathrm{cm}$; in plain text, 50cm does just fine. Likewise, normal French people writing email write "Quel idiot!", or sometimes "Quel idiot !". If you google that phrase on a few French websites, you'll see that some (such as Larousse, whom one might expect to care about such things) use no space before punctuation, while others (such as some random T-shirt company) use an ASCII space. The Acad?mie Fran?aise, which by definition knows more about French orthography than you do, uses full ASCII spaces before ? and ! on its front page. Also after opening guillemets, which looks even more stupid from an Anglophone perspective. > Aiming at extending the subset of environments supporting correct typesetting There are many fine programs, including TeX, for doing good typesetting. Unicode is not about typesetting, it's about information exchange and preservation. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon Jan 14 01:28:18 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 14 Jan 2019 07:28:18 +0000 (GMT) Subject: A last missing link for interoperable representation References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> Message-ID: On 2019-01-14, James Kass via Unicode wrote: > ?????? ?????????????? seems a bit ????????? nowadays, as well. > > (Had to use mark-up for that ?span? of a single letter in order to > indicate the proper letter form.? But the plain-text display looks crazy > with that HTML jive in it.) Indeed. But _Art nouveau_ seems a bit _pass?_ nowadays looks fine and is understood even by those who have never annotated a manuscript with proof corrections. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon Jan 14 01:47:45 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 14 Jan 2019 07:47:45 +0000 (GMT) Subject: A last missing link for interoperable representation References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: On 2019-01-13, James Kass via Unicode wrote: > ??? ?? ???? ????? ??? ?? ??? ?? ??? ???, ?? ????? ????? (*) ?????? > What happens with Devanagari text?? 
Should the user community refrain > from interchanging data because 1980s era software isn't Unicode aware? Devanagari is an established writing system (which also doesn't need separate letters for different typefaces). Those who wish to exchange information in devanagari will use either an ISCII or Unicode system with suitable font support. Just as those who wish to exchange English text with typographic detail will use a suitable typographic mark-up system with font support, which will typically not interfere with plain text searching. Even in a PDF document, "art nouveau" will appear as "art nouveau" whatever font it's in. Incidentally, a large chunk of my facebook feed is Indian politics, and of that portion of it that is in Hindi or other Indian languages, most is still written in ASCII transcription, even though every web browser and social media application in common use surely has full Unicode support these days. Sometimes using your own writing system is just too much effort! -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon Jan 14 01:56:43 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 07:56:43 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> Julian Bradfield wrote, > I have never seen a Unicode math alphabet character in email > outside this list. It's being done though.? Check this message from 2013 which includes the following, copy/pasted from the web page into Notepad: ???????? ???? ????????.??????????????????? ? ???????? ???????? ????????? ????????????.??????/???????????????????? https://apple.stackexchange.com/questions/104159/what-are-these-characters-and-how-can-i-use-them From unicode at unicode.org Mon Jan 14 02:40:58 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 14 Jan 2019 08:40:58 +0000 (GMT) Subject: A last missing link for interoperable representation References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> Message-ID: On 2019-01-14, James Kass via Unicode wrote: > Julian Bradfield wrote, > > I have never seen a Unicode math alphabet character in email > > outside this list. 
> > It's being done though.? Check this message from 2013 which includes the > following, copy/pasted from the web page into Notepad: > > ???????? ???? ????????.??????????????????? ? ???????? ???????? ????????? > ????????????.??????/???????????????????? > > https://apple.stackexchange.com/questions/104159/what-are-these-characters-and-how-can-i-use-them Which makes the point very nicely. They're not being *used* to do maths, they're being played with for purely decorative purposes, and moreover in a way which breaks the actual intended use as a URL. If you introduce random stuff into Unicode, people will play with it (or use it for phishing). The whole thread is, as it says, "what is this weird stuff"? -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Mon Jan 14 02:48:05 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 Jan 2019 08:48:05 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <7f1887c9-bd10-9fe7-f099-3fe1d78551f1@gmail.com> Message-ID: <20190114084805.596db197@JRWUBU2> On Mon, 14 Jan 2019 07:47:45 +0000 (GMT) Julian Bradfield via Unicode wrote: > On 2019-01-13, James Kass via Unicode wrote: > > ??? ?? ???? ????? ??? ?? ??? ?? ??? ???, ?? ????? ????? (*) ?????? > > > What happens with Devanagari text?? Should the user community > > refrain from interchanging data because 1980s era software isn't > > Unicode aware? > > Devanagari is an established writing system (which also doesn't need > separate letters for different typefaces). Those who wish to exchange > information in devanagari will use either an ISCII or Unicode system > with suitable font support. Has ISCII kept abreast of additions to the encoded Devanagari script? Hindi may be an established writing system, but Vedic Sanskrit with a full details is another matter. Even with full Unicode support, having a 'suitable font' is an issue with 'plain text', even deprecated plain text. The problems are that writers of Hindi don't want to have to manually suppress ligature formation, and it doesn't help that tables of Hidi conjuncts don't express the difference between real and fake viramas. (The difference surfaces with preposed vowels.) > Just as those who wish to exchange English text with typographic > detail will use a suitable typographic mark-up system with font > support, which will typically not interfere with plain text searching. > Even in a PDF document, "art nouveau" will appear as "art nouveau" > whatever font it's in. But "art nouveau" is ASCII. Copying truly complex Indic from a PDF is still something of an adventure. > Incidentally, a large chunk of my facebook feed is Indian politics, > and of that portion of it that is in Hindi or other Indian > languages, most is still written in ASCII transcription, even though > every web browser and social media application in common use surely > has full Unicode support these days. 
I don't believe the USE has been added to IE 11, and certainly not on Windows 7. And I fear that of OpenType fonts, only mine widely support Tai Tham as documented on the Unicode site. (And 'widely' excludes IE 11, but not MS Edge.) A fair few Tai Tham fonts rely on being permitted to bypass the script-specific support, which the Windows stack only permits to privileged scripts. Richard. From unicode at unicode.org Mon Jan 14 03:06:47 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 09:06:47 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> Message-ID: <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> Not a twitter user, don't know how popular the practice is, but here's a couple of links concerned with how to use bold or italics in Twitter plain text messages. https://www.simplehelp.net/2018/03/13/how-to-use-bold-and-italicized-text-on-twitter/ https://mothereff.in/twitalics Both pages include a form of caveat.? But the caveat isn't about the intended use of the math alphanumerics. The first page includes the following text as part of a "tweet": Just because you ?????? doesn?t mean you ???????????? :) And, as before, I have no idea how /popular/ the practice is.? But here's some more links: (web page from 2013) How To Write In Italics, Tweet Backwards And Use Lots Of Different ... https://www.adweek.com/digital/twitter-font-italics-backwards/ (This is copy/pasted *as-is* from the web page to plain-text) Bold and Italic Unicode Text Tool - ???????? ?????? ?????????????? - YayText https://yaytext.com/bold-italic/ Super cool unicode text magic. Write ???????? and/or ???????????? updates on Facebook, Twitter, and elsewhere. Bold (serif) preview copy tweet. Michael Maurino [emoji redacted-JK] on Twitter: "Can I make italics on twitter? 'cause ... https://twitter.com/iron_stylus/status/281991180064022528?lang=en Charlie Brooker on Twitter: "How do you do italics on this thing again?" https://twitter.com/charltonbrooker/status/484623185862983680?lang=en How to make your Facebook and Twitter text bold or italic, and other ... https://boingboing.net/2016/04/10/yaytext-unicode-text-styling.html Apr 10, 2016 - For years I've been using the Panix Unicode Text Converter to create ironic, weird or simply annoying text effects for use on Twitter, Facebook ... How to change your Twitter font | Digital Trends https://www.digitaltrends.com/.../now-you-can-use-bold-italics-and-other-fancy-fonts-... Aug 14, 2013 - now you can use bold italics and other fancy fonts on twitter isaac ... or phrase into your Twitter text box, and there you have it: fancy tweets. Twitter Fonts Generator (???????? ?????? ??????????) ? LingoJam https://lingojam.com/TwitterFonts You might have noticed that some users on Twitter are able to change the font ... them to seemingly make their tweet font bold, italic, or just completely different. 
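Since the generators listed above all work by substituting math alphanumeric letters for ASCII ones, a consumer that needs the underlying text back (a search index, a screen reader) can fold it with NFKC compatibility normalization, which maps those letters to ordinary ASCII. The snippet below is a minimal sketch of that step, with an arbitrarily chosen example word; it is not how any of the linked tools operate.

    import unicodedata

    SANS_BOLD_SMALL_A = 0x1D5EE   # U+1D5EE MATHEMATICAL SANS-SERIF BOLD SMALL A
    fancy = "stay " + "".join(chr(SANS_BOLD_SMALL_A + ord(c) - ord("a")) for c in "bold")

    print(unicodedata.name(fancy[5]))                       # MATHEMATICAL SANS-SERIF BOLD SMALL B
    print(unicodedata.normalize("NFKC", fancy))             # "stay bold", plain ASCII again
    print("bold" in unicodedata.normalize("NFKC", fancy))   # True, so a folded search still finds it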
From unicode at unicode.org Mon Jan 14 03:45:42 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Mon, 14 Jan 2019 09:45:42 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> Message-ID: <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> Hello James, others, From the examples below, it looks like a feature request for Twitter (and/or Facebook). Blaming the problem on Unicode doesn't seem to be appropriate. Regards, Martin. On 2019/01/14 18:06, James Kass via Unicode wrote: > > Not a twitter user, don't know how popular the practice is, but here's a > couple of links concerned with how to use bold or italics in Twitter > plain text messages. > > https://www.simplehelp.net/2018/03/13/how-to-use-bold-and-italicized-text-on-twitter/ > > https://mothereff.in/twitalics > > Both pages include a form of caveat.? But the caveat isn't about the > intended use of the math alphanumerics. > > The first page includes the following text as part of a "tweet": > Just because you ?????? doesn?t mean you ???????????? :) > > And, as before, I have no idea how /popular/ the practice is.? But > here's some more links: > > (web page from 2013) > How To Write In Italics, Tweet Backwards And Use Lots Of Different ... > https://www.adweek.com/digital/twitter-font-italics-backwards/ > > (This is copy/pasted *as-is* from the web page to plain-text) > Bold and Italic Unicode Text Tool - ???????? ?????? ?????????????? - > YayText > https://yaytext.com/bold-italic/ > Super cool unicode text magic. Write ???????? and/or ???????????? > updates on Facebook, Twitter, and elsewhere. Bold (serif) preview copy > tweet. > > Michael Maurino [emoji redacted-JK] on Twitter: "Can I make italics on > twitter? 'cause ... > https://twitter.com/iron_stylus/status/281991180064022528?lang=en > > Charlie Brooker on Twitter: "How do you do italics on this thing again?" > https://twitter.com/charltonbrooker/status/484623185862983680?lang=en > > How to make your Facebook and Twitter text bold or italic, and other ... > https://boingboing.net/2016/04/10/yaytext-unicode-text-styling.html > Apr 10, 2016 - For years I've been using the Panix Unicode Text > Converter to create ironic, weird or simply annoying text effects for > use on Twitter, Facebook ... > > How to change your Twitter font | Digital Trends > https://www.digitaltrends.com/.../now-you-can-use-bold-italics-and-other-fancy-fonts-... > > Aug 14, 2013 - now you can use bold italics and other fancy fonts on > twitter isaac ... or phrase into your Twitter text box, and there you > have it: fancy tweets. > > Twitter Fonts Generator (???????? ?????? ??????????) ? LingoJam > https://lingojam.com/TwitterFonts > You might have noticed that some users on Twitter are able to change the > font ... them to seemingly make their tweet font bold, italic, or just > completely different. 
> From unicode at unicode.org Mon Jan 14 03:57:18 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Mon, 14 Jan 2019 09:57:18 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> Message-ID: Hello James, others, On 2019/01/14 15:24, James Kass via Unicode wrote: > > Martin J. D?rst wrote, > > > I'd say it should be conservative. As the meaning of that word > > (similar to others such as progressive and regressive) may be > > interpreted in various way, here's what I mean by that. > > > > It should not take up and extend every little fad at the blink of an > > eye. It should wait to see what the real needs are, and what may be > > just a temporary fad. As the Mathematical style variants show, once > > characters are encoded, it's difficult to get people off using them, > > even in ways not intended. > > A conservative approach to progress is a sensible position for computer > character encoders.? Taking a conservative approach doesn't necessarily > mean being anti-progress. > > Trying to "get people off" using already encoded characters, whether or > not the encoded characters are used as intended, might give an > impression of being anti-progress. Using the expression "get people off" was indeed somewhat ambiguous. Of course we cannot forbid people to use Mathematical alphanumerics. There's no standards police, neither for Unicode nor most other standards. > Unicode doesn't enforce any spelling or punctuation rules.? Unicode > doesn't tell human beings how to pronounce strings of text or how to > interpret them.? Unicode doesn't push any rules about splitting > infinitives or conjugating verbs. > > Unicode should not tell people how any written symbol must be > interpreted.? Unicode should not tell people how or where to deploy > their own written symbols. Yes. But Unicode can very well say: These characters are for Math, and if you use them for anything else, that's your problem, and because they are used for Math, they support what's used in Math, and we won't add copies of accented characters or variant characters for style or [your proposal goes here] because that's not what Unicode is about. If you want real styling, then use applications that can do that, or try to convince your application provider to provide that. (Well, Unicode is more or less saying just exactly that currently.) And that's what I meant with "getting people off". If that then leads to less people (mis)using these characters, all the better. > Perhaps fraktur is frivolous in English text.? Perhaps its use would > result in a new convention for written English which would enhance the > literary experience.? 
Italics conventions which have only been around a > hundred years or so may well turn out to be just a passing fad, so we > should probably give it a bit more time. There's no need to give italic conventions more time. Of course they may die out, but they are very active now. And they are very actively supported in rich text, where they belong. > Telling people they mustn't use Latin italics letter forms in computer > text while we wait to see if the practice catches on seems flawed in > concept. The practice is already there. Lots of people use italics in rich text. That's just fine because that's the right thing to do. We don't need to muddy the waters. Regards, Martin. From unicode at unicode.org Mon Jan 14 04:08:04 2019 From: unicode at unicode.org (Tex via Unicode) Date: Mon, 14 Jan 2019 02:08:04 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> References: <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> Message-ID: <000901d4abf1$0ae95980$20bc0c80$@xencraft.com>

This thread has gone on for a bit and I question if there is any more light that can be shed. BTW, I admit to liking Asmus' definition, functions that span text, as a criterion for rich text. I also liked James' examples of the Twitter use case.

The arguments against italics seem to be:
- Unicode is plain text; italics is rich text.
- We haven't had it until now, so we don't need it.
- There are many rich-text solutions, such as HTML.
- There are ways to indicate or simulate italics in plain text, including using underscores or other characters, using characters that look italic (e.g. math), etc.
- Adding italicization might break existing software.
- The examples of existing Unicode characters that seem to represent rich text (emoji, interlinear annotation, et al.) have justifications.

The case for it is:
- Plain text still has tremendous utility, and rich text is not always an option.
- Simulations of italics are non-standard and therefore hurt interoperability: math characters are not supported universally, and underscores and other indicators are not a standard, nor are alternative fonts.
- There are legitimate needs for a standardized approach for interchange, accessibility (e.g. screen readers), search, Twitter, et al.
- Evidence of the demand is perhaps demonstrated by the number of simulations, and by the requests to vendors of plain-text apps (such as Twitter) for how to implement it.
- Supporting italics can be implemented without breaking existing documents and should be easily supported in modern Unicode apps.
- The impact on the standard of adding a character for italics (and another for bold, and perhaps a couple of others) is minuscule, as it fits into the VS model.
- The argument that italics is rich text is an ideological one.
However, as with other examples, there are cases where practicality should win out. ? This isn?t a slippery slope. Personally, I think the cost seems very low, both to the standard and to implementers. I don?t see a lot of risk that it will break apps. (At least not those that wouldn?t be broken by VS or other features in the standard.) It will help many apps. I think the benefits to interoperability, accessibility, search, standardization of text are significant. Perhaps the question should be put to twitter, messaging apps, text-to-voice vendors, and others whether it will be useful or not. If the discussion continues I would like to see more of a cost/benefit analysis. Where is the harm? What will the benefit to user communities be? tex -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 04:30:58 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 10:30:58 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> Message-ID: Hello Martin, others... > Blaming the problem on Unicode doesn't seem to be appropriate. I don't consider that there's any problem with plain text users exchanging plain text.? I give Unicode /credit/ for being the foundation of that ability.? Anyone imagining that I'm casting blame is under a misconception. There's plain text data out there stringing math alphanumerics into recognizable words.? It's being stored and shared and indexed.? I have no problem with that; I'm in favor of it. (Everyone, please let's focus on Tex Texin's latest post.? Wish I'd sent this post before his...) Best regards, James Kass From unicode at unicode.org Mon Jan 14 07:19:03 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 14 Jan 2019 14:19:03 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> Message-ID: <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> > On 14 Jan 2019, at 06:08, James Kass via Unicode wrote: > > ?????? ?????????????? 
seems a bit ????????? nowadays, as well. > > (Had to use mark-up for that ?span? of a single letter in order to indicate the proper letter form. But the plain-text display looks crazy with that HTML jive in it.) How about using U+0301 COMBINING ACUTE ACCENT: ??????????? From unicode at unicode.org Mon Jan 14 07:44:52 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 14 Jan 2019 14:44:52 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <20190113214311.GA1281@macbook.localdomain> References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <20190113214311.GA1281@macbook.localdomain> Message-ID: <7141EA4B-238B-4B21-B3E6-B7AB23C7023B@telia.com> > On 13 Jan 2019, at 22:43, Khaled Hosny via Unicode wrote: > > LaTeX with the > ?unicode-math? package will translate ASCII + font switches to the > respective Unicode math alphanumeric characters. Word will do the same. > Even browsers rendering MathML will do the same (though most likely the > MathML source will have the math alphanumeric characters already). For full translation, one probably has to use ConTexT and LuaTeX. Then, along with PDF, one can also generate HTML with MathML. From unicode at unicode.org Mon Jan 14 09:38:26 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 14 Jan 2019 16:38:26 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> Message-ID: On 14/01/2019 04:00, Martin J. D?rst via Unicode wrote: [?] > [?] As Asmus has shown, one of the best ways to understand what > Unicode does with respect to text variants is that style works on > spans of characters (words,...), and is rich text, but thinks that > work on single characters are handled in plain text. Upper-case is > definitely for most part a single-character phenomenon (the recent > Georgian MTAVRULI additions being the exception). Obviously the single-character rule also applies to superscript when used as ordinal indicator or more generally, as abbreviation indicator. Thanks for the hint, it?s all about interoperability and in this case too the point in using preformatted characters is a good one IIUC. Sorry for getting a little off-topic. 
There's also one reply on my to-do list where I'll do even more so; can't help it, given that it's our digital representation that's at stake, and due to past neglect on either side there's still a need to painfully lobby for each character while so many other important issues are out there… Best Regards, Marcel From unicode at unicode.org Mon Jan 14 13:14:34 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 14 Jan 2019 20:14:34 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> Message-ID: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> On 14/01/2019 06:08, James Kass via Unicode wrote: > > Marcel Schneider wrote, > >> There is a crazy typeface out there, misleadingly called 'Courier >> New', as if the foundry didn't anticipate that at some point it >> would be better called "Courier Obsolete". ... > > ?????? ?????????????? seems a bit ????????? nowadays, as well. > > (Had to use mark-up for that 'span' of a single letter in order to > indicate the proper letter form. But the plain-text display looks > crazy with that HTML jive in it.) > I apologize for seeming to question the font name ?????? ???? while targeting only the fact that this typeface is not updated to support the NNBSP. It just looks like the grand name is now misused to make people believe that if **this** great font is unsupporting it, it has a good reason to do so, and we should keep people off using that "exotic whitespace" otherwise than "intended", i.e. for Mongolian. Since fortunately TUS started backing its use in French (2014) and ended up raising this usage to the first place, I can't see why major vendors are both using this obsolete font as the monospace default in main software *and* do not seem to be thinking of updating its coverage. OK, in fact I *can* see a "good" reason, which I've hinted at in the cited ticket, but I won't dump it on the List again and again. Thanks for pointing out the flaw in my wording.
Best regards, Marcel From unicode at unicode.org Mon Jan 14 15:21:13 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 13:21:13 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 15:42:40 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 14 Jan 2019 22:42:40 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> Message-ID: On 14/01/2019 08:26, Julian Bradfield via Unicode wrote: > On 2019-01-13, Marcel Schneider via Unicode wrote: [?] >> These statements make me fear that the font you are using might unsupport >> the NARROW NO-BREAK SPACE U+202F >?<. If you see a question mark between > > It displays as a space. As one would expect - I use fixed width fonts > for plain text. It?s mainly that I suspected you could be using Courier New in the terminal. It?s default for plain text in main browsers, and there are devices whose copy of Courier New shows a .notdef box for U+202F. That?s at least what I ?nderstood from the feedback, and a test in my browser looked likewise. > >> these pointy brackets, please let us know. Because then, You?re unable to >> read interoperably usable French text, too, as you?ll see double punctuation >> (eg "?!") where a single mark is intended, like here?! > > I see "like here !". That?s fine, your font has support for . Thanks for reporting. The reason why I?m anxious to see that checked is that the impact on implementations of as the group separator is being assessed. > French text does not need narrow spacing any more than science does. > When doing typography, fifty centimetres is $50\thinspace\mathrm{cm}$; > in plain text, 50cm does just fine. By ?plain text? you probably mean *draft style*. I?m thinking that because "$50\thinspace\mathrm{cm}$" is not less plain text than "50cm". Indeed, in not understanding that sooner I was an idiot, naively believing that all Unicode List Members are using Unicode terminology. 
Turns out that that cannot be taken for granted, any more than knowing the preferences of French people regarding French text display while not being a Frenchman:

1. Most French people prefer that big punctuation be spaced off from the word it pertains to.

2. Most French people strongly dislike punctuation cut off by a line break, but cannot fix it because: a) the ordinary keyboard layout has no non-breaking spaces; b) the NNBSP readily available on peculiar keyboard layouts is buggy in most e-mail composers, ending up as breakable.

3. A significant part of French people strongly dislike angle quotes that are spaced off too far, as happens when using the ordinary NO-BREAK SPACE (U+00A0).

> Likewise, normal French people writing email write "Quel idiot!", or
> sometimes "Quel idiot !".

Normal people using normal keyboard layouts write with the readily available characters most of the time. This is why (to pick another example) French people abbreviate “numéro” to "nº", while on a British English or an American English keyboard layout we can’t normally expect anything else than "no", or "#" for “Number.” We’re not trying to keep people from writing fast and in draft style. What in the Unicode era every locale is expected to achieve is to enable normal users to get the accurate interoperable representation of their language while typing fast, as opposed to coding in TeX, which is like using InDesign with system spaces instead of Unicode. System spaces are not interoperable, nor is LaTeX \thinspace if it is non-breakable in LaTeX, which it obviously is, since it is used to represent the thin space between a number and a measurement unit. In Unicode as we know it, U+2009 THIN SPACE is breakable, and the worst thing here is that its duplicate encoding U+2008 PUNCTUATION SPACE is breakable too, instead of being non-breakable like U+2007 FIGURE SPACE. That is why there was a need to add U+202F NARROW NO-BREAK SPACE later. (More details in the cited CLDR ticket.)

> If you google that phrase on a few French websites, you'll see that
> some (such as Larousse, whom one might expect to care about such
> things) use no space before punctuation,

Thanks for catching that; the flaw shall be reported with a link to your email. You may also wish to look up this page:

https://communaute.lerobert.com/forum/LE-ROBERT-CORRECTEUR/LE-ROBERT-CORRECTEUR-CORRECTION-D-ORTHOGRAPHE-DICTIONNAIRES-ET-GUIDES/Espace-entre-le-meotet-le-point-d-interrogation/2918628/398261

reading: « Le logiciel Le Robert correcteur justement signale les espaces fines insécables si elles ne sont pas présentes sur le texte et propose la correction. » (“The Le Robert spellchecker does report the lack of narrow no-break spaces and proposes to fix it.”)

> while others (such as some
> random T-shirt company) use an ASCII space.
>
> The Académie Française, which by definition knows more about French
> orthography than you do, uses full ASCII spaces before ? and ! on its
> front page. Also after opening guillemets, which looks even more
> stupid from an Anglophone perspective.

(See point 3 above.) That is a very good point. Indeed this website is reasonably expected to be an example and a template of correct French typesetting. There are several reasons why actually it is not. The main reason is that it is not the work of the A.F. itself, but of webdesigners, webmasters and content managers, who are normal people like for any other website.
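(The mechanical part of the convention is small; here is a minimal sketch, assuming Python. The rule set is deliberately simplified, the colon in particular is commonly treated differently, and this is not a claim about any particular style guide.)

    import re

    NNBSP = '\u202F'   # NARROW NO-BREAK SPACE

    def french_spacing(text: str) -> str:
        """Insert U+202F before high punctuation and inside guillemets (simplified)."""
        # Any run of ordinary or no-break spaces (or nothing) before ; ! ? » becomes one NNBSP.
        text = re.sub(r'[ \u00A0]*([;!?»])', NNBSP + r'\1', text)
        # One NNBSP after an opening guillemet.
        text = re.sub(r'(«)[ \u00A0]*', r'\1' + NNBSP, text)
        return text

    print(french_spacing('Quel idiot!'))    # 'Quel idiot\u202f!'
    print(french_spacing('« Bonjour »'))    # '«\u202fBonjour\u202f»'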
They just haven?t got an appropriate keyboard layout yet, and that is ultimately my fault because in the nineties and later I didn?t care about computers and keyboard layouts. That may sound crazy but it isn?t really. French is needing so a peculiar keyboard layout to get its representation functional, useful and interoperable without slowing down typists, that numerous preconditions and time was needed to design it. Among the preconditions, Unicode did not have the needed non-breakable thin space when keyboarding was on in France. French typesetters were aware of the thin space needed with big punctuation marks (sometimes called tall or double punctuation). The style manual of the Imprimerie Nationale is unambiguous, and where it isn?t, its actual practice is to be followed. That leaves only the colon not with but with . I cannot post a scan or photo of the table at page 149, nor of the examples as they are typeset in the print book, because it?s copyrighted material, but you?re welcome to purchase your copy if you didn?t already. That guide is kind of quoted by the A.F. when it?s up to determine whether capital letters should be diacriticized or not. Philippe Verdy reported in 2015 on this List that in France, the colon too is widely typeset with , and that the Imprimerie Nationale conforms to the specs of its clients. > >> Aiming at extending the subset of environments supporting correct typesetting > > There are many fine programs, including TeX, for doing good > typesetting. Unicode is not about typesetting, it's about information > exchange and preservation. Yah and TeX is converting our code to Unicode, so that we have several formats to choose from when considering exchange and preservation. The point in having an interoperable digital representation of all natural languages is that normal people are not forced to use draft style when just writing their language on a computer. Best regards, Marcel From unicode at unicode.org Mon Jan 14 16:08:23 2019 From: unicode at unicode.org (Tex via Unicode) Date: Mon, 14 Jan 2019 14:08:23 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.n etcom.com> Message-ID: <001201d4ac55$ab716500$02542f00$@xencraft.com> Asmus, I agree 100%. Asking where is the harm was an actual question intended to surface problems. It wasn?t rhetoric for saying there is no harm. Also, it may not be obvious to social media, messaging platforms, that there is a possibility of a solution. Often when a problem exists for a long time, it fades into unconsciousness. The pain is accepted as that is the way it is and has to be. It becomes part of the culture. Asking if there is a pain and whether a solution would be welcomed is consciousness raising. I agree about leading standardization. I thought some legitimate needs were raised. 
The questions were designed to quantify the use case as well as the potential damage. I didn?t think anyone was recommending more math abuse. I thought it was raised as an example of people resorting to them as a solution for a need. Of course they are also an example of playful experimentation. Separately, Regarding messaging platforms, although twitter is one example in the social media space, today there are many business, commercial, and other applications that embed messaging capabilities for their communities and for servicing customers. I wouldn?t dismiss the need just based on twitter?s assessment or on the idea that social media is just for casual or ?fun? use. Clarity of communications can be significant for many organizations. Having the proposed capabilities in plain text rather than requiring all of the overhead of a more rich text solution could be a big win for these apps. tex From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag via Unicode Sent: Monday, January 14, 2019 1:21 PM To: unicode at unicode.org Subject: Re: A last missing link for interoperable representation On 1/14/2019 2:08 AM, Tex via Unicode wrote: Perhaps the question should be put to twitter, messaging apps, text-to-voice vendors, and others whether it will be useful or not. If the discussion continues I would like to see more of a cost/benefit analysis. Where is the harm? What will the benefit to user communities be? The "it does no harm" is never an argument "for" making a change. It's something of a necessary, but not a sufficient condition, in other words. More to the point, if there were platforms (like social media) that felt an urgent need to support styling without a markup language, and could articulate that need in terms of a proposal, then we would have something to discuss. (We might engage them in a discussion of the advisability of supporting "markdown", for example). Short of that, I'm extremely leery of "leading" standardization; that is, encoding things that "might" be used. As for the abuse of math alphabetics. That's happening whether we like it or not, but at this point represents playful experimentation by the exuberant fringe of Unicode users and certainly doesn't need any additional extensions. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 16:43:17 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 14 Jan 2019 22:43:17 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> Message-ID: Hans ?berg wrote, > How about using U+0301 COMBINING ACUTE ACCENT: ??????????? Thought about using a combining accent.? 
Figured it would just display with a dotted circle but neglected to try it out first.? It actually renders perfectly here.? /That's/ good to know.? (smile) From unicode at unicode.org Mon Jan 14 16:58:24 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 14 Jan 2019 14:58:24 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode wrote: > The arguments against italics seem to be: > > ? Unicode is plain text. Italics is rich text. > > ? We haven't had it until now, so we don't need it. > > ? There are many rich text solutions, such as html. > > ? There are ways to indicate or simulate italics in plain text including using underscore or other characters, using characters that look italic (eg math), etc. > > ? Adding Italicization might break existing software > > ? The examples of existing Unicode characters that seem to represent rich text (emoji, interlinear annotation, et al) have justifications. There generally shouldn't be multiple ways of doing things. For example, if you think that searching for certain text in italics is important, then having both HTML italics and Unicode italics are going to cause searches to fail or succeed unexpectedly, unless the underlying software unifies the two systems (an extra complexity). Searching for certain italicized text could be done today in rich text applications, were there actual demand for it. > ? Plain text still has tremendous utility and rich text is not always an option. Where? Twitter has the option of doing rich text, as does any closed system. In fact, Twitter is rich text, in that it hyperlinks web addresses. That Twitter has chosen not to support italics is a choice. If users don't like this, they could go another system, or use third-party tools to transmit rich text over Twitter. The use of underscores or markings for italics would be mostly compatible with human twitterers using the normal interface. Source code is an example of plain text, and yet adding italics into comments would require but a trivial change to editors. If the user audience cared, it would have been done. In fact, I suspect there exist editors and environments where an HTML subset is put into comments and rendered by the editors; certainly active links would be more useful in source code comments than italics. Lastly, the places where I still find massive use of plain text are the places this would hurt the most. GNU Grep's manpage shows no sign that it supports searching under any form of Unicode normalization. Same with GNU Less. Adding italics would just make searching plain text documents more complex for their users. The domain name system would just add them to the ban list, and they'd be used for spoofing in filenames and other less controlled but still sensitive environments. 
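(The search point is easy to demonstrate: the mathematical alphanumerics carry compatibility decompositions, so a folding step such as NFKC maps them back to ordinary letters, while a literal byte-level search does not. A minimal Python sketch; it shows one possible folding, not what grep or less actually implement.)

    import unicodedata

    # "Plain" spelled with MATHEMATICAL ITALIC letters (U+1D443, U+1D459, ...).
    fancy = '\U0001D443\U0001D459\U0001D44E\U0001D456\U0001D45B'
    plain = 'Plain'

    print(plain in fancy)                                  # False: a literal search misses it
    print(plain in unicodedata.normalize('NFKC', fancy))   # True: compatibility folding matches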
-- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Mon Jan 14 17:02:49 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 15 Jan 2019 00:02:49 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> Message-ID: <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> > On 14 Jan 2019, at 23:43, James Kass via Unicode wrote: > > Hans ?berg wrote, > > > How about using U+0301 COMBINING ACUTE ACCENT: ??????????? > > Thought about using a combining accent. Figured it would just display with a dotted circle but neglected to try it out first. It actually renders perfectly here. /That's/ good to know. (smile) It is a bit off here. One can try math, too: the derivative of ??(??) is ???(??). From unicode at unicode.org Mon Jan 14 17:21:02 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 15:21:02 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: <4763e887-33c5-7a82-fb2f-3357791b61bc@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Jan 14 17:37:15 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 14 Jan 2019 23:37:15 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> Message-ID: <20190114233715.0a46eb16@JRWUBU2> On Tue, 15 Jan 2019 00:02:49 +0100 Hans ?berg via Unicode wrote: > > On 14 Jan 2019, at 23:43, James Kass via Unicode > > wrote: > > > > Hans ?berg wrote, > > > > > How about using U+0301 COMBINING ACUTE ACCENT: ??????????? > > > > Thought about using a combining accent. Figured it would just > > display with a dotted circle but neglected to try it out first. It > > actually renders perfectly here. /That's/ good to know. (smile) > > It is a bit off here. One can try math, too: the derivative of ??(??) > is ???(??). No it isn't. You should be using a spacing character for differentiation. On the other hand, one uses a combining circumflex for Fourier transforms. Richard. From unicode at unicode.org Mon Jan 14 18:02:05 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 16:02:05 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <20190114233715.0a46eb16@JRWUBU2> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> <20190114233715.0a46eb16@JRWUBU2> Message-ID: An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Jan 14 18:05:42 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 16:05:42 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> Message-ID: <4d354d16-3730-00f3-647c-e8c512bd4abf@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 18:17:00 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 16:17:00 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <001201d4ac55$ab716500$02542f00$@xencraft.com> References: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.n etcom.com> <001201d4ac55$ab716500$02542f00$@xencraft.com> Message-ID: <0158d32d-63f4-a120-d3a5-389f206c232c@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 19:09:08 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 Jan 2019 01:09:08 +0000 Subject: A last missing link for interoperable representation In-Reply-To: <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> References: <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <25b45eae-e443-6cde-191f-0f27ad761ab3@kli.org> <29f7981a-9f88-11b0-a28b-3e7b54ec99ed@gmail.com> <3edf3287-1fe9-6444-bd12-42fbc0232094@it.aoyama.ac.jp> <34796c35-a574-438d-e842-83fd87a17e6d@gmail.com> Message-ID: <20190115010908.1f20e000@JRWUBU2> On Mon, 14 Jan 2019 06:24:46 +0000 James Kass via Unicode wrote: > Unicode doesn't enforce any spelling or punctuation rules.? Unicode > doesn't tell human beings how to pronounce strings of text or how to > interpret them. These are not statements that are both honest and true. Unicode lays down rules and recommendations which others may then enforce. 
In Indic scripts where LETTER A is not also a consonant, Unicode forbids writing where LETTER AA would do the same job, and most renderers enforce that rule. Similarly, in phonetically ordered LTR scripts, one can't write a dependent vowel as the first character even if it is the leftmost character. There is a subtler rule about not spelling negative numbers with a hyphen-minus - if one does, one may suddenly find a line break just after what is being used as a negative sign. In scripts where Sanskrit grv and gvr may be rendered identically, Unicode tells us what the two code sequences are, and therefore indirectly what the range of pronunciations is for a given spelling. Now, sometimes the enforcers overstep the mark. For example, the USE tells us that when we write Northern Thai /p?ia?/ 'sound of a smack' which visually is , with denoting /ia/, we should write it ?????? . So much for phonetic order! Enforcement can be more subtle. TUS says that Farsi should use U+06CC ARABIC LETTER FARSI YEH instead of U+064A ARABIC LETTER YEH although they are identical in initial and medial positions. In this case, the enforcer will be the spell-checker. Richard. From unicode at unicode.org Mon Jan 14 19:18:24 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 15 Jan 2019 01:18:24 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> <20190114233715.0a46eb16@JRWUBU2> Message-ID: <20190115011824.670e04b6@JRWUBU2> On Mon, 14 Jan 2019 16:02:05 -0800 Asmus Freytag via Unicode wrote: > On 1/14/2019 3:37 PM, Richard Wordingham via Unicode wrote: > On Tue, 15 Jan 2019 00:02:49 +0100 > Hans ?berg via Unicode wrote: > > On 14 Jan 2019, at 23:43, James Kass via Unicode > wrote: > > Hans ?berg wrote, > > How about using U+0301 COMBINING ACUTE ACCENT: ??????????? > > Thought about using a combining accent. Figured it would just > display with a dotted circle but neglected to try it out first. It > actually renders perfectly here. /That's/ good to know. (smile) > > It is a bit off here. One can try math, too: the derivative of ??(??) > is ???(??). > > No it isn't. You should be using a spacing character for > differentiation. > > Sorry, but there may be different conventions. The dot / double-dot > above is definitely common usage in physics. > > A./ Apologies. It was positioned in the parenthesis, and it looked like a misplaced U+0301. Richard. From unicode at unicode.org Mon Jan 14 19:37:56 2019 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Mon, 14 Jan 2019 20:37:56 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <96a4e654-cf49-563a-ad23-4aef54eed4c3@it.aoyama.ac.jp> Message-ID: <42a41a15-43b6-9fbf-4cf9-ca385e377dcf@kli.org> On 1/14/19 4:45 AM, Martin J. D?rst via Unicode wrote: > Hello James, others, > > From the examples below, it looks like a feature request for Twitter > (and/or Facebook). Blaming the problem on Unicode doesn't seem to be > appropriate. I think what people here are doing is not blaming the problem on Unicode, but rather blaming the _solution_ on Unicode, for better or worse. ~mark From unicode at unicode.org Mon Jan 14 19:41:17 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 14 Jan 2019 20:41:17 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> References: <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: On 1/14/19 5:08 AM, Tex via Unicode wrote: > > This thread has gone on for a bit and I question if there is any more > light that can be shed. > > BTW, I admit to liking Asmus definition for functions that span text > being a definition or criteria for rich text. > > Me too.? There are probably some exceptions or weird corner-cases, but it seems to be a really good encapsulation of the distinction which I had never seen before. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 19:48:45 2019 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Mon, 14 Jan 2019 20:48:45 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> Message-ID: On 1/14/19 4:21 PM, Asmus Freytag via Unicode wrote: > On 1/14/2019 2:08 AM, Tex via Unicode wrote: >> >> Perhaps the question should be put to twitter, messaging apps, >> text-to-voice vendors, and others whether it will be useful or not. >> >> If the discussion continues I would like to see more of a >> cost/benefit analysis. Where is the harm? What will the benefit to >> user communities be? >> > The "it does no harm" is never an argument "for" making a change. It's > something of a necessary, but not a sufficient condition, in other words. > > More to the point, if there were platforms (like social media) that > felt an urgent need to support styling without a markup language, and > could articulate that need in terms of a proposal, then we would have > something to discuss. (We might engage them in a discussion of the > advisability of supporting "markdown", for example). > > Short of that, I'm extremely leery of "leading" standardization; that > is, encoding things that "might" be used. > It is certainly true that Unicode should not be (and wasn't, before emoji) in the business of encoding things that "could be used", but rather, was for encoding things that *were* used.? This, naturally, poses a chicken-and-egg problem which has been complained about by several people in the past (including me).? Still, there are ways to show that things that haven't been encoded are still being "used", as people make shift to do what they can to use the script/notation, like using PUA or characters that aren't QUITE right, but close...? And in fairness, I'd have to say that the use of mathematical italics would count in that regard.? It's hard to dispute that there is a demand for it, just by looking at how people have been trying to do it!? So I'm starting to think this is not really "leading" standardization, but rather following up and, well, standardizing it, replacing ad-hoc attempts with a standard way to do things, just as Unicode is supposed to do. ~mark > As for the abuse of math alphabetics. That's happening whether we like > it or not, but at this point represents playful experimentation by the > exuberant fringe of Unicode users and certainly doesn't need any > additional extensions. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 19:56:37 2019 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Mon, 14 Jan 2019 20:56:37 -0500 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: <9b3c8e3f-5e20-2481-8d12-b7be716f177e@kli.org> In some of this discussion, I'm not sure what is being proposed or forbidden here... I don't know that anyone is advocating removing the "don't use these for words!" warning sticker on the mathematical italics.? The closest-to-sensible suggestions I've heard are things like a VS to italicize a letter, a combining italicizer so to speak (this is actually very similar to the emoji-style vs text-style VS sequences).? *If* the VS is ignored by searches, as apparently it should be and some have reported that it is, then VS-type solutions would NOT be a problem when it comes to searches (and don't go whining about legacy software.? If Unicode had to be backward-compatible with everything we wouldn't have gone beyond ASCII).? So I'm not sure what you mean when you speak of "Unicode italics".? Do you mean using the mathematical italics as we've been seeing?? Or having a whole new plane of italic characters for everything that could conceivably be italicized?? Those would probably both be mistakes, I agree. ~mark On 1/14/19 5:58 PM, David Starner via Unicode wrote: > On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode wrote: >> The arguments against italics seem to be: >> >> ? Unicode is plain text. Italics is rich text. >> >> ? We haven't had it until now, so we don't need it. >> >> ? There are many rich text solutions, such as html. >> >> ? There are ways to indicate or simulate italics in plain text including using underscore or other characters, using characters that look italic (eg math), etc. >> >> ? Adding Italicization might break existing software >> >> ? The examples of existing Unicode characters that seem to represent rich text (emoji, interlinear annotation, et al) have justifications. > There generally shouldn't be multiple ways of doing things. For > example, if you think that searching for certain text in italics is > important, then having both HTML italics and Unicode italics are going > to cause searches to fail or succeed unexpectedly, unless the > underlying software unifies the two systems (an extra complexity). > Searching for certain italicized text could be done today in rich text > applications, were there actual demand for it. > >> ? Plain text still has tremendous utility and rich text is not always an option. > Where? Twitter has the option of doing rich text, as does any closed > system. In fact, Twitter is rich text, in that it hyperlinks web > addresses. That Twitter has chosen not to support italics is a choice. > If users don't like this, they could go another system, or use > third-party tools to transmit rich text over Twitter. The use of > underscores or markings for italics would be mostly > compatible with human twitterers using the normal interface. 
> > Source code is an example of plain text, and yet adding italics into > comments would require but a trivial change to editors. If the user > audience cared, it would have been done. In fact, I suspect there > exist editors and environments where an HTML subset is put into > comments and rendered by the editors; certainly active links would be > more useful in source code comments than italics. > > Lastly, the places where I still find massive use of plain text are > the places this would hurt the most. GNU Grep's manpage shows no sign > that it supports searching under any form of Unicode normalization. > Same with GNU Less. Adding italics would just make searching plain > text documents more complex for their users. The domain name system > would just add them to the ban list, and they'd be used for spoofing > in filenames and other less controlled but still sensitive > environments. > From unicode at unicode.org Mon Jan 14 20:02:20 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 14 Jan 2019 18:02:20 -0800 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: <0bcc5124-534c-6049-1854-3f51aa10db19@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 14 20:02:42 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 14 Jan 2019 21:02:42 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> Message-ID: On 1/13/19 10:00 PM, Martin J. D?rst via Unicode wrote: > On 2019/01/14 01:46, Julian Bradfield via Unicode wrote: >> On 2019-01-12, Richard Wordingham via Unicode wrote: >>> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) >>> And what happens when you capitalise a word for emphasis or to begin a >>> sentence? Is it no longer the same word? >> Indeed. As has been observed up-thread, the casing idea is a dumb one! >> We are, however, stuck with it because of legacy encoding transported >> into Unicode. We aren't stuck with encoding fonts into Unicode. > No, the casing idea isn't actually a dumb one. Well, for me, when I say or said that the "casing idea" is a dumb one, I don't mean how Unicode handled it.? 
Unicode is quite correct in encoding capitals distinctly from lowercase, both for computer-historical reasons and others you mention.? I think the idea of having case in alphabets _in the first place_ was a bad move.? It's a "mistake" that happened centuries ago. ~mark From unicode at unicode.org Mon Jan 14 20:07:48 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 14 Jan 2019 21:07:48 -0500 Subject: A last missing link for interoperable representation In-Reply-To: <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> References: <00e2b810-845d-2ed8-a294-1c63937be5db@gmail.com> <6f3ffed8-5890-917b-d07d-d381218447d1@ix.netcom.com> <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <1713264411.159081.1547029604745.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <9a1a2642-4123-45ce-ebb0-c1aa4461c266@it.aoyama.ac.jp> Message-ID: (sorry for multiple responses...) On 1/13/19 10:00 PM, Martin J. D?rst via Unicode wrote: > On 2019/01/14 01:46, Julian Bradfield via Unicode wrote: >> On 2019-01-12, Richard Wordingham via Unicode wrote: >>> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT) >>> And what happens when you capitalise a word for emphasis or to begin a >>> sentence? Is it no longer the same word? >> Indeed. As has been observed up-thread, the casing idea is a dumb one! >> We are, however, stuck with it because of legacy encoding transported >> into Unicode. We aren't stuck with encoding fonts into Unicode. > No, the casing idea isn't actually a dumb one. As Asmus has shown, one > of the best ways to understand what Unicode does with respect to text > variants is that style works on spans of characters (words,...), and is > rich text, but thinks that work on single characters are handled in > plain text. Upper-case is definitely for most part a single-character > phenomenon (the recent Georgian MTAVRULI additions being the exception). Not just an exception, but an exception that proves the rule.? It's precisely because plain-text distinctions, generally speaking, should be at the letter level as Asmus says that there was so much shouting about MTAVRULI.? That these are exceptional demonstrates the existence of the rule. > But even most adults won't know the rules for what to italicize that > have been brought up in this thread. Even if they have read books that > use italic and bold in ways that have been brought up in this thread, > most readers won't be able to tell you what the rules are. That's left > to copy editors and similar specialist jobs. I don't think there's really a case to be made that italics are or should work the same as capitals, or that they are justified for the same reasons that capitals are justified.? And the use-cases show how people are using them: not necessarily for Chicago Manual of Style mandated purposes, but for emphasis of varying kinds. > There was a time when computers (and printers in particular) were > single-case. There was some discussion about having to abolish case > distinctions to adapt to computers, but fortunately, that wasn't necessary. 
Abolishing case I could see as a hassle, and we have become somewhat dependent on it for other things.? But it was a bad idea to start with. ~mark From unicode at unicode.org Tue Jan 15 00:31:02 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Tue, 15 Jan 2019 06:31:02 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.netcom.com> Message-ID: On 2019/01/15 10:48, Mark E. Shoulson via Unicode wrote: > On 1/14/19 4:21 PM, Asmus Freytag via Unicode wrote: >> Short of that, I'm extremely leery of "leading" standardization; that >> is, encoding things that "might" be used. >> > It is certainly true that Unicode should not be (and wasn't, before > emoji) Just to be precise, as already has been mentioned in this thread, the first batch of 'emoji' was in Unicode from the start (e.g. U+2603 SNOWMAN, there since Unicode 1.1), I think from Zapf Dingbats. The second batch came from Japanese phones. So for the first two batches of emoji, Unicode did not do any "leading" standardization. It was only after that, for later batches, where that happened. > in the business of encoding things that "could be used", but > rather, was for encoding things that *were* used.? This, naturally, > poses a chicken-and-egg problem which has been complained about by > several people in the past (including me).? Still, there are ways to > show that things that haven't been encoded are still being "used", as > people make shift to do what they can to use the script/notation, like > using PUA or characters that aren't QUITE right, but close...? And in > fairness, I'd have to say that the use of mathematical italics would > count in that regard.? It's hard to dispute that there is a demand for > it, just by looking at how people have been trying to do it! "a demand" doesn't quantify the demand at all. My guess is that given the overall volume of Twitter or Facebook communication, the percentage of Math italics (ab)use is really, really low. It's impossible to say that there's no demand, but use cases like "look, I found these characters, aren't they cute" in some corners of some social services is not the same as "we urgently need this, otherwise we can't communicate in our language". Regards, Martin. 
From unicode at unicode.org Tue Jan 15 00:46:31 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Tue, 15 Jan 2019 06:46:31 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> Message-ID: On 2019/01/15 07:58, David Starner via Unicode wrote: > On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode wrote: >> ? Plain text still has tremendous utility and rich text is not always an option. > > Where? Twitter has the option of doing rich text, as does any closed > system. In fact, Twitter is rich text, in that it hyperlinks web > addresses. That Twitter has chosen not to support italics is a choice. > If users don't like this, they could go another system, or use > third-party tools to transmit rich text over Twitter. The use of > underscores or markings for italics would be mostly > compatible with human twitterers using the normal interface. Yes indeed. Some similar services allow styling. One example is Slack, see e.g. https://get.slack.help/hc/en-us/articles/202288908-Format-your-messages. Markdown has been mentioned as an example of how some basic styling options (bold, italic,...) can be implemented. Another choice is using an user interface component (menu,...). The user then doesn't have to care about any 'weird' conventions, even the simplest ones, nor about what happens in the background (most probably HTML), and already is familiar with it from other applications. As for implementation complexity, it's not trivial, but there are quite a lot of components available, in particular for Web technology. It's not rocket science. Actually, in some cases, it is even difficult to get rid of styling on the Web. I recently wanted to print out a map of how to get to a restaurant for a party. The restaurant's Web site was all black background. I copied the address to Google Maps and then tried to print it. Google Maps insists that the first page is just information about the location, so I copied the name of the restaurant from the Web page. What happened was that it still had the black background. So copy-paste on your average Web browser these days doesn't lose styles, even in cases where that would be desirable (because more legible). So rich text technology is already way ahead when it comes to styled text. Do we want to encode background-color variant selectors in Unicode? If yes, how many? [Hint: The last two questions are rhetorical.] Regards, Martin. 
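(As a toy illustration of that "conventions layered over plain text" approach, assuming Python and a made-up two-rule dialect rather than Slack's or any real Markdown syntax: the styling lives in the application's renderer, while the stored text remains ordinary characters.)

    import re

    def render(text: str) -> str:
        """Render *bold* and _italic_ conventions to HTML (toy dialect only)."""
        text = re.sub(r'\*(.+?)\*', r'<b>\1</b>', text)
        text = re.sub(r'_(.+?)_', r'<i>\1</i>', text)
        return text

    print(render('This is *important* and this is _emphasis_.'))
    # -> This is <b>important</b> and this is <i>emphasis</i>.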
From unicode at unicode.org Tue Jan 15 01:07:25 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 14 Jan 2019 23:07:25 -0800 Subject: A last missing link for interoperable representation In-Reply-To: <9b3c8e3f-5e20-2481-8d12-b7be716f177e@kli.org> References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmail.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <9b3c8e3f-5e20-2481-8d12-b7be716f177e@kli.org> Message-ID: On Mon, Jan 14, 2019 at 5:58 PM Mark E. Shoulson via Unicode wrote: > *If* the VS is ignored by searches, as apparently it should be and some > have reported that it is, then VS-type solutions would NOT be a problem > when it comes to searches Who is using VS-type solutions? I could not enter except for manually using some sort of \u notations. Languages that need special input support can easily adapt to unusual rules, but English Unicode is weirdly hard to enter, because the QWERTY keyboard is ubiquitous and standard. Smart quotes, non-HYPHEN-MINUS hyphens and dashes, and accents generally need memorizing of obscure entry methods or resort to a character list. Without great support from vendors, a new Unicode italic system only going to be used by the same people who currently use mathematical italics. > (and don't go whining about legacy software. > If Unicode had to be backward-compatible with everything we wouldn't > have gone beyond ASCII). Then where's this plain text that absolutely needs italics? Those legacy software systems are the place where unadorned plain text still lives. Anything on the Web is inherently dealing with rich text. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Tue Jan 15 01:31:21 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 15 Jan 2019 08:31:21 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <0158d32d-63f4-a120-d3a5-389f206c232c@ix.netcom.com> References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0a76bc13-e872-65d7-2695-a2dcaaf08ae0@ix.n etcom.com> <001201d4ac55$ab716500$02542f00$@xencraft.com> <0158d32d-63f4-a120-d3a5-389f206c232c@ix.netcom.com> Message-ID: On 15/01/2019 01:17, Asmus Freytag via Unicode wrote: > On 1/14/2019 2:08 PM, Tex via Unicode wrote: >> >> Asmus, >> >> I agree 100%. Asking where is the harm was an actual question intended to surface problems. It wasn?t rhetoric for saying there is no harm. >> > The harm comes when this is imported into rich text environments (like this e-mail inbox). Here, the math abuse and the styled text run may look the same, but I cannot search for things based on what I see. 
I see an English or French word, type it in the search box and it won't be found. I call that 'stealth' text. > > The answer is not necessarily in folding the two, because one of the reasons for having math alphabetics is so you can search for a variable "a" of? certain kind without getting hits on every "a" in the text. Destroying that functionality in an attempt to "solve" the problems created by the alternate facsimile of styled text is also "harm" in some way. > That may end up in a feature request for webmails and e-mail clients, where the user should be given the ability to toggle between what I?d call a ?Bing search mode? and a ?Google search mode.? Google Search has extended equivalence classes that enable it to handle math alphabets like plain ASCII runs, i.e. we may type a search in ASCII and Google finds instances where the text is typeset ?abusing? math alphabets. On the other hand, Bing Search does not have such extended equivalence classes, and brings up variables as they are styled when searching correspondingly. I won?t blame Google of doing ?harm?, and I?d like to position rather on Google?s side as it seems to meet the expectations of a larger part of end-user communities. I won?t blame Microsoft neither, I?m just noting a dividing line between the two vendors about handling math alphabets. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 02:09:00 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 15 Jan 2019 09:09:00 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <0bcc5124-534c-6049-1854-3f51aa10db19@ix.netcom.com> References: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <8c0ee1d7-e054-9d2d-1c45-a8dbb25ea607@gmail.com> <29638bbf-6065-adc9-115a-88f038ea5313@gmai l.com> <000901d4abf1$0ae95980$20bc0c80$@xencraft.com> <0bcc5124-534c-6049-1854-3f51aa10db19@ix.netcom.com> Message-ID: <8eb1d719-462a-b75c-c84c-2b737213c210@orange.fr> On 15/01/2019 03:02, Asmus Freytag via Unicode wrote: > On 1/14/2019 5:41 PM, Mark E. Shoulson via Unicode wrote: >> On 1/14/19 5:08 AM, Tex via Unicode wrote: >>> >>> This thread has gone on for a bit and I question if there is any more light that can be shed. >>> >>> BTW, I admit to liking Asmus definition for functions that span text being a definition or criteria for rich text. >>> >>> >> Me too.? There are probably some exceptions or weird corner-cases, but it seems to be a really good encapsulation of the distinction which I had never seen before. >> > ** blush ** > > A./ > > I did like it too, and I was really amazed that the issue could be boiled down to such a handy shibboleth. It wasn?t until I?m looking harder that I can?t help any more seeing it as a mere rewording of current practice. That is, if we?re using markup (that typically acts on spans and other elements), it?s rich text; if we?re using characters, it?s plain text. The reason why I changed my mind is that the new shibboleth can be misused to relegate to the realm of rich text some feature of a writing system, like using superscript as ordinal indicators (English "3??", French "2?" [order] or "2??" 
[rank], Italian "1?" or ? in Latin-1 ? "1?", the latter being used in German as a narrow form of "prima" that has special semantics there ["top quality" or "great!"]), only on the basis that it is currently emulated using rich text by declaring that "?" is?or ?should? be?a span with superscript markup, so that we end up with "2e". As I?ve (too) slightly pointed in a previous reply, that is not what we should end up with. Abbreviation indicators in Latin script are a case of a single character solution, albeit multiple characters may be involved in a single instance. We can also have inner uppercase, aka camelcase, that cannot be handled by the titlecase attribute. We?re clearly in the realm of plain text, and any other solution may be called an emulation, or a legacy workaround, but not a Unicode conformant interoperable representation. Also, please note the presence in Unicode, of U+070F SYRIAC ABBREVIATION MARK, a format control? Probably there are also some other format controls in other scripts, performing likely the same job. Remember when a similar solution was suggested for Latin script on this List? Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 03:24:03 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 15 Jan 2019 10:24:03 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: Le lun. 14 janv. 2019 ? 20:25, Marcel Schneider via Unicode < unicode at unicode.org> a ?crit : > On 14/01/2019 06:08, James Kass via Unicode wrote: > > > > Marcel Schneider wrote, > > > >> There is a crazy typeface out there, misleadingly called 'Courier > >> New', as if the foundry didn?t anticipate that at some point it > >> would be better called "Courier Obsolete". ... > > > > ?????? ?????????????? seems a bit ????????? nowadays, as well. > > > > (Had to use mark-up for that ?span? of a single letter in order to > > indicate the proper letter form. But the plain-text display looks > > crazy with that HTML jive in it.) > > > > I apologize for seeming to question the font name ?????? ???? while > targeting only > the fact that this typeface is not updated to support the . It just > looks like the grand name is now misused to make people believe that if > **this** great font is unsupporting , it has a good reason to do so, > and we should keep people off using that ?exotic whitespace? otherwise than > ?intended,? ie for Mongolian. Since fortunately TUS started backing its use > in French (2014) > This is not for Mongolian and French wanted this space since long and it has a use even in English since centuries for fine typography. So no, NNBSP is definitely NOT "exotic whitespace". 
It's just that it was forgotten in the early stages of computing with legacy 8-bit encodings, but it should have been in Unicode since the beginning, as its existence is proven long before the computing age (before ASCII, or even before Baudot and telegraphic systems). It has always been used by typographers; it has centuries of tradition in publishing. And it has always been recommended, and still is today, for French by all book and paper publishers.
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From unicode at unicode.org Tue Jan 15 03:30:34 2019 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 15 Jan 2019 10:30:34 +0100 Subject: A last missing link for interoperable representation In-Reply-To: <20190115011824.670e04b6@JRWUBU2> References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <28EEB9C7-F6EE-442D-AE00-F01F98FC57FB@telia.com> <9B53F4F8-F1F7-4505-A31E-AAFA741A910F@telia.com> <20190114233715.0a46eb16@JRWUBU2> <20190115011824.670e04b6@JRWUBU2> Message-ID:
> On 15 Jan 2019, at 02:18, Richard Wordingham via Unicode wrote:
> > On Mon, 14 Jan 2019 16:02:05 -0800 > Asmus Freytag via Unicode wrote:
> >> On 1/14/2019 3:37 PM, Richard Wordingham via Unicode wrote: >> On Tue, 15 Jan 2019 00:02:49 +0100 >> Hans Åberg via Unicode wrote:
>> >> On 14 Jan 2019, at 23:43, James Kass via Unicode >> wrote:
>> >> Hans Åberg wrote,
>> >> How about using U+0301 COMBINING ACUTE ACCENT: ???????????
>> >> Thought about using a combining accent. Figured it would just >> display with a dotted circle but neglected to try it out first. It >> actually renders perfectly here. /That's/ good to know. (smile)
>> >> It is a bit off here. One can try math, too: the derivative of 𝑥(𝑡) >> is 𝑥̇(𝑡).
>> >> No it isn't. You should be using a spacing character for >> differentiation.
>> >> Sorry, but there may be different conventions. The dot / double-dot >> above is definitely common usage in physics. Also in differential geometry, as for curves.
>> A./
> > Apologies. It was positioned in the parenthesis, and it looked like a > misplaced U+0301.
In MacOS, one can drop the combined character into the character table, and see that it is U+0307 COMBINING DOT ABOVE. It comes out right when typeset in ConTeXt.
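For anyone who wants to identify which combining mark a pasted combined character actually contains without reaching for a character table, here is a minimal Python sketch; the sample string is purely illustrative:

    # Minimal sketch: list the code points inside a combined character, so a
    # pasted form like "x" plus a combining dot can be identified as U+0307.
    import unicodedata

    sample = "x\u0307(t)"   # illustrative only: x with COMBINING DOT ABOVE
    for ch in sample:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")
    # U+0078  LATIN SMALL LETTER X
    # U+0307  COMBINING DOT ABOVE
    # ... and so on for the parentheses and the t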
From unicode at unicode.org Tue Jan 15 04:04:33 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Tue, 15 Jan 2019 10:04:33 +0000 (GMT) Subject: A last missing link for interoperable representation References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: On 2019-01-15, Philippe Verdy via Unicode wrote: > This is not for Mongolian and French wanted this space since long and it > has a use even in English since centuries for fine typography. > So no, NNBSP is definitely NOT "exotic whitespace". It's just that it was > forgotten in the early stages of computing with legacy 8-bit encodings but > it should have been in Unicode since the begining as its existence is > proven long before the computing age (before ASCII, or even before Baudot > and telegraphic systems). It has alsway been used by typographs, it has > centuries of tradition in publishing. And it has always been recommended > and still today for French for all books/papers publishers. Do you expect people to encode all the variable justification spaces between words by combining all the (numerous) spaces already available in Unicode? And how about the kerning between letters? If spacing of punctuation is to be encoded instead of left to display algorithms, shouldn't you also encode the kerns instead of leaving them to the font display technology? Oh, and what about dropped initials? They have been used in both manuscripts and typography for many centuries - surely we must encode them? -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Tue Jan 15 05:24:44 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 15 Jan 2019 12:24:44 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: On 15/01/2019 10:24, Philippe Verdy via Unicode wrote: > > Le?lun. 14 janv. 2019 ??20:25, Marcel Schneider via Unicode > a ?crit?: > > On 14/01/2019 06:08, James Kass via Unicode wrote: > > > > Marcel Schneider wrote, > > > >> There is a crazy typeface out there, misleadingly called 'Courier > >> New', as if the foundry didn?t anticipate that at some point it > >> would be better called "Courier Obsolete". ... > > > > ?????? ?????????????? seems a bit ????????? nowadays, as well. 
> > > > (Had to use mark-up for that ?span? of a single letter in order to > > indicate the proper letter form.? But the plain-text display looks > > crazy with that HTML jive in it.) > > > > I apologize for seeming to question the font name ?????? ???? while targeting only > the fact that this typeface is not updated to support the . It just > looks like the grand name is now misused to make people believe that if > **this** great font is unsupporting , it has a good reason to do so, > and we should keep people off using that ?exotic whitespace? otherwise than > ?intended,? ie for Mongolian. Since fortunately TUS started backing its use > in French (2014) > > > This is not for Mongolian and French wanted this space since long and it has a use even in English since centuries for fine typography. > So no, NNBSP is definitely NOT "exotic whitespace". It's just that it was forgotten in the early stages of computing with legacy 8-bit encodings but it should have been in Unicode since the begining as its existence is proven long before the computing age (before ASCII, or even before Baudot and telegraphic systems). It has alsway been used by typographs, it has centuries of tradition in publishing. And it has always been recommended and still today for French for all books/papers publishers. Many thanks for bringing this to the point. So the case is even worse as Unicode deliberately skipped the non-breakable thin space while thinking at encoding the whole range of other typographic spaces, even with duplicate encoding of en and em spaces, and not forgetting those old-fashioned tabular spaces and dash: figure space and dash, and punctuation space. In this particular context and with all that historic practice background, what else than malice (supposedly inspired by an unlawful and exuberant DTP vendor) could drive people not to define the line-breaking property value of U+2008 PUNCTUATION SPACE as "GL", while they did define it so for U+2007 FIGURE SPACE. Here is also the still outdated wording of UAX?#14 wrt NNBSP, Mongolian and French: [?] NARROW NO-BREAK SPACE is used in Mongolian. The MONGOLIAN VOWEL SEPARATOR acts like a NARROW NO-BREAK SPACE in its line breaking behavior. It additionally affects the shaping of certain vowel characters as described in/Section 13.5, Mongolian/, of [Unicode ]. NARROW NO-BREAK SPACE is a narrow version of NO-BREAK SPACE, which has exactly the same line breaking behavior, but with a narrow display width. It is regularly used in Mongolian in certain grammatical contexts (before a particle), where it also influences the shaping of the glyphs for the particle. In Mongolian text, the NARROW NO-BREAK SPACE is typically displayed with one third the width of a normal space character. When NARROW NO-BREAK SPACE occurs in French text, it should be interpreted as an ?espace fine ins?cable?. ?When [?] it should be interpreted as [?]? is a pure insult. NARROW NO-BREAK SPACE *is* exactly at least the French "espace fine ins?cable" *and* the Mongolian whatever-it-is-called-in-Mongolian *and* the group separator, aka triad separator, in *all* locales following the SI and ISO recommendation to group digits with spaces, not with any punctuation. 
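To make that recommendation concrete, here is a minimal Python sketch that groups digits with U+202F NARROW NO-BREAK SPACE; the helper name is just an example, not an established API:

    # Minimal sketch: format an integer with NARROW NO-BREAK SPACE (U+202F)
    # as the group separator, per the SI/ISO practice of grouping with spaces.
    def group_digits_si(n: int) -> str:
        return f"{n:,}".replace(",", "\u202F")

    print(group_digits_si(299792458))   # 299 792 458, separated by U+202F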
As hopefully that misleading section will be edited, here's the link to the quoted version: https://www.unicode.org/reports/tr14/tr14-41.html#DescriptionOfProperties

Also I'd like, or rather I need, to kindly ask the knowledgeable List Members to correct the following statement *if* it is wrong: If the Unicode Standard had been set up in an unbiased way, U+2008 PUNCTUATION SPACE would have been given the line break property value "GL". Perhaps the following would also be true: If the Unicode Standard had been set up in an unbiased way, there would be a NARROW NO-BREAK SPACE encoded in the range U+2000..U+200F.

Thanks in advance to Philippe Verdy and any other knowledgeable List Members for staying or getting in touch and keeping on posting feedback. I don't edit the subject line, nor do I spin off a new thread, given that when I launched this one I sincerely believed that the issues with NARROW NO-BREAK SPACE and with preformatted superscript abbreviation indicators for the interoperable representation of French and numerous other languages (part of which use not only the former as group separator, but also the latter as ordinal indicators) were about to be definitively settled. Turns out they're not. Hopefully as this thread goes on, the sometimes extremely aggressive anti-NNBSP lobbying (and also the more lenient anti-preformatted-superscript lobbying) will come to an end, freeing the way to the real Unicode interoperable digital representation of all of the world's languages.

Best regards, Marcel
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From unicode at unicode.org Tue Jan 15 06:25:06 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 15 Jan 2019 13:25:06 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID:
Note that even if this NNBSP character is not mapped in a font, it should be rendered correctly with all modern renderers. The mapping is necessary only when a font design wants to tune its metrics, because its width varies between 1/8 and 1/6 em: the narrow space is a bit narrower in traditional English typography than in French, so typical English designs set it at about 1/8 em, typical French designs set it at 1/6 em, and neutral fonts may set it somewhere in the middle. The measure in em may however vary with some fonts, notably those using "narrow" or "wide" letters by default (because the font size in em indicates only its height), and in decorated/cursive styles (e.g. fonts with swashes need a higher line gap, so the em-square design may be smaller than for modern simplified styles for display).
But a renderer should have no problem using a default metric for all whitespace characters, which actually don't need any glyph to be drawn: all that is needed is metrics; everything else, including character properties like breaking, is inferred by the renderer independently of the font and of other per-language tuning, or is controlled by styling effects applied on top of the font.

A renderer may expand the kerning/approach if needed, for example to generate "hollow" or "shadow" effects, or to generate synthetic weights, including with "variable" fonts support. Typically the renderer will base the metrics of all missing/unmapped whitespaces on the metrics given to the normal SPACE or NBSP, which are typically both mapped to the same glyph; NNBSP can be synthesized easily using half the advance width of SPACE, and that is fine; renderers can also synthesize all other whitespaces for ideographic usages, or will adapt the rendering if instructed to synthesize a monospaced variant: here there is a choice for NNBSP to be rendered like NBSP (typically for French, as it is normally a bit wider), as a zero-width space (like in English), or contextually, for example zero-width near punctuation and like NBSP between letters/digits.

Fonts only specify defaults that alter the rendering produced by a renderer, but a renderer is not required to use all the information and all the glyphs in a specific font; it has to adapt to the context and choose what is most relevant and which kind of data it recognizes and implements/uses at runtime. The font just provides the best settings according to the font designer, if all features are enabled, but most work is done by the renderer (and fonts are completely unaware of the actual encoding of documents; fonts are only a database containing multiple features/settings, all of them being optional and selectable individually).

If a font behaves incorrectly on your system because it does not map any glyph for NNBSP, don't blame the font or Unicode for this problem; blame the renderer (or the application or OS using it: maybe they are very outdated and were not aware of these features; they are probably based on old versions of Unicode when NNBSP was still not present, even though it had been requested for a very long time, at least for French and even English, before Unicode, and long before Mongolian was encoded, only in Unicode and not in any known supported legacy charset: Mongolian was specified by borrowing the same NNBSP already designed for Latin, because the Mongolian space had no known specific behavior; the encoded whitespaces in Unicode are completely script-neutral, they are generic, and are even BiDi-neutral, all usable with any script).
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From unicode at unicode.org Tue Jan 15 05:32:34 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 15 Jan 2019 11:32:34 +0000 (GMT) Subject: A last missing link for interoperable representation Message-ID: <62f33173.a54b.1685148d5b6.Webtop.229@btinternet.com>
Martin J. Dürst wrote:
> So rich text technology is already way ahead when it comes to styled > text. Do we want to encode background-color variant selectors in > Unicode? If yes, how many?
Yes. You would only need one.
Background colour was a feature of teletext in the United Kingdom from 1976. It was very effective in its application.
In teletext, there were seven choices of foreground colour (red, green, yellow, blue, magenta, cyan, white), the default background was black. The New Background control character caused the background colour to become the same as the current foreground colour in which text was being displayed. One could then change the foreground colour. There was also a Black Background control code. This was necessary because neither text nor graphics could be black in teletext. In teletext those control codes were stateful and applied until a change or to the end of the line of text, whichever came first. So, given that Unicode is starting to encode colour choices for emoji and black is in the set of colours - and that might possibly extend to choosing colour for text - if Unicode were to encode CHANGE BACKGROUND COLOUR then the background colour could become the current foreground colour, even if that chosen foreground colour had just been selected and not actually used to colour text. The implementation in Unicode need not be stateful. > [Hint: The last two questions are rhetorical.] Maybe that was the intention, but the questions were asked and the concept is an interesting possibility for implementation. William Overington Tuesday 15 January 2019 From unicode at unicode.org Tue Jan 15 07:08:39 2019 From: unicode at unicode.org (Victor Gaultney via Unicode) Date: Tue, 15 Jan 2019 13:08:39 +0000 Subject: Encoding italic (was: A last missing link) Message-ID: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> I've been alerted to this thread by a friend, so just rejoined in order to respond. I'm currently doing research into italics. Some of the confusion and disagreement about italics centers around whether it is typographic markup or textual content. Both historically and currently italics can be used for either, but can clearly change the meaning of a word or phrase*.? It also has a different semantic meaning than bold.** It is not just rich text, nor parallel to casing. It works differently, and most like the use of matching punctuation (parentheses, brackets, quotation marks). Italics are sometimes used to indicate stress, although that is only one use. Stress is like a phonetic sound. It is represented in writing systems in different ways. However a writing system text encoding standard relates to the visual symbols and the rules of their behaviour rather than to the sound itself. Italicised text is visually different, and that difference can have a variety of meanings. It would make sense for Unicode to encode the visual difference that marks those meanings (such as stress), just as it does with punctuation. Quotation marks, for example, are visually represented in different ways depending on the language, but Unicode does have characters that are use to indicate that 'this is a quote'. So it makes no sense for Unicode to encode 'stress' as a character, but it *may* make theoretical sense to encode 'italic begin' and 'italic end' characters, just as we do parentheses, brackets, quotation marks, etc. This would allow for the use of italic in non-styled environments (text messages, social media, etc.). BTW - encoding the begin/end of italic would be very different from HTML semantic tags that attempt to encode meaning. Like punctuation, it only encodes the visual distinction, not the meaning. Use of variation selectors, a single character modifier, or combining characters also seem to be less useful options, as they act at the individual character level and are highly impractical. 
They also violate the key concept that italics are a way of marking a span of text as 'special' - not individual letters. Matched punctuation works the same way and is a good fit for italic. Although italic is a deeply Latin script concept, people do want to apply it to non-latin text (with sometimes limited sense and success). Encoding two punctuation characters would allow use across scripts, in the same way that quotation marks are sometimes used. My current research in italic won't get published publicly until 2020, however I gave a talk at ATypI Montreal about the nature of italic (https://www.youtube.com/watch?v=4vlFxed22Sg). I have an unpublished paper on italic but can't share it publicly (due to image rights). Contact me if you would like to see a private copy. Victor Gaultney * David Crystal's famous example is that these two sentences mean different things: 'I've lost my red slippers' and 'I've lost my /red/ slippers' (as opposed to my blue ones). Crystal, David. 1994. The Cambridge encyclopedia of language (Cambridge University Press), p13-14. ** Vachek, Josef, and Philip A Luelsdorff. 1989. Written language revisited (Amsterdam: Benjamins), p45-48. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 09:48:15 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 15 Jan 2019 15:48:15 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <664185367.230185.1547566799485.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <1239322872.230125.1547566618695.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <664185367.230185.1547566799485.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <2cd96c5f.8020.1685232ee39.Webtop.232@btinternet.com> Hi You are the gentleman who kindly made the Gentium typeface open source. Thank you for your generous gift to the world. > Use of variation selectors, a single character modifier, or combining characters also seem to be less useful options, as they act at the individual character level and are highly impractical. They also violate the key concept that italics are a way of marking a span of text as 'special' - not individual letters. Matched punctuation works the same way and is a good fit for italic. Italics works differently from matched punctuation marks in that with italics there is a change to each glyph whereas with matched punctuation there is no change to the glyphs between the matched punctuation marks. That difference leads to the significant difficult that there are thus two competing forces here. One of those forces is what you have stated about the nature of italics. The other of those forces is that Unicode is not stateful. Years ago I encoded some Private Use Area codes for such features as italics, with a start character and an end character to surround a span of text that would then be rendered in italics. As a result of discussion and advice I learned that such characters are not acceptable for encoding into regular Unicode because the effect would be stateful. So yes, the method that I suggested and for which James Kass suggested an enhancement is peculiar when viewed against the theory of the way that italics are used, but neither the method nor the enhanced method is stateful and that is an important feature of them. 
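Before describing such a tool, here is a minimal Python sketch of that non-stateful convention, assuming VS14 (U+FE0D) as the italicizing selector; the function names are merely illustrative:

    # Minimal sketch of the per-character VS14 convention discussed here:
    # insert U+FE0D after each non-space character of a span, or strip it out.
    VS14 = "\uFE0D"

    def italicize_span(text: str) -> str:
        return "".join(ch if ch.isspace() else ch + VS14 for ch in text)

    def strip_vs14(text: str) -> str:
        return text.replace(VS14, "")

    marked = italicize_span("very important")
    print(strip_vs14(marked) == "very important")   # True: fully reversible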
Now it would be possible for a software application program to have a feature for composing plain text where a span of text may be highlighted by a user of the software application program and every character (except perhaps spaces?) within that span of text has, at the click of a button, a VS14 character inserted after it. I remember that when handsetting metal type the same space sorts were used with italics as with roman. There could also be a button that could remove all VS14 characters, if any, from within a highlighted span of text. So, for someone typesetting plain text and viewing plain text the effect could look to be in accordance with how you consider italics should be encoded, though for plain text interchange the encoding would still be by using a VS14 character after each character that one wishes to become displayed italicized. William Overington Tuesday 15 January 2019 From unicode at unicode.org Tue Jan 15 12:22:03 2019 From: unicode at unicode.org (Johannes Bergerhausen via Unicode) Date: Tue, 15 Jan 2019 19:22:03 +0100 Subject: wws dot org Message-ID: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Dear list, I am happy to report that www.worldswritingsystems.org is now online. The web site is a joint venture by ? Institut Designlabor Gutenberg (IDG), Mainz, Germany, ? Atelier National de Recherche Typographique (ANRT), Nancy, France and ? Script Encoding Initiative (SEI), Berkeley, USA. For every known script, we researched and designed a reference glyph. You can sort these 292 scripts by Time, Region, Name, Unicode version and Status. Exactly half of them (146) are already encoded in Unicode. Here you can find more about the project: www.youtube.com/watch?v=CHh2Ww_bdyQ And is a link to see the poster: https://shop.designinmainz.de/produkt/the-worlds-writing-systems-poster/ All the best, Johannes ? Prof. Bergerhausen Hochschule Mainz, School of Design, Germany www.designinmainz.de www.decodeunicode.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 15:40:21 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 15 Jan 2019 21:40:21 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> Message-ID: <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Although there probably isn't really any concerted effort to "keep plain-text mediocre", it can sometimes seem that way. As we've been told repeatedly, just because something has been done over and over again doesn't mean that there's a precedent for it. Using spans of text as a general indicator of rich-text seems reasonable at first blush.? But selected spans can also be copy/pasted (relocated), which is not stylistic at all.? Spans of text can be selected to apply casing, which is often seen as non-stylistic.? In applications such as BabelPad, spans of text can be converted to-and-from various forms of Unicode references and encodings.? Spans of text can be transliterated, moved, or deleted. In short, selecting a span of text only means that the user is going to apply some kind of process to that span. Avant-garde enthusiasts are on the leading edge by definition. That's why they're known as trend setters.? Unicode exists because forward-looking people envisioned it and worked to make it happen. Regardless of one's perception of exuberance, Unicode turned out to be so much more than a fringe benefit. 
From unicode at unicode.org Tue Jan 15 16:16:16 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 15 Jan 2019 23:16:16 +0100 Subject: A last missing link for interoperable representation In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: On 15/01/2019 13:25, Philippe Verdy via Unicode wrote: > > Note that even if this NNBSP character is not mapped in a font, it > should be rendered correctly with all modern renderers (the mapping > is necessary only when a font design wants to tune its metrics, > because its width varies between 1/8 and 1/6 em (the narrow space is > a bit narrower in traditional English typography than in French, so > typical English design set it at about 1/8 em, typical French design > set it at 1/6 em, and neutral fonts may set it somewhere in the > middle); the measure in em may however vary with some fonts (notably > those using "narrow" or "wide" letters by default (because the font > size in em indicates only its height) and in decorated/cursive styles > (e.g. fonts with swashes need a higher line gap, the font design of > the em size may be smaller than for modern simplified styles for > display). > > But a renderer should have no problem using a default metric for all > whitespace characters, that actually don't need any glyph to be > drawn: All what is needed is metrics, everything else, inclusing > character properties like breaking are infered by the renderer > independantly of the font and other per-language tuning, or controled > by styling effects applied on top of the font Indeed, since every Unicode implementation must rely on the character properties, and given keeping this library up-to-date is straightforward and easy, there is really no point in displaying a .notdef box in lieu of whatever whitespace. As a consequence, prior to assessing the impact of the group separator migration from (wrong) to (correct) on implementations and interoperability, Unicode would be well advised to start assessing the impact of implementations (and, of course, the backing vendors) on correct rendering of , and on the related usability and interoperability of the digital representation of those many locales that should rely on . 
> > A renderer may expand the kerning/approach if needed for example to > generate "hollow" or "shadow" effects, or to generate synthetic > weights, including with "variable" fonts support, typically the > renderer will base the metrics of all missing/unmapped whitespaces > on the metrics given to the normal SPACE or NBSP which are > typically both mapped to the same glyph; NNBSP will be synthesized > easily using half the advance width of SPACE, and it's fine; > renderers can also synthesize all other whitespaces for ideographic > usages, or will adapt the rendering if instructed to synthesize a > monospaced variant: here there's a choice for NNBSP to be rendered > like NBSP, typically for French as it is normally a bit wider, or as > a zero-width space like in English, or contextually for example > zero-width near punctuation or NBSP between letters/digits.

In a monospaced font, NNBSP normally has the width of a character cell, but it has been designed for proportional fonts, and there it must not have the width of a digit, as that would annihilate the required effect. The group separator must never have the width of a full digit, not even of digit 1 in variable-width digits, but just a slight gap ensuring correct readability, by the way also after the decimal separator as per ISO 80000. Between punctuation, NNBSP mustn't be zero-width, as it is used in English to separate closing single and double quotation marks when a nested quotation ends the first-level quotation. I don't think that English uses NNBSP elsewhere around punctuation, except with dashes if appropriate according to the applied style manual, but Canadian French does, contrary to an urban legend saying it doesn't. It only prefers not to space off punctuation *if* NNBSP is unavailable. That is another proof of the inappropriateness of the NBSP for the purpose of spacing off tall punctuation marks.

> > Fonts only specify defaults that alter the rendering produced by a > renderer, but a renderer is not required to use all infos and all > glyphs in a specific font, it has to adapt to the context and choose > what is more relevant and which kind of data it recognizes and > implements/uses at runtime. The font just provides the best settings > according to the font designer, if all features are enabled, but most > work is done by the renderer (and fonts are completely unaware of > the actual encoding of documents, fonts are only a database > containing multiple features/settings, all of them being optional > and selectable individually).

Good point, indeed. Currently we are too much concerned with fonts, while actually it's all up to the renderer. Today, as most devices are permanently connected to the internet, a decent rendering engine could as well grab missing glyphs from an online repository, at Google Fonts or at the application vendor's website. All that missing-glyph whining seems completely outdated and very detrimental to the user experience. It is so anachronistic that people shouldn't be surprised about suspicions of intentional bugs for the purpose of unlawful lobbying by messing up the user experience outside of certain DTP applications. The French locale is the most heavily impacted victim of those operating modes.
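Whether a given font actually maps U+202F, and what advance width a renderer might fall back to when it does not, can be checked with a few lines of fontTools; this is only a sketch, and the font file name is an assumption:

    # Minimal sketch: see whether a font maps U+202F NARROW NO-BREAK SPACE
    # and, if not, derive a plausible fallback advance, as a renderer might.
    from fontTools.ttLib import TTFont

    NNBSP = 0x202F
    font = TTFont("DejaVuSans.ttf")        # assumed local file name
    cmap = font.getBestCmap()              # code point -> glyph name
    upem = font["head"].unitsPerEm

    if NNBSP in cmap:
        adv = font["hmtx"][cmap[NNBSP]][0]
        print(f"NNBSP mapped, advance = {adv}/{upem} em units")
    else:
        space_adv = font["hmtx"][cmap[0x0020]][0]
        fallback = min(space_adv // 2, upem // 6)   # half a space, at most 1/6 em
        print(f"NNBSP unmapped, synthesized advance = {fallback}/{upem} em units")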
> > If your fonts behave incorrectly on your system because it does not > map any glyph for NNBSP, don't blame the font or Unicode about this > problem, blame the renderer (or the application or OS using it, may > be they are very outdated and were not aware of these features, theyt > are probably based on old versions of Unicode when NNBSP was still > not present even if it was requested since very long at least for > French and even English, before even Unicode, and long before > Mongolian was then encoded, only in Unicode and not in any known > supported legacy charset: Mongolian was specified by borrowing the > same NNBSP already designed for Latin, because the Mongolian space > had no known specific behavior: the encoded whitespaces in Unicode > are compeltely script-neutral, they are generic, and are even > BiDi-neutral, they are all usable with any script). > Completely agreed. If I blame Unicode it?s for keeping the NNBSP off the Standard during almost a decade, which translates to two decades of delay due to the loss of dynamics past the early rush, and to people who keep bullying the NNBSP 20 years after it was encoded, and despite it is now widely supported. Also the ignorance related to NNBSP is still abysmal despite the very popular style manual of the French Imprimerie Nationale requires it?s use explicitly: EXCLAMATION MARK espace fine ins?cable ! justifying space (quoted/translated from figure p. 149; ISBN 9782743304829). Many thanks to all who took part in this thread ? that is very instructive and has brought up many new insights ? and likewise to those spinning of child threads and sharing material. Keep on the good work and be successful! Best regards, Marcel From unicode at unicode.org Tue Jan 15 17:47:14 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 15 Jan 2019 15:47:14 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: On Tue, Jan 15, 2019 at 1:47 PM James Kass via Unicode wrote: > > Although there probably isn't really any concerted effort to "keep > plain-text mediocre", it can sometimes seem that way. > Dennis Ritchie allegedly replied to requests for new features in C with ?If you want PL/I, you know where to find it.? C is still an austere language, and still well used, with users who want C++ or Java knowing where to find them. If you want all the features of rich text, use rich text. Avant-garde enthusiasts are on the leading edge by definition. That's > why they're known as trend setters. Unicode exists because > forward-looking people envisioned it and worked to make it happen. > Regardless of one's perception of exuberance, Unicode turned out to be > so much more than a fringe benefit. > Unicode exists because large corporations wanted to sell computers to users around the world, and found supporting a million different character sets was costly and buggy, and that users wanted to mix scripts in ways that a single character set didn't support and ISO 2022 and similar solutions just weren't cutting it. That's a clear user story. People can use italics on computers without problem. Twitter has chosen not to support italics on their platform, which users have found hacky work-arounds for. That's not such a clear user story; shouldn't Twitter add support for italics instead of changing every system in the world? 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 15 19:15:27 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 01:15:27 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: Enabling plain-text doesn't make rich-text poor. People who regard plain-text with derision, disdain, or contempt have every right to hold and share opinions about what plain-text is *for* and in which direction it should be heading.? Such opinions should receive all the consideration they deserve. From unicode at unicode.org Tue Jan 15 21:53:47 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 16 Jan 2019 04:53:47 +0100 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: <4520c27d-c54a-ea00-87f1-ed2b9cc43c04@orange.fr> On 16/01/2019 02:15, James Kass via Unicode wrote: > > Enabling plain-text doesn't make rich-text poor. > > People who regard plain-text with derision, disdain, or contempt have > every right to hold and share opinions about what plain-text is *for* > and in which direction it should be heading. Such opinions should > receive all the consideration they deserve. Perhaps there?s a need to sort out what plain text is thought to be across different user communities. Sometimes ?plain text? is just a synonym for _draft style_, considering that a worker should not need to follow any style guide, because (a) many normal keyboards don?t enable users to do so, and (b) the process is too complicated using mainstream extended keyboard layouts. From this point of view, any demand to key in directly a text in a locale?s accurate digital representation is likely to be considered an unreachable challenge and thus, an offense. But indeed, people are entitled not to screw down their requirements as of what text is supposed to look like. From that POV, draft style is unbearable, and being bound to it is then the actual offense. The first step would then be to beef up that draft style so that it integrates all characters needed for a fully featured representation of a locale?s language, from curly quotes to preformatted superscript. Unicode makes it possible, in the straight line of what was set up in ISO/IEC 6937. The next step is to design appropriate input methods. Today, we can even get back the u?n?d?e?r?l?i?n?e? that we were deprived of, by adding an appropriate dead key or combining diacritic, but that?s still experimental. It already works better, though, than the Unicode Syriac abbreviation control, whose overline is *not* rendered in Chrome on Linux, The same way, Unicode could encode a Latin italic control, or as Victor Gaultney proposes, a Latin italic start control and a Latin italic end control, directing the rendering engine to pick italics instead of drawing a linie along the rest of the word. However, the discussion about Fraktur typefaces in the parent thread made clear that reasoning in terms of roman vs italic is not really interoperable, because in Roman typefaces, italic is polysemic, as it?s used both for foreign words and for stress, while in Fraktur, stress is denoted by spacing out, and foreign words, by using roman. That would require a start and end pair of both Latin foreign word controls and Latin stress controls. 
As we see it from here, that would be even less implemented than the Syriac abbreviation format control. It might be considered Unicode conformant, since it would be part of the interoperable digital representation of Latin script using languages, and its use could be extended to other scripts. But that is *not* what I?m asking for. First, we aren?t writing in Fraktur any more, at least not in France nor in any other language using preformatted superscript abbreviation indicators. And second, if we need a document for full-fleshed publishing, we can use LaTeX or InDesign. What I?m asking for is simply that people are enabled to write in their language in a decent manner and can use that text in any environment without postprocessing *and* without looking downright bad. That might please even those who are looking at draft style with disdain. Best regards, Marcel From unicode at unicode.org Tue Jan 15 22:40:16 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 04:40:16 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> Message-ID: <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Victor Gaultney wrote, > Use of variation selectors, a single character modifier, or combining > characters also seem to be less useful options, as they act at the individual > character level and are highly impractical. They also violate the key concept > that italics are a way of marking a span of text as 'special' - not individual > letters. Matched punctuation works the same way and is a good fit for italic. The VS possibility would double the character count of any strings including them.? That may make it undesirable for groups like Twitter who have limits.? But math (mis)use doesn't affect the character count.? If the VS method were to be used, the math alphanumerics might continue to be used where possible, at least by Twitter users who already employ the math-alphas to make their corpus of legacy data. Using VS arose in the parent thread as a way of avoiding the necessity of adding additional characters to the standard.? (But we don't seem to be running out of available code space.)? The purpose of VS is to preserve variant letter form distinctions in plain-text, which seems to apply to italics.? Further, VS is an existing mechanism which wouldn't be expected to impact searching and so forth on savvy systems.? (An opening/closing pair of control characters also shouldn't impact searching.)? Finally, VS already works in existing technology and there wouldn't be a long down-time waiting for updates to the standard and implementation of same. (Not that we should rush to judgment or "solutions" here, just that an ad-hoc "solution" is possible and could be implemented by third-parties.) Concerns about statefulness in plain-text exist.? Treating "italic" as an opening/closing "punctuation" may help get around such concerns.? IIRC, it was proposed that the Egyptian cartouche be handled that way. Like emoji, people who don't like italics in plain text don't have to use them. 
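For what it's worth, a search implementation that wants to fold both conventions (math alphanumerics and variation selectors) away before matching needs very little machinery; here is a minimal Python sketch, illustrative only and not a description of what any particular engine does:

    # Minimal sketch: fold math-alphanumeric "styled" letters back to ASCII via
    # compatibility normalization and ignore variation selectors, so either
    # convention still matches an ordinary query.
    import unicodedata

    VS = {chr(cp) for cp in range(0xFE00, 0xFE10)} | \
         {chr(cp) for cp in range(0xE0100, 0xE01F0)}

    def fold_for_search(s: str) -> str:
        s = unicodedata.normalize("NFKC", s)
        return "".join(ch for ch in s if ch not in VS).casefold()

    print(fold_for_search("\U0001D413\U0001D404\U0001D417\U0001D413"))  # 'text' (from math bold TEXT)
    print(fold_for_search("i\uFE0Dt\uFE0Da\uFE0Dl\uFE0Di\uFE0Dc"))      # 'italic'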
From unicode at unicode.org Tue Jan 15 23:05:24 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 15 Jan 2019 21:05:24 -0800 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: On Tue, Jan 15, 2019 at 5:17 PM James Kass via Unicode wrote: > Enabling plain-text doesn't make rich-text poor. > Adding italics to Unicode will complicate the implementation of all rich text applications that currently support italics. > People who regard plain-text with derision, disdain, or contempt have > every right to hold and share opinions about what plain-text is *for* > and in which direction it should be heading. Such opinions should > receive all the consideration they deserve. > Really? There's no one here regards plain text with derision, disdain or contempt. I might complain about the people who claim to like plain text yet would only be happy with massive changes to it, though. However, plain text can be used standalone, and it can be used inside programs and other formats. Dismissing the people who use Unicode in ways that aren't plain text is unfair and hurts your case. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 00:17:46 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 06:17:46 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: <05235340-4f56-9c34-67e1-6f9c2e5829d1@gmail.com> Responding to David Starner, > I might complain about the people who claim to like plain text yet would > only be happy with massive changes to it, though. Most movie lovers welcomed talkies. People are free to cling to their rotary phones as long as they like.? They just can't press the pound sign. > However, plain text can be used standalone, and it can be used inside > programs and other formats. That remains true even post-emoji.? How would italics change that? > Dismissing the people who use Unicode in ways that aren't plain text > is unfair and hurts your case. It wasn't my intention to be dismissive, much, so point taken. Discussions like this one exist so that people can express concerns and share ideas towards resolutions. > Adding italics to Unicode will complicate the implementation of all rich > text applications that currently support italics. Would there be any advantages to rich-text apps if italics were added to Unicode?? Is there any cost/benefit data?? You've made an assertion about complication to rich-text apps which I can neither confirm nor refute. One possible advantage would be interoperability.? People snagging snippets of text from web pages or word processors and dropping data into their plain-text windows wouldn't be bamboozled by the unexpected.? If computer text is getting exchanged, isn't it better when it can be done in a standard fashion? From unicode at unicode.org Wed Jan 16 02:15:38 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 16 Jan 2019 09:15:38 +0100 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> Message-ID: <33661316-2594-8183-f268-e3430e36c2ae@orange.fr> On 16/01/2019 06:05, David Starner via Unicode wrote: [?] > [?] There's no one here regards plain text with derision, disdain or contempt. 
There is one sort of so-called plain text that looks unbearable to me. That is the draft-style plain text full of ASCII fallbacks. Especially those where Latin abbreviation indicators that are correctly superscript, are sitting on the baseline. Also those using ASCII space or Latin-1 non-breakable space to space off French punctuation, and where those marks are then cut off by line breaks, or torn apart by justification when such plain text is the backbone of rich text on the web (where remains unhacked, unlike in word processors where it?s fixed-width, and even then it?s ugly). > [?] Dismissing the people who use Unicode in ways that aren't plain text is unfair [?]. Is this statement applying the restrictive house policy about what is ?ordinary (plain) text? as it is found in TUS? I?m asking the question because even if this statement is a mark of support and empathy, I?m uncomfortable with the idea that there seems to be a subset of Unicode that despite being plain text by definition, cannot be used in every plain text string. Please feel free to post your definition of "plain text". I feel that it will add to the collection. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 02:57:15 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 08:57:15 +0000 Subject: A last missing link for interoperable representation In-Reply-To: References: <211de2f2.5cdb.16832185601.Webtop.73@btinternet.com> <6127f5ea.5d0e.168322908d2.Webtop.73@btinternet.com> <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: Julian Bradfield wrote, > Oh, and what about dropped initials? They have been used in both > manuscripts and typography for many centuries - surely we must encode > them? Naa-aah, we just hack the full width presentation forms for that. ?rop ?aps in ?lain ?ext (Whether they actually drop depends on the font, though.) From unicode at unicode.org Wed Jan 16 03:12:23 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 16 Jan 2019 01:12:23 -0800 Subject: Encoding italic In-Reply-To: <05235340-4f56-9c34-67e1-6f9c2e5829d1@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <7f1cd838-e388-aff8-7dab-a5d62a6e5c3f@gmail.com> <05235340-4f56-9c34-67e1-6f9c2e5829d1@gmail.com> Message-ID: On Tue, Jan 15, 2019 at 10:19 PM James Kass via Unicode wrote: > Would there be any advantages to rich-text apps if italics were added to > Unicode? Is there any cost/benefit data? You've made an assertion > about complication to rich-text apps which I can neither confirm nor refute. It's trivial; virtually all rich-text apps support italics or specifically don't support italics. Suddenly they have to unify italics from the plain text with the higher level italics, or they have to exclude italics from the input data. > One possible advantage would be interoperability. 
People snagging > snippets of text from web pages or word processors and dropping data > into their plain-text windows wouldn't be bamboozled by the unexpected. > If computer text is getting exchanged, isn't it better when it can be > done in a standard fashion? Bamboozled by the unexpected? I think the expectations of those who have plain-text windows (who are still watching silents, in a sense) is that pasting data into them will not copy italics. As for more common users, a quick websearch shows many examples of people frustrated that they cut and paste something and details like bold and italics were carried along. This also establishes that current systems already allow rich text to be cut-and-pasted in a platform-specific manner. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Wed Jan 16 03:30:40 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 16 Jan 2019 09:30:40 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <0af51bfc-5c40-8b7e-3c55-d560b1aedd27@gmail.com> I wrote, > The VS possibility would double the character count of any strings > including them. A kind list member has pointed out privately that the above is mistaken.? Twitter character counts aren't actually character counts.? Each math-alpha counts as two characters as do the VS characters.? So a string with VS characters interspersed would actually be triple rather than double. (I've also been advised that a lot of the math-alpha on Twitter involves fraktur, script, and double struck characters.? As was pointed out to me, that practice would probably continue even if Twitter enabled italic and bold styling as a feature.? Again, I do not personally know how widespread the practice is.) From unicode at unicode.org Wed Jan 16 05:23:59 2019 From: unicode at unicode.org (Victor Gaultney via Unicode) Date: Wed, 16 Jan 2019 11:23:59 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: James Kass wrote: > Concerns about statefulness in plain-text exist.? Treating "italic" as > an opening/closing "punctuation" may help get around such concerns. > IIRC, it was proposed that the Egyptian cartouche be handled that way. I do appreciate the technical issues surrounding statefulness and user expectation when they select, copy, and paste. However that has always been an issue. The Latin script (and many others) already has 'states', and that is reflected in the encoding of the markers that indicate the beginning and end of those states (parens, quotes, etc.). In the Latin script those markers are visually represented as separate glyphs, although sometimes enterprising font makers will use OpenType or Graphite to adjust those glyphs in context. Encoding 'begin italic' and 'end italic' would introduce difficulties when partial strings are moved, etc. But that's no different than with current punctuation. If you select the second half of a string that includes an end quote character you end up with a mismatched pair, with the same problems of interpretation as selecting the second half of a string including an 'end italic' character. Apps have to deal with it, and do, as in code editors. 
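To illustrate how little machinery that takes, here is a minimal Python sketch; the 'begin italic' and 'end italic' code points used below are purely hypothetical placeholders (nothing of the kind is encoded today), with Private Use values standing in for them:

    # Minimal sketch of how an app could tolerate unbalanced begin/end italic
    # marks, much as it tolerates an unmatched closing quotation mark.
    BEGIN_IT = "\uE000"   # hypothetical 'begin italic' (PUA stand-in)
    END_IT   = "\uE001"   # hypothetical 'end italic' (PUA stand-in)

    def render_to_html(s: str) -> str:
        out, open_run = [], False
        for ch in s:
            if ch == BEGIN_IT:
                if not open_run:
                    out.append("<i>")
                    open_run = True
            elif ch == END_IT:
                if open_run:
                    out.append("</i>")
                    open_run = False
                # a stray end mark is simply ignored
            else:
                out.append(ch)
        if open_run:
            out.append("</i>")   # close an unterminated run at end of text
        return "".join(out)

    print(render_to_html("lost my " + BEGIN_IT + "red" + END_IT + " slippers"))
    # lost my <i>red</i> slippers
    print(render_to_html("red" + END_IT + " slippers"))   # red slippers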
Apps (and font makers) can also choose how to deal with presenting strings of text that are marked as italic. They can choose to present visual symbols to indicate begin/end, such as /this/. Or they can present it using the italic variant of the font, if available. Yes, that brings up the issue of what to do if no italic counterpart is there. But that's already an issue with people using math characters for pseudo-italic. I'd guess that far, far more fonts in the world have italic counterparts than contain math chars, and the trend toward always having roman/italic matched pairs is something I've established in my research interviews.

Treating italic like punctuation is a win for a lot of people:
- Users get their italic content preserved in plain text
- Those who develop plain text apps (social media in particular) don't have to build a whole markup/markdown layer into their apps
- Misuse of math chars for pseudo-italic would likely disappear
- The text runs between markers remain intact, so they need no special treatment in searching, selecting, etc.
- It finally, and conclusively, would end the decades of the mess in HTML that surrounds <i> and <em>.

My main point in suggesting that Unicode needs these characters is that italic has been used to indicate specific meaning - this text is somehow special - for over 400 years, and that content should be preserved in plain text.
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From unicode at unicode.org Wed Jan 16 06:13:26 2019 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Wed, 16 Jan 2019 12:13:26 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <0af51bfc-5c40-8b7e-3c55-d560b1aedd27@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <0af51bfc-5c40-8b7e-3c55-d560b1aedd27@gmail.com> Message-ID:
> On 16 Jan 2019, at 09:30, James Kass via Unicode wrote:
> > > I wrote,
> > > The VS possibility would double the character count of any strings > > including them.
> > A kind list member has pointed out privately that the above is mistaken. Twitter character counts aren't actually character counts. Each math-alpha counts as two characters as do the VS characters. So a string with VS characters interspersed would actually be triple rather than double.

Odd! I have just briefly experimented with Twitter and it appears that any character ≥ U+1100 has a count of 2 and any character < U+1100 has a count of 1. I remember many years ago Twitter was incorrectly counting in UTF-16 encoding units, thus giving a count of 1 for BMP characters and a count of 2 for astral characters. That problem was fixed long ago.

André Schappo

> (I've also been advised that a lot of the math-alpha on Twitter involves fraktur, script, and double struck characters. As was pointed out to me, that practice would probably continue even if Twitter enabled italic and bold styling as a feature. Again, I do not personally know how widespread the practice is.) >
From unicode at unicode.org Wed Jan 16 06:16:17 2019 From: unicode at unicode.org (Andrew Cunningham via Unicode) Date: Wed, 16 Jan 2019 23:16:17 +1100 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID:
Hi Victor, an off list reply. The contents are just random thoughts sparked by an interesting conversation.
On Wed, 16 Jan 2019 at 22:44, Victor Gaultney via Unicode < unicode at unicode.org> wrote: > > - It finally, and conclusively, would end the decades of the mess in HTML > that surrounds <i> and <em>. > I am not sure that would fix the issue; more likely it would compound the issue, making it even more blurry what the semantic purpose is. HTML5 makes both <i> and <em> semantic ... and by definition the style of the elements is not necessarily italic. <em> for instance would be script-dependent, and <i> may be partially script-dependent when another appropriate semantic tag is missing. A character/encoding-level distinction is just going to compound the mess. And then there are all the other script-specific typographic / typesetting conventions that should also be considered. > My main point in suggesting that Unicode needs these characters is that > italic has been used to indicate specific meaning - this text is somehow > special - for over 400 years, and that content should be preserved in plain > text. > > > Underlining, bold text, interletter spacing, colour change, font style change all are used to apply meaning in various ways. Not sure why italic is special in this sense. Additionally, without encoding the meaning of italic, all you know is that it is italic, not what convention of semantic meaning lies behind it. And I am curious about your thoughts: if we distinguish italic in Unicode and encode some way of specifying italic text, wouldn't it make more sense to do away with italic fonts altogether and just roll the italic glyphs into the regular font? In theory, changing italic from a stylistic choice, as it currently is, to an encoding/character-level semantic is a paradigm shift. We don't have separate fonts for variation selectors or any other mechanism in Unicode, and it would seem to make sense to roll character glyph variation into a single font. And potentially exclude italicisation from being a viable axis in a variable font. Just speculation on my part. To clarify, I am neither for nor against encoding italics. But so far there doesn't seem to be a robust case for it. But if it were introduced I would prefer a system that was more inclusive of all scripts, giving proper analysis of typesetting and typographic conventions in each script and well-founded decisions on which should be encoded. Cherry-picking one feature relevant to a small set of scripts seems to be a problematic path. I have enough trouble with ordered and unordered lists and list markers in HTML without expanding the italics mess in HTML. -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 08:33:39 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 16 Jan 2019 15:33:39 +0100 Subject: wws dot org In-Reply-To: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Message-ID: <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> On 15/01/2019 19:22, Johannes Bergerhausen via Unicode wrote: > Dear list, > > I am happy to report that www.worldswritingsystems.org is now online. > > The web site is a joint venture by > > - Institut Designlabor Gutenberg (IDG), Mainz, Germany, > - Atelier National de Recherche Typographique (ANRT), Nancy, France and > - Script Encoding Initiative (SEI), Berkeley, USA. > > For every known script, we researched and designed a reference glyph. > > You can sort these 292 scripts by Time, Region, Name, Unicode version and Status.
> Exactly half of them (146) are already encoded in Unicode. So to date, Unicode has only made half its way, and for every single script in the Standard there is another script out there that remains still unsupported. First things first. When I first replied in the first thread of this year I already warned: >>> Having said that, still unsupported minority languages are top priority. I didn't guess that I opened a Pandora's box whose content would lead us far away from the only useful goal deeply embedded in the concept of Unicode: support all of the world's writing systems. Instead, we're discussing how to enable social media users to tune ephemeral messages even more, to attract even more of the scarce attention of overwhelmed co-users before being buried in the mass of a vanishing timeline. I sought feedback about using Unicode to get back the underlining feature known from the typewriter era. But like some other hints I provided, it went unpicked… Sadly it's uninteresting, no cherries. Also, if Unicode had to wait until enough characters are picked for adoption prior to encoding the missing scripts, I'm afraid the job won't ever be done… The industry is welcome to help speed up the process. Thanks to Johannes Bergerhausen for setting up and sharing this resource. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 10:23:40 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 16 Jan 2019 08:23:40 -0800 Subject: wws dot org In-Reply-To: <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 11:30:15 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Wed, 16 Jan 2019 17:30:15 +0000 (GMT) Subject: New ideas (from: wws dot org) In-Reply-To: References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> Message-ID: <74897b08.2613.16857b6a9e8.Webtop.231@btinternet.com> Asmus Freytag wrote as follows: > PS: of course, if a contemplated change, such as the one alluded to, > should be ill advised, its negative effects could have wide ranging > impacts...but that's not the topic here. If you object to encoding italics, please say so and, if possible, please provide some reasons. I am on moderated posts because one of my ideas, on which I am continuing to research, is permanently banned from being discussed in the Unicode mailing list. Every post that I send is screened by a moderator, and not every post gets through to the mailing list. There have been developments in my research project, such as the definition of an encoding space for the particular purpose, and I am proposing to express access to those items using U+FFF7 and a sequence of tag digits so that the items can be unambiguously encoded in Unicode plain text. That could be very useful for the future, yet I cannot even post about it to the Unicode mailing list because the topic is permanently banned. One character to implement a potentially very useful invention, and its encoding cannot be discussed in the Unicode mailing list. So many people will not even know that the suggestion has ever been made, and so they can neither think it a good idea nor make helpful comments or otherwise, because it cannot even be discussed.
So the encoding of italics, on which topic my posts have thus far been allowed through, is regarded as only mildly controversial in relation to my research project. I have sent a copy of this email to various people as well as to the Unicode mailing list, and it may well be that this post will not be allowed through to the Unicode mailing list due to the ban. So if you reply to this email, please do not send a copy to the Unicode mailing list unless you receive a copy listed as from me via Unicode rather than just listed as from me: although I would like to see discussion on the Unicode mailing list of the possibility of encoding U+FFF7 as a base character for these items, I do not wish such a discussion in the Unicode mailing list to happen by error. William Overington Wednesday 16 January 2019 From unicode at unicode.org Wed Jan 16 11:38:09 2019 From: unicode at unicode.org (Phake Nick via Unicode) Date: Thu, 17 Jan 2019 01:38:09 +0800 Subject: wws dot org In-Reply-To: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Message-ID: Feedback after briefly reading the East Asia section of the website:
1. I am pretty sure the "Kaida" script is not living anymore, according to the Wikipedia description.
2. Hentaigana refers to all alternative forms of kana that were used before modern standardization; I don't think they are still actively used now.
3. The meaning of "Old Hanzi" is not clear. If it is the same definition as the one stated in this blog: http://babelstone.blogspot.com/2007/07/old-hanzi.html , then it does not refer to a single script but instead refers to all historical ways to write Hanzi, including Oracle Bone script, Bronze script, and (Small) Seal script and such. And the list already separately includes Oracle Bone script, Bronze script and Seal script, which apparently makes this "Old Hanzi" entry redundant.
On Wed, 16 Jan 2019 at 02:25, Johannes Bergerhausen via Unicode wrote: > Dear list, > > I am happy to report that www.worldswritingsystems.org is now online. > > The web site is a joint venture by > > - Institut Designlabor Gutenberg (IDG), Mainz, Germany, > - Atelier National de Recherche Typographique (ANRT), Nancy, France and > - Script Encoding Initiative (SEI), Berkeley, USA. > > For every known script, we researched and designed a reference glyph. > > You can sort these 292 scripts by Time, Region, Name, Unicode version and > Status. > Exactly half of them (146) are already encoded in Unicode. > > Here you can find more about the project: > www.youtube.com/watch?v=CHh2Ww_bdyQ > > And is a link to see the poster: > https://shop.designinmainz.de/produkt/the-worlds-writing-systems-poster/ > > All the best, > Johannes > > > > > - Prof. Bergerhausen > > Hochschule Mainz, School of Design, Germany > > www.designinmainz.de > > www.decodeunicode.org > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Wed Jan 16 12:07:42 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Wed, 16 Jan 2019 10:07:42 -0800 Subject: New ideas (from: wws dot org) In-Reply-To: <74897b08.2613.16857b6a9e8.Webtop.231@btinternet.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <63d51c54-66c7-1f49-bf18-d5fd8c656d89@orange.fr> <74897b08.2613.16857b6a9e8.Webtop.231@btinternet.com> Message-ID: <2e461d85-3ba3-adf8-2435-4bdc25c5af24@ix.netcom.com> On 1/16/2019 9:30 AM, wjgo_10009 at btinternet.com wrote: > Asmus Freytag wrote as follows: > >> ?PS: of course, if a contemplated change, such as the one alluded to, >> should be ill advised, its negative effects could have wide ranging >> impacts...but that's not the topic here. > > If you object to encoding italics please say so and if possible please > provide some reasons. It's not the topic of this thread. Let's keep the discussion in one place. A./ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 16 14:53:05 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 16 Jan 2019 20:53:05 +0000 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> Message-ID: <20190116205305.213b335d@JRWUBU2> On Tue, 15 Jan 2019 13:25:06 +0100 Philippe Verdy via Unicode wrote: > If your fonts behave incorrectly on your system because it does not > map any glyph for NNBSP, don't blame the font or Unicode about this > problem, blame the renderer (or the application or OS using it, may > be they are very outdated and were not aware of these features, theyt > are probably based on old versions of Unicode when NNBSP was still > not present even if it was requested since very long at least for > French and even English, before even Unicode, and long before > Mongolian was then encoded, only in Unicode and not in any known > supported legacy charset: Mongolian was specified by borrowing the > same NNBSP already designed for Latin, because the Mongolian space > had no known specific behavior: the encoded whitespaces in Unicode > are compeltely script-neutral, they are generic, and are even > BiDi-neutral, they are all usable with any script). The concept of this codepoint started for Mongolian, but was generalised before the character was approved. Now, I understand that all claims about character properties that cannot be captured in the UCD should be dismissed as baseless, but if we believed the text of TUS we would find that NNBSP has some interesting properties with application only to Mongolian: 1) It has a shaping effect on following character. 2) It has zero width at the start of a line. 3) When the line-breaking algorithm does not provide enough line-breaking opportunities, it changes its line-breaking property from GL to BB. Or is property (3) appropriate for French? Richard. 
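For readers who do not carry UAX #14 around in their heads, the practical difference between those two line-break classes can be sketched as follows. This is a deliberately drastic simplification written only to illustrate the GL-versus-BB behaviour mentioned above, not the real line-breaking algorithm.

NNBSP = "\u202F"  # U+202F NARROW NO-BREAK SPACE

def break_opportunities(text, nnbsp_class="GL"):
    """Indices where a break is allowed under a toy two-rule model:
    after an ordinary space, and (only if NNBSP is treated as BB)
    immediately before a NNBSP. GL permits no break on either side."""
    allowed = []
    for i in range(1, len(text)):
        if text[i - 1] == " ":
            allowed.append(i)
        elif text[i] == NNBSP and nnbsp_class == "BB":
            allowed.append(i)
    return allowed

s = "environ 100" + NNBSP + "%"
print(break_opportunities(s, "GL"))  # [8]     - "100 %" can never be split
print(break_opportunities(s, "BB"))  # [8, 11] - the NNBSP plus suffix may wrap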
From unicode at unicode.org Wed Jan 16 21:38:46 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 Jan 2019 03:38:46 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: Victor Gaultney wrote, > Treating italic like punctuation is a win for a lot of people: Italic Unicode encoding is a win for a lot of people regardless of approach.? Each of the listed wins remains essentially true whether treated as punctuation, encoded atomically, or selected with VS. > My main point in suggesting that Unicode needs these characters is that > italic has been used to indicate specific meaning - this text is somehow > special - for over 400 years, and that content should be preserved in plain > text. ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf ) "Plain text must contain enough information to permit the text to be rendered legibly, and nothing more." The argument is that italic information can be stripped yet still be read.? A persuasive argument towards encoding would need to negate that; it would have to be shown that removing italic information results in a loss of meaning. The decision makers at Unicode are familiar with italic use conventions such as those shown in "The Chicago Manual of Style" (first published in 1906).? The question of plain-text italics has arisen before on this list and has been quickly dismissed. Unicode began with the idea of standardizing existing code pages for the exchange of computer text using a unique double-byte encoding rather than relying on code page switching.? Latin was "grandfathered" into the standard.? Nobody ever submitted a formal proposal for Basic Latin.? There was no outreach to establish contact with the user community -- the actual people who used the script as opposed to the "computer nerds" who grew up with ANSI limitations and subsequent ISO code pages.? Because that's how Unicode rolled back then.? Unicode did what it was supposed to do WRT Basic Latin. When someone points out that italics are used for disambiguation as well as stress, the replies are consistent. "That's not what plain-text is for."? "That's not how plain-text works."? "That's just styling and so should be done in rich-text." "Since we do that in rich-text already, there's no reason to provide for it in plain-text."? "You can already hack it in plain-text by enclosing the string with slashes."? And so it goes. But if variant letter form information is stripped from a string like "Jackie Brown", the primary indication that the string represents either a person's name or a Tarantino flick title is also stripped.? "Thorstein Veblen" is either a dead economist or the name of a fictional yacht in the Travis McGee series.? And so forth. Computer text tradition aside, nobody seems to offer any legitimate reason why such information isn't worthy of being preservable in plain-text.? Perhaps there isn't one. I'm not qualified to assess the impact of italic Unicode inclusion on the rich-text world as mentioned by David Starner.? Maybe another list member will offer additional insight or a second opinion. 
From unicode at unicode.org Wed Jan 16 21:51:57 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 04:51:57 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: <20190116205305.213b335d@JRWUBU2> References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: On 16/01/2019 21:53, Richard Wordingham via Unicode wrote: > > On Tue, 15 Jan 2019 13:25:06 +0100 > Philippe Verdy via Unicode wrote: > >> If your fonts behave incorrectly on your system because it does not >> map any glyph for NNBSP, don't blame the font or Unicode about this >> problem, blame the renderer (or the application or OS using it, may >> be they are very outdated and were not aware of these features, theyt >> are probably based on old versions of Unicode when NNBSP was still >> not present even if it was requested since very long at least for >> French and even English, before even Unicode, and long before >> Mongolian was then encoded, only in Unicode and not in any known >> supported legacy charset: Mongolian was specified by borrowing the >> same NNBSP already designed for Latin, because the Mongolian space >> had no known specific behavior: the encoded whitespaces in Unicode >> are compeltely script-neutral, they are generic, and are even >> BiDi-neutral, they are all usable with any script). > > The concept of this codepoint started for Mongolian, but was generalised > before the character was approved. Indeed it was proposed as MONGOLIAN SPACE at block start, which was consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much more. When Unicode argued in favor of a unification with an existing space character, this was pointed out as impracticable, and the need of a specific Mongolian space for the purpose of appending suffixes was underscored. Only in London in September 1998 was it agreed that “The Mongolian Space is retained but moved to the general punctuation block and renamed ‘Narrow No Break Space’”. However, unlike for the Mongolian Combination Symbols sequencing a question and exclamation mark both ways, a concrete rationale as to how useful the NNBSP could be in other scripts doesn't seem to have been put on the table when the move to General Punctuation was decided. > > Now, I understand that all claims about character properties that cannot > be captured in the UCD should be dismissed as baseless, but if we > believed the text of TUS we would find that NNBSP has some interesting > properties with application only to Mongolian: As a side note: the relevant text of TUS doesn't predate version 11 (2018). > > 1) It has a shaping effect on following character. > 2) It has zero width at the start of a line. > 3) When the line-breaking algorithm does not provide enough > line-breaking opportunities, it changes its line-breaking property > from GL to BB.
I don't believe that these additions to TUS are in any way able to fix the many issues with NNBSP in Mongolian, causing so much headache and ending up in a unanimous desire to replace it with a new MONGOLIAN SUFFIX CONNECTOR. Indeed some suffixes are as long as 7 letters, e.g. ?????????; see https://lists.w3.org/Archives/Public/public-i18n-mongolian/2015JulSep/att-0036/DS05_Mongolian_NNBSP_Connected_Suffixes.pdf > > Or is property (3) appropriate for French? No, it isn't. It only introduces new flaws for a character that, despite being encoded for Mongolian with specific handling intended, was readily ripped off for use in French, as Philippe Verdy reported; to that extent, it is actually an encoding error in Mongolian that brought the long-missing narrow non-breakable thin space into the UCS, into the block where it really belongs, and where it would have been encoded from the beginning had there been no desire to keep it proprietary. That is the hidden (almost occult) fact that stances like “The NNBSP can be used to represent the narrow space occurring around punctuation characters in French typography, which is called an ‘espace fine insécable.’” (TUS) and “When NARROW NO-BREAK SPACE occurs in French text, it should be interpreted as an ‘espace fine insécable’.” (UAX #14) are stemming from. The underlying meaning as I understand it now is like: “The non-breakable thin space is usually a vendor-specific layout control in DTP applications; it's also available via a TeX command. However, if you are interested in an interoperable representation, here's a Unicode character you can use instead.” Due to the way NNBSP made its delayed way into Unicode, font support was reported, as late as almost exactly two years ago, to be extremely scarce, as this analysis of the first 47 fonts on Windows 10 shows: https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf Surprisingly for me, Courier New has NNBSP. We must have been using old copies. I'm really glad that this famous and widely used typeface has been updated. Please disregard my previous posting about Courier New unsupporting NNBSP. I'll need to use a font manager to output a complete list with respect to NNBSP support. I'm utterly worried about the fate of the non-breaking thin space in Unicode, and I wonder why the French and Canadian French people present at setup – either on the Unicode side or on the JTC1/SC2/WG2 side – didn't get this character encoded in the initial rush. Did they really sell themselves and their locales to DTP lobbyists? Or were they tricked out? Also, at least one French typographer was extremely upset about Unicode not gathering feedback from typographers. That blame is partly wrong, since at least one typographer was and still is present in WG2, and even if not a Frenchman (but knowing French), as an Anglophone he might have been aware of the most outstanding use case of NNBSP with English (both British and American) quotation marks when a nested quotation starts or ends a quotation, where _“ ‘_ or _‘ “_ and _’ ”_ or _” ’_ are preferred over the unspaced compounds (_“‘_ or _‘“_ and _’”_ or _”’_), at least with proportional fonts. And not to forget the SI-conformant (and later ISO 80000-conformant) use of a thin space (non-breakable of course) for the purpose of grouping digits into triads, both before *and after* the decimal separator. Thanks to Richard Wordingham for catching this. It's a very good point.
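To illustrate that last point, here is a small sketch (plain Python, no locale machinery assumed) that groups digits into triads with U+202F NARROW NO-BREAK SPACE on both sides of the decimal separator, in the SI / ISO 80000 style described above.

NNBSP = "\u202F"  # U+202F NARROW NO-BREAK SPACE

def group_digits(number_text, decimal_sep=","):
    """Insert NNBSP every three digits, before and after the separator."""
    int_part, _, frac_part = number_text.partition(decimal_sep)
    # Group the integer part into triads from the right ...
    grouped_int = NNBSP.join(
        [int_part[max(0, i - 3):i] for i in range(len(int_part), 0, -3)][::-1]
    )
    # ... and the fractional part into triads from the left.
    grouped_frac = NNBSP.join(
        frac_part[i:i + 3] for i in range(0, len(frac_part), 3)
    )
    return grouped_int + (decimal_sep + grouped_frac if frac_part else "")

print(group_digits("299792,458"))        # 299 792,458 (spaces are U+202F)
print(group_digits("31415926,5358979"))  # 31 415 926,535 897 9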
Best regards, Marcel From unicode at unicode.org Wed Jan 16 23:45:32 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 16 Jan 2019 21:45:32 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <86c92c68-921b-e012-fe35-1c14aabc2031@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 00:27:21 2019 From: unicode at unicode.org (Martin J. Dürst via Unicode) Date: Thu, 17 Jan 2019 06:27:21 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> On 2019/01/17 12:38, James Kass via Unicode wrote: > ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf ) > > "Plain text must contain enough information to permit the text to be > rendered legibly, and nothing more." > > The argument is that italic information can be stripped yet still be > read. A persuasive argument towards encoding would need to negate that; > it would have to be shown that removing italic information results in a > loss of meaning. Well, yes. But please be aware of the fact that characters and text are human inventions grown and developed in many cultures over many centuries. It's not something where a single sentence will make all the subsequent decisions easy. So even if you can find examples where the presence or absence of styling clearly makes a semantic difference, this may still not be enough. It's only when it's often or overwhelmingly (as opposed to occasionally) the case that a styling difference makes a semantic difference that this would start to become a real argument for plain-text encoding of italics (or other styling information). To give a similar example, books about typography may discuss the different shapes of 'a' and 'g' in various fonts (often, the roman variant uses one shape (e.g. the 'g' with two circles), and the italic uses the other (e.g. the 'g' with a hook towards the bottom right)). But just because these shapes are semantically different in this context doesn't mean that they need to be distinguished at the plain-text level. (There are variants for IPA that are restricted to specific shapes, namely 'ɑ' and 'ɡ', but that's a separate issue.) > The decision makers at Unicode are familiar with italic use conventions > such as those shown in "The Chicago Manual of Style" (first published in > 1906). The question of plain-text italics has arisen before on this > list and has been quickly dismissed. > > Unicode began with the idea of standardizing existing code pages for the > exchange of computer text using a unique double-byte encoding rather > than relying on code page switching. Latin was "grandfathered" into the > standard. Nobody ever submitted a formal proposal for Basic Latin. > There was no outreach to establish contact with the user community -- > the actual people who used the script as opposed to the "computer nerds" > who grew up with ANSI limitations and subsequent ISO code pages. Because > that's how Unicode rolled back then. Unicode did what it was supposed > to do WRT Basic Latin. I think most Unicode specialists have chosen to ignore this thread by this point.
In their defense, I would like to point out that among the people who started Unicode, there were definitely many people who were very familiar with styling needs. As a simple example, Apple was interested in styled text from the very beginning. Others were very familiar with electronic publishing systems. There were also members from the library community, who had their own requirements and character encoding standards. And many must have known TeX and other kinds of typesetting and publishing software. GML and then SGML were developed by IBM. Based on these data points, and knowing many of the people involved, my description would be that decisions about what to encode as characters (plain text) and what to deal with on a higher layer (rich text) were taken with a wide and deep background, in a gradually forming industry consensus. That doesn't mean that explicit proposals were made for all these decisions. But it means that even where these decisions were made implicitly (at least on the level of the Consortium and the ISO/IEC and national standards body committees), they were made with a full and rich understanding of user needs and technology choices. This led to the layering we have now: case distinctions at the character level, but style distinctions at the rich-text level. Any good technology has layers, and it makes a lot of sense to keep established layers unless some serious problem is discovered. The fact that Twitter (currently) doesn't allow styled text, and that there is a small number of people who (mis)use Math alphabets for writing italics and the like on Twitter, doesn't look like a serious problem to me. > When someone points out that italics are used for disambiguation as well > as stress, the replies are consistent. > > "That's not what plain-text is for." "That's not how plain-text > works." "That's just styling and so should be done in rich-text." > "Since we do that in rich-text already, there's no reason to provide for > it in plain-text." "You can already hack it in plain-text by enclosing > the string with slashes." And so it goes. As such, these answers might indeed not look very convincing. But they are given in the overall framework of text representation in today's technology (see above). And please note that the end user doesn't ask for "italics in plain text"; they ask for "italics on Twitter" or some such. If you ask for "italics in plain text", then to people understanding the whole technology stack, that very much sounds as if you grew up with ASCII and similar plain-text limitations, continuing to be a computer nerd who hasn't yet seen or understood rich text. > But if variant letter form information is stripped from a string like > "Jackie Brown", the primary indication that the string represents either > a person's name or a Tarantino flick title is also stripped. "Thorstein > Veblen" is either a dead economist or the name of a fictional yacht in > the Travis McGee series. And so forth. In probably around 99% or more of the cases, the semantic distinction would be obvious from the context. Also, for probably at least 90% of the readership, the style distinction alone wouldn't induce a semantic distinction, because most of the readers are not familiar with these conventions. (If you doubt that, please go out on the street and ask people what italics are used for, and count how many of them mention film titles or ship names.)
(And just while we are at it, it would still not be clear which of several potential people named "Jackie Brown" or "Thorstein Veblen" would be meant.) > Computer text tradition aside, nobody seems to offer any legitimate > reason why such information isn't worthy of being preservable in > plain-text.? Perhaps there isn't one. See above. > I'm not qualified to assess the impact of italic Unicode inclusion on > the rich-text world as mentioned by David Starner.? Maybe another list > member will offer additional insight or a second opinion. I'd definitely second David Starner on this point. The more options one has to represent one and the same thing (italic styling in this thread), the more complex and error-prone the technology gets. Regards, Martin. From unicode at unicode.org Thu Jan 17 00:36:13 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Wed, 16 Jan 2019 22:36:13 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: On Wed, Jan 16, 2019 at 7:41 PM James Kass via Unicode wrote: > Computer text tradition aside, nobody seems to offer any legitimate > reason why such information isn't worthy of being preservable in > plain-text. Perhaps there isn't one. > Worthy of being preservable? Again, if you want rich text, you know where to find it. Maybe italics could have been encoded in plain text, even as late as 1991. But more than a quarter century on, everything supports italics with a few rare exceptions. You're changing everything at a very low level for a handful of systems. On the other hand, tradition matters. Again, at the bottom of this email I'm drafting is "*B* *I* *U* | *A*? tT?|?"; that is, bold, italics, underline, text color, text size, and extra options, like font choice and lists. Even non-computer geeks are familiar with that distinction. What's the advantage of moving one feature into Unicode and breaking the symmetry? On the other hand, most people won't enter anything into a tweet they can't enter from their keyboard, and if they had to, would resort to cut and paste. The only people Unicode italics could help without change are people who already can use mathematical italics. If you don't have buy-in from systems makers, people will continue to lack practical access to italics in plain text systems. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 01:47:46 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 08:47:46 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> On 17/01/2019 07:36, David Starner via Unicode wrote: [?] > On the other hand, most people won't enter anything into a tweet they can't enter from their keyboard, and if they had to, would resort to cut and paste. The only people Unicode italics could help without change are people who already can use mathematical italics. If you don't have buy-in from systems makers, people will continue to lack practical access to italics in plain text systems. Yes that is the point here, and that?s why I wasn?t proposing anything else than we can input right from the current keyboard layout. 
For italic plain text we would need a second keyboard layout or some corresponding feature, and switch back and forth between the two. It?s feasible, at least for a wide subset of Latin locales, but it?s an action similar to changing the type wheel or the ball-head. Now thankfully the word is out. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 01:54:27 2019 From: unicode at unicode.org (Johannes Bergerhausen via Unicode) Date: Thu, 17 Jan 2019 08:54:27 +0100 Subject: wws dot org In-Reply-To: References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Message-ID: Thanks for the input. I?ll discuss it with Deborah Anderson. We can make possible changes with the next update of the site to Unicode 12. > Am 16.01.2019 um 18:38 schrieb Phake Nick : > > Feedback after briefly reading the East Asia section of the website: > 1. I am pretty sure the "Kaida" script is not living anymore, according to Wikipedia description > 2. Hentaigana refers to all alternative form of kana that're used before modern standardization, I don't think they're still used actively now. > 3. The meaning of the "Old Hanzi" is not clear. If it is the same definition as the one stated in this blog: http://babelstone.blogspot.com/2007/07/old-hanzi.html , then it is not referring to a single script and instead refer to all historical ways to write Hanzi, including Oracle Bone script, Bronze script, and (Small) Seal script and such. And the list have already separately include oracle bone script, bronze script and seal script, which apparently make this "old hanzi" entry redundant. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 02:51:48 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 Jan 2019 08:51:48 +0000 Subject: Encoding italic In-Reply-To: <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> Message-ID: <30bf5809-e89f-9b93-c0b9-021b96e98d48@gmail.com> On 2019-01-17 6:27 AM, Martin J. D?rst replied: > ... > So even if you can find examples where the presence or absence of > styling clearly makes a semantic difference, this may or will not be > enough. It's only when it's often or overwhelmingly (as opposed to > occasionally) the case that a styling difference makes a semantic > difference that this would start to become a real argument for plain > text encoding of italics (or other styling information). (also from PDF chapter 2,) "Plain text is public, standardized, and universally readable." The UCS is universal, which implies that even edge cases, such as failed or experimental historical orthographies, are preserved in plain text. > ... > I think most Unicode specialists have chosen to ignore this thread by > this point. Those not switched off by the thread title may well be exhausted and pressed for time because of the UTC meeting. > ... > Based by these data points, and knowing many of the people involved, my > description would be that decisions about what to encode as characters > (plain text) and what to deal with on a higher layer (rich text) were > taken with a wide and deep background, in a gradually forming industry > consensus. 
(IMO) All of which had to deal with the existing font size limitations of 256 characters and the need to reserve many of those for other textual symbols as well as box drawing characters.? Cause and effect.? The computer fonts weren't designed that way *because* there was a technical notion to create "layers".? It's the other way around.? (If I'm not mistaken.) >> ..."Jackie Brown"... > ... > Also, for probably at least 90% of > the readership, the style distinction alone wouldn't induce a semantic > distinction, because most of the readers are not familiar with these > conventions. Proper spelling and punctuation seem to be dwindling in popularity, as well.? There's a percentage unable to make a semantic distinction between 'your' and 'you?re'. > (If you doubt that, please go out on the street and ask people what > italics are used for, and count how many of them mention film titles or > ship names.) Or the em-dash, en-dash, Mandaic letter ash, or Gurmukhi sign yakash.? Sure, most street people have other interests. > (And just while we are at it, it would still not be clear which of > several potential people named "Jackie Brown" or "Thorstein Veblen" > would be meant.) Isn't that outside the scope of italics?? (winks) From unicode at unicode.org Thu Jan 17 02:58:57 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 Jan 2019 08:58:57 +0000 Subject: NNBSP In-Reply-To: References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: <20190117085857.33e703e5@JRWUBU2> On Thu, 17 Jan 2019 04:51:57 +0100 Marcel Schneider via Unicode wrote: > Also, at least one French typographer was extremely upset > about Unicode not gathering feedback from typographers. > That blame is partly wrong since at least one typographer > was and still is present in WG2, and even if not being a > Frenchman (but knowing French), as an Anglophone he might > have been aware of the most outstanding use case of NNBSP > with English (both British and American) quotation marks > when a nested quotation starts or ends a quotation, where > _???_ or _???_ and _???_ or _???_ are preferred over the > unspaced compounds (_??_ or _??_ and _??_ or _??_), at > least with proportional fonts. There's an alternative view that these rules should be captured by the font and avoid the need for a spacing character. There is an example in the OpenType documentation of the GPOS table where punctuation characters are moved rightwards for French. This alternative conception hits the problem that mass market Microsoft products don't select font behaviour by language, unlike LibreOffice and Firefox. (The downside is that automatic font selection may then favour a font that declares support for the language, which gets silly when most fonts only support that language and don't declare support.) Another spacing mess occurs with the Thai repetition mark U+0E46 THAI CHARACTER MAIYAMOK, which is supposed to be separated from the duplicated word by a space. I'm not sure whether this space should expand for justification any more often than inter-letter spacing. 
Some fonts have taken to including the preceding space in the character's glyph, which messes up interoperability. An explicit space looks ugly when the font includes the space in the repetition mark, and the lack of an explicit space looks illiterate when the font excludes the leading space. Richard. From unicode at unicode.org Thu Jan 17 03:05:03 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 10:05:03 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: Courier New was lacking NNBSP on Windows 7. It is including it on Windows 10. The tests I referred to were made 2 years ago. I confess that I was so disappointed to see Courier New unsupporting NNBSP a decade after encoding, while many relevant people in the industry were surely aware of its role and importance for French (at least those keeping a branch office in France), that I gave it up. Turns out that foundries are delaying support until the usage is backed by TUS, which happened in 2014, timely for Windows 10. (I?m lacking hints about Windows 8 and 8.1.) Superscripts are a handy parallel showcasing a similar process. As long as preformatted superscripts are outlawed by TUS for use in the digital representation of abbreviation indicators, vendors keep disturbing their glyphs with what one could start calling an intentional metrics disorder (IMD). One can also rank the vendors on the basis of the intensity of IMD in preformatted superscripts, but this is not the appropriate thread, and anyhow this List is not the place. A comment on CLDR ticket #11653 is better. [?] > Due to the way made its delayed way into Unicode, font > support was reported as late as almost exactly two years ago to > be extremely scarce, this analysis of the first 47 fonts on > Windows 10 shows: > > https://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf > > Surprisingly for me, Courier New has NNBSP. We must have been > using old copies. I?m really glad that this famous and widely > used typeface has been unpdated. Please disregard my previous > posting about Courier New unsupporting NNBSP. [?] Marcel From unicode at unicode.org Thu Jan 17 04:51:35 2019 From: unicode at unicode.org (Victor Gaultney via Unicode) Date: Thu, 17 Jan 2019 10:51:35 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> Message-ID: <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> ( I appreciate that UTC meetings are going on - I too will be traveling a bit over the next couple of weeks, so may not respond quickly. ) Support for marking 'italic' in plain text - however it's done - would certainly require changes in text processing. That would also be the case for some of the other span-like issues others have mentioned. 
However a clear model for how to handle that could solve all the issues at once. Italic would only be one application of that model, and only applicable to certain scripts. Other scripts might have parallel issues. BTW - I'm speaking only about span-like things that encode content, not the additional level of rich-text presentation. If however, we say that this "does not adequately consider the harm done to the text-processing model that underlies Unicode", then that exposes a weakness in that model. That may be a weakness that we have to accept for a variety of reasons (technical difficulty, burden on developers, UI impact, cost, maturity). We then have to honestly admit that the current model cannot always unambiguously encode text content in English and many other languages. It is impossible to express Crystal's distinction between 'red slippers' and '/red/ slippers' in plain text without using other characters in non-standardized ways. Here I am using my favourite technique for this - /slashes/. There are other uses of italic that indicate difference in actual meaning, many that go back centuries, and for which other span-like punctuation like quotes aren't used. Examples: - Titles of books, films, compositions, works of art: 'Daredevil' - the Marvel comics character vs. '/Daredevil/' - the Netflix series. - Internal voice, such as a character's private thoughts within a narrative: 'She pulled out a knife. /What are you doing? How did you find out.../' - Change of author/speaker, as in editorial comments: '/The following should be considered.../' - Heavy stress in speech, which is different than Crystal's distinction: 'Come here /this instant/' - Examples: 'The phrase /I could care less/...' (quotes are sometimes used for this one) Is it important to preserve these distinctions in plain text? The text seems 'readable' without them, but that requires some knowledge of context. And without some sort of other marking, as I've done, some of the meaning is lost. This is why italics within text have always been considered an editorial decision, not a typesetting one. In a similar way, we really don't need to include diacritics when encoding French. In all but a few rare cases, French is perfectly 'readable' without accents - the content can usually be inferred from context. Yet we would never consider unaccented French to be correct. More evidence for italics as an important element within encoded text comes from current use. A couple of years ago I collected every tweet that referred to italics for a month. People frequently complained that they were not able to express themselves fully without italics, and resorted to 40 different techniques to try and mark words and phrases as 'italic'. In the current model, plain text cannot fully preserve important distinctions in content. Maybe we just need to admit and accept that. But maybe an enhancement to the text processing model would enable more complete encoding of content, both for italics in Latin script and for other features in other scripts. As for how the UIs of the world would need to change: Until there is a way to encode italic in plain text there's no motivation for people to even experiment and innovate. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Jan 17 04:53:41 2019 From: unicode at unicode.org (Victor Gaultney via Unicode) Date: Thu, 17 Jan 2019 10:53:41 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: Andrew Cunningham wrote: > Underlying, bold text, interletter spacing, colour change, font style > change all are used to apply meaning in various ways. Not sure why > italic is special in this sense. Italic is uniquely different from these in that the meaning has been well-established in our writing system for centuries, and is consistently applied. The only one close to being this is use of interletter spacing for distinction, particularly in the German and Czech tradition. Of course, a model that can encode span-like features such as italic could then support other types of distinction. However the meaning of that distinction within the writing system must be clear. IOW people do use colour change to add meaning, but the meaning is not consistent, and so preserving it in plain text is relatively pointless. Even Bold doesn't have a consistent meaning other than - but that's a separate conversation. > And I am curious on your thoughts, if we distinguish italic in > Unicode, encode some way of spacifying italic text, wouldn't it make > more sense to do away with italic fonts all together? and just roll > the italic glyphs into the regular font? That's actually being done now. OpenType variation fonts allow a variety of styles within a single 'font', although I personally feel using that for italic is misguided. The reality is that the most commonly used Latin fonts - OS core ones - all have italic counterparts, so app creators only have to switch to using that counterpart for that span. And if the font has no italic counterpart then a fallback mechanism can kick in - just like is done when a font doesn't have a glyph to represent a character. > In theory changing italic from a stylistic choice as it currently is > to a encoding/character level semantic is a paradigmn shift. Yes it would be - but it could be a beneficial shift, and one that more completely reflects distinctions in the Latin script that go back over 400 years. > But it it were introduced I would prefer a system that was more > inclusive of all scripts, giving proper analysis of typeseting and > typographic conventions in each script and well founded decisions on > which should be encoded. Cherry picking one feature relevant to a > small set of scripts seems to be a problematic path. The core issue here is not really italic in Latin - that's only one case. An adjusted text model that supports span-like text features, could also unlock benefits for other scripts that have consistent span-like features. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Jan 17 05:21:56 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 17 Jan 2019 12:21:56 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <7b65e98e.8180.1683a1744fc.Webtop.228@btinternet.com> <252bcb46-c87d-15a6-d83c-4828d452ebf7@gmail.com> <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: On Thu, 17 Jan 2019 at 05:01, Marcel Schneider via Unicode < unicode at unicode.org> wrote: > On 16/01/2019 21:53, Richard Wordingham via Unicode wrote: > > > > On Tue, 15 Jan 2019 13:25:06 +0100 > > Philippe Verdy via Unicode wrote: > > > >> If your fonts behave incorrectly on your system because it does not > >> map any glyph for NNBSP, don't blame the font or Unicode about this > >> problem, blame the renderer (or the application or OS using it, may > >> be they are very outdated and were not aware of these features, theyt > >> are probably based on old versions of Unicode when NNBSP was still > >> not present even if it was requested since very long at least for > >> French and even English, before even Unicode, and long before > >> Mongolian was then encoded, only in Unicode and not in any known > >> supported legacy charset: Mongolian was specified by borrowing the > >> same NNBSP already designed for Latin, because the Mongolian space > >> had no known specific behavior: the encoded whitespaces in Unicode > >> are compeltely script-neutral, they are generic, and are even > >> BiDi-neutral, they are all usable with any script). > > > > The concept of this codepoint started for Mongolian, but was generalised > > before the character was approved. > > Indeed it was proposed as MONGOLIAN SPACE at block start, which was > consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much > more. But the French "espace fine insécable" was requested long, long before Mongolian was discussed for encoding in the UCS. The problem is that the initial rush for French was made in a period when Unicode and ISO were competing and not in sync, so no agreement could be found until there was a decision to merge the efforts. The early rush was in ISO, which was still not using a character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, the focus being only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA, including the existing ISO 8859-*, GBK, and some national or de facto standards (Russia, Thailand, Japan, Korea). This early rush did not involve typographers (well, there was Adobe at this time, but it was still using another, unrelated technology).
Font standards did not exist yet and competing ones went incompatible ways; everything was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, which needed no character encoding at all, but only encoded glyphs!). If publishers had been involved, they would have revealed that they all needed various whitespaces for correct typography (i.e. layout). Typographers themselves did not care about whitespaces because they had no value for them (no glyph to sell). Adobe's publishing software was then completely proprietary (just like Microsoft's and others' such as Lotus and WordPerfect...). Years ago I was working for the French press, and they absolutely required us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, dictionaries. It was even mandatory to enter these [FINE] in the composed text, and they trained their typists or ad sellers to use it (that character was not "sold" in classified ads; it was necessary for correct layout, notably in narrow columns, and not using it confused the readers, notably around the ":" colon). It had to be non-breaking, non-expanding under justification, narrower than digits and even narrower than the standard non-justified whitespace, and it was consistently used as a digit grouping separator. But at that time the most common OSes did not support it natively because there was no vendor charset supporting it (and in fact most OSes were still unable to render proportional fonts everywhere and were frequently limited to 8-bit encodings: DOS, Windows, Unix(es), and even Linux at its early start). So an intermediate solution was needed. The US chose not to use the non-breakable thin space at all, because it was not needed for basic Latin in English, but also because of the huge prevalence of 7-bit ASCII for everything (with its own national symbol for the "$", competing with other ISO 646 variants). There were tons of legacy applications developed over decades that did not support anything else, and interoperability in the US was available only with ASCII; everything else was unreliable. If you remember the early years when the Internet started to develop outside the US, you remember the nightmare of non-interoperable 8-bit charsets and the famous "mojibake" we saw everywhere. Then the competition between ISO and Unicode lasted too long. But it was considered "too late" for French to change anything, and Windows, used in so many places by so many users, promoted the use of the Windows-1252 charset (which had a few updates before it was definitively frozen: there was no place for NNBSP in it). Typographers and publishers were upset: to use the NNBSP they still needed to use proprietary *document* encodings. The W3C did not help much either (it took long to finally adopt the UCS as a mandatory component of HTML; before that, it depended only on the old IANA charset database, promoting only the work of vendors and a few ISO standards). France itself wanted to keep its own national variant of ISO 646 (inherited from telegraphic systems), but it was finally abandoned: everybody was already using Windows-1252 or ISO 8859-1 (even early Unix adopters, which used a preliminary version made by Digital/DEC, then promoted by X11), or otherwise used Adobe proprietary encodings.
Unix itself had no standard (so many different variants including with other OSes for industrial or accounting systems, made notably by IBM,, which created so many variants, almost one for each submarket, multiple ones in the same country, each time split into mutliple variants between those based on ASCII, and those based on EBCDIC...) The truth is that publishers were forgotten, because their commercial market was much narrower: each publisher then used its own internal conventions. Even libaries used their own classifications. There was no attempt to unifify the needs for publishers (working at document level) and data processors (including OSes). This effort started only very late, when W3C finally started to work seriously on fixing HTML, and make it more or less interoperable with SGML (promoted by publishers). But at national level, there were still lot of other competing standards (let's remember teletext, including the Minitel terminal and Antiope for TV). People at home did not have access to any system capable of rendering proportionaly fonts. All early computers for personal use were based on fixed-width 8-bit fonts (including in Japan). China and Korea were still not technology advanced as they are today (there were some efforts but they were costly and there was little return at that time). The adoption of the UCS was extremely long, and it is still not competely finished even if now its support is mandatory in all new computiong standards and their revisions. The last segment where it still resists is the mobile phone industry (how can the SMS be so restricted and so much non-interoperable, and inefficient?) So French has a long tradition for its "fine", its support was demanded since long but constantly ignored by vendors making "the" standard. Publishers themselves resisted against the adoption of the web as a publishing platform: they prefered their legacy solutions as well, and did not care much about interoperability, so they did not pressure enough the standard makers to adopt the "fine". The same happened in US. There was no "commercial" incentive to adopt it and littel money coming from that sector (that has since suffered a lot from the loss of advertizing revenue, the competition of online publishers, the explosion of paper cost, but as well from the huge piracy level made on the Internet that reduced their sales and then their effective measured audience; the same is happening now on the TV and radio market; and on the Internet the adverizing market has been concentrated a lot and its revenues are less and less balanced; photographs and reporters have difficulties now to live from their work). And there's little incentive now for creating quality products: so many products are developed and distributed very fast, and not enough people care about quality, or won't pay for it. The old good practives of typographs and publishers are most often ignored, they look "exotic" or "old-fashioned", and so many people say now these are "not needed" (just like they'll say that supporting multiple languages is not necessary) -------------- next part -------------- An HTML attachment was scrubbed... 
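[To make the requirement described in the message above concrete, here is a minimal Python sketch of the two plain-text uses of the French "fine": inserting U+202F NARROW NO-BREAK SPACE before the tall punctuation marks and using the same character as the group separator in numbers. The function names, the regular expressions and the punctuation set are assumptions made for this illustration only, not the behaviour of any press system, layout engine or CLDR data.]

    import re

    NNBSP = "\u202F"  # NARROW NO-BREAK SPACE, the "espace fine insécable"

    def add_french_fine(text: str) -> str:
        """Insert a narrow no-break space before ; : ! ? and closing
        guillemets, and after opening guillemets (simplistic sketch)."""
        text = re.sub(r"\s*([;:!?\u00BB])", NNBSP + r"\1", text)
        text = re.sub(r"(\u00AB)\s*", r"\1" + NNBSP, text)
        return text

    def group_digits(number: int) -> str:
        """Format an integer with the narrow no-break space as group separator."""
        return f"{number:,}".replace(",", NNBSP)

    print(add_french_fine("Combien ?"))   # 'Combien\u202f?'
    print(group_digits(1234567))          # '1\u202f234\u202f567'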
URL: From unicode at unicode.org Thu Jan 17 05:31:29 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 12:31:29 +0100 Subject: NNBSP In-Reply-To: <20190117085857.33e703e5@JRWUBU2> References: <20190111011445.1773182d@JRWUBU2> <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <20190117085857.33e703e5@JRWUBU2> Message-ID: <6bd3d417-28c8-7133-6d6a-606ea3a590f8@orange.fr> On 17/01/2019 09:58, Richard Wordingham wrote: > > On Thu, 17 Jan 2019 04:51:57 +0100 > Marcel Schneider via Unicode wrote: > >> Also, at least one French typographer was extremely upset >> about Unicode not gathering feedback from typographers. >> That blame is partly wrong since at least one typographer >> was and still is present in WG2, and even if not being a >> Frenchman (but knowing French), as an Anglophone he might >> have been aware of the most outstanding use case of NNBSP >> with English (both British and American) quotation marks >> when a nested quotation starts or ends a quotation, where >> _“ ‘_ or _’ ”_ and _‘ “_ or _” ’_ are preferred over the >> unspaced compounds (_“‘_ or _’”_ and _‘“_ or _”’_), at >> least with proportional fonts. > > There's an alternative view that these rules should be captured by the > font and avoid the need for a spacing character. There is an example > in the OpenType documentation of the GPOS table where punctuation > characters are moved rightwards for French.

Thanks, I didn't know that this is already implemented. Sometimes one can read in discussions that the issue is relegated to the font level. That always looked utopian to me, all the more as people bringing in former typewriting expertise are trained to type spaces, and I always believed that it's a way for helpless keyboard layout designers to hand the job over. Turns out there is more to it.

But the high-end solution notwithstanding, the use of an extra space character is recommended practice: https://www.businesswritingblog.com/business_writing/2014/02/rules-for-single-quotation-marks.html

The source sums up in an overview: “_The Associated Press Stylebook_ recommends a thin space, whereas _The Gregg Reference Manual_ promotes a full space between the quotation marks. _The Chicago Manual of Style_ says no space is necessary but adds that a space or a thin space can be inserted as ‘a typographical nicety.’” The author cites three other manuals in which she could not retrieve anything about the topic.

We note that all three style guides seem completely unconcerned with non-breakability. Not so the author of the blog post: “[…] If your software moves the double quotation mark to the next line of type, use a nonbreaking space between the two marks to keep them together.” Certainly she would recommend using a NARROW NO-BREAK SPACE if only we had it on the keyboard or if the software provided a handy shortcut by default.

> > This alternative conception hits the problem that mass market Microsoft > products don't select font behaviour by language, unlike LibreOffice > and Firefox. 
(The downside is that automatic font selection may then > favour a font that declares support for the language, which gets silly > when most fonts only support that language and don't declare support.) Another drawback is that most environments don?t provide OpenType support, and that the whole scheme depends on language tags that could easily got lost, and that the issue as being particular to French would quickly boil down to dismiss support as not cost-effective, arguing that *if* some individual locale has special requirements for punctuation layout, its writers are welcome to pick an appropriate space from the UCS and key it in as desired. The same is also observed about Mongolian. Today, the preferred approach for appending suffixes is to encode a Mongolian Suffix Connector to make sure the renderer will use correct shaping, and to leave the space to the writer?s discretion. That looks indeed much better than to impose a hard space that unveiled itself as cumbersome in practice, and that is reported to often get in the way of a usable text layout. The problems related to NNBSP as encountered in Mongolian are completely absent when NNBSP is used with French punctuation or as the regular group separator in numbers. Hence I?m sure that everybody on this List agrees in discouraging changes made to the character properties of NNBSP, such as switching the line breaking class (as "GL" is non-tailorable), or changing general category to Cf, which could be detrimental to French. However we need to admit that NNBSP is basically not a Latin but a Mongolian space, despite being readily attracted into Western typography. A similar disturbance takes place in word processors, where except in Microsoft Word 2013, the NBSP is not justifying as intended and as it is on the web. It?s being hacked and hijacked despite being a bad compromise, for the purpose of French punctuation spacing. That tailoring is in turn very detrimental to Polish users, among others, who need a justifying no-break space for the purpose of prepending one-letter prepositions. Fortunately a Polish user found and shared a workaround using the string , the latter being still used in lieu of WORD JOINER as long as Word keeps unsupporting latest TUS (an issue that raised concern at Microsoft when it was reported, and will probably be fixed or has already been fixed meanwhile). > > Another spacing mess occurs with the Thai repetition mark U+0E46 THAI > CHARACTER MAIYAMOK, which is supposed to be separated from the > duplicated word by a space. I'm not sure whether this space should > expand for justification any more often than inter-letter spacing. Some > fonts have taken to including the preceding space in the character's > glyph, which messes up interoperability. An explicit space looks ugly > when the font includes the space in the repetition mark, and the lack of > an explicit space looks illiterate when the font excludes the leading > space. It seems to me that these disturbances are a case of underspecification. TUS treats U+0E46 thai character maiyamok [1] on a single line in the Thai section, while other marks are given more detailed descriptions. That wouldn?t be problematic per se as long as things are obvious. Obviously here they are not, but no attempt is made on Unicode level to fix them, the less as the encoding proposal, if it could be retrieved, probably would show that it didn?t provide any more details (otherwise Unicode would have implemented them I figure out). 
I suspect that the same holds true for French: Nobody among the relevant people at the forefront cared about making demands and specifying, so TUS authors (who anyway were ?falling like flies?) couldn?t help leaving French alone ? possibly at the discretion of a trend to lock up this key behavior inside proprietary text rendering systems (including proprietary OTF typefaces). That isn?t really what Unicode is about, the less as Latin script typically has scarce OpenType support at reach. It?s just understandable in front of disinterest and unconcernedness. At the other end, Vietnamese typographers didn?t wait for an invitation but started an ?intense lobbying? on their own behalf to get precomposed letters into the Unicode standard a long while before v1.0. Marcel [1] That?s what a copy-pasted snippet from TUS ends up as, despite my kind request about whether to set the character names in the plain text backbone to all-caps and to rather apply a resizing style. From unicode at unicode.org Thu Jan 17 05:50:08 2019 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Thu, 17 Jan 2019 11:50:08 +0000 Subject: Encoding italic In-Reply-To: <30bf5809-e89f-9b93-c0b9-021b96e98d48@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> <30bf5809-e89f-9b93-c0b9-021b96e98d48@gmail.com> Message-ID: <26f42a06-d636-ad62-2176-189fba605bdd@it.aoyama.ac.jp> On 2019/01/17 17:51, James Kass via Unicode wrote: > > On 2019-01-17 6:27 AM, Martin J. D?rst replied: > > ... > > Based by these data points, and knowing many of the people involved, my > > description would be that decisions about what to encode as characters > > (plain text) and what to deal with on a higher layer (rich text) were > > taken with a wide and deep background, in a gradually forming industry > > consensus. > > (IMO) All of which had to deal with the existing font size limitations > of 256 characters and the need to reserve many of those for other > textual symbols as well as box drawing characters.? Cause and effect. > The computer fonts weren't designed that way *because* there was a > technical notion to create "layers".? It's the other way around.? (If > I'm not mistaken.) Most probably not. I think Asmus has already alluded to it, but in good typography, roman and italic fonts are considered separate. They are often used in sets, but it's not impossible e.g. to cut a new italic to an existing roman or the other way round. This predates any 8-bit/256 characters limitations. Also, Unicode from the start knew that it had to deal with more than 256 characters, not only for East Asia, and so I don't think such size limits were a major issue when designing Unicode. On the other hand, the idea that all Unicode characters (or a significant and as yet undetermined subset of them) would need italic,... variants definitely will have let do shooting down such ideas, in particular because Unicode started as strictly 16-bit. Regards, Martin. 
From unicode at unicode.org Thu Jan 17 06:40:51 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 17 Jan 2019 12:40:51 +0000 Subject: Encoding italic In-Reply-To: <26f42a06-d636-ad62-2176-189fba605bdd@it.aoyama.ac.jp> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> <30bf5809-e89f-9b93-c0b9-021b96e98d48@gmail.com> <26f42a06-d636-ad62-2176-189fba605bdd@it.aoyama.ac.jp> Message-ID: <0c239e30-b522-0cbe-effb-cf11e23bb18a@gmail.com> On 2019-01-17 11:50 AM, Martin J. D?rst wrote: > Most probably not. I think Asmus has already alluded to it, but in good > typography, roman and italic fonts are considered separate. So are Latin and Cyrillic fonts.? So are American English and Polish fonts, for that matter, even though they're both Latin based.? Times New Roman and Times New Roman Italic might be two separate font /files/ on computers, but they are the same type face. The point I was trying to make WRT 256-glyph fonts is that they pre-date Unicode and I believe much of the "layering" is based on artifacts from that era. Lead fonts were glyph based.? The technical concept of character came later. From unicode at unicode.org Thu Jan 17 07:36:32 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 14:36:32 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: On 17/01/2019 12:21, Philippe Verdy via Unicode wrote: > > [quoted mail] > > But the French "espace fine ins?cable" was requested long long before Mongolian was discussed for encodinc in the UCS. Then we should be able to read its encoding proposal in the UTC document registry, but Google Search seems unable to retrieve it, so there is a big risk that no such proposal does exist, despite the registry goes back until 1990. The only thing that searches have brought up to me is that the part of UAX?#14 that I?ve quoted in the parent thread has been added by a Unicode Technical Director not mentioned in the author field, and that he did it on request from two gentlemen whose first names only are cited. I?m sure their full names are Martin J. D?rst and Patrick Andries, but I may be wrong. I apologize for the comment I?ve made in my e?mail. Still it would be good to learn why the French use of NNBSP is sort of taken with a grain of salt, while all involved parties were knowing that this NNBSP was (as it still is) the only Unicode character ever encoded able to represent the so-long-asked-for ?espace fine ins?cable.? There is also another question I?m asking since a while: Why the character U+2008 PUNCTUATION SPACE wasn?t given the line break property value "GL" like its sibling U+2007 FIGURE SPACE? This addition to UAX #14 is dated as soon as ?2007-08-08?. Why was the Core Specification not updated in sync, but only a 7 years later? And was Unicode aware that this whitespace is hated by the industry to such an extent that a major vendor denied support in a major font at a major release of a major OS? 
Or did they wait in vain that Martin and Patrick come knocking at their door to beg for font support? Regards, Marcel > The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. Tge early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea). > This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!) > > If publishers had been involded, they would have revealed that they all needed various whitespaces for correct typography (i.e. layout). Typographs themselves did not care about whitespaces because they had no value for them (no glyph to sell). Adobe's publishing software were then completely proprietary (jsut like Microsoft and others like Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, dictionnaries. It was even mandatory to enter these [FINE] in the composed text and they trained their typists or ads sellers to use it (that character was not "sold" in classified ads, it was necessary for correct layout, notably in narrow columns, not using it confused the readers (notably for the ":" colon): it had to be non-breaking, non-expanding by justification, narrower than digits and even narrower than standard non-justified whitespace, > and was consistently used as a decimal grouping separator. > > But at that time the most common OSes did not support it natively because there was no vendor charset supporting it (and in fact most OSes were still unable to render proportional fonts everywhere and were frequently limited to 8-bit encodings (DOS, Windows, Unix(es), and even Linux at its early start). So intermediate solution was needed. Us chose not to use at all the non-breakable thin space because in English it was not needed for basic Latin, but also because of the huge prevalence of 7-bit ASCII for everything (but including its own national symbol for the "$", competing with other ISO 646 variants). There were tons of legacy applications developed ince decenials that did not support anything else and interoperability in US was available ony with ASCII, everything else was unreliable. > > If you remember the early years when the Internet started to develop outside US, you remember the nightmare of non-interoperable 8-bit charsets and the famous "mojibake" we saw everywhere. Then the competition between ISO and Unicode lasted too long. 
But it was considered "too late" for French to change anything (and Windows used in so many places by som many users promoted the use of the Windows-1252 charset (which had a few updates before it was frozen definitely: there was no place for NNBSP in it). Typographers and publishers were upset: to use the NNBSP they still needed to use proprietary *document* encodings. The W3C did not help much too (it was long to finally adopt the UCS as a mandatory component for HTML, before that it depended only on the old IANA charset database promoting only the work of vendors and a few ISO standards). > > France itself wanted to keep its own national variant of ISO 646 (inherited from telegraphic systems), but it was finally abandoned: everybody was already using windows 1252 or ISO 8859-1 (even early Unix adopters which used a preliminary version made by Digital/DEC, then promoted by X11), or otherwise used Adobe proprietary encodings. Unix itself had no standard (so many different variants including with other OSes for industrial or accounting systems, made notably by IBM,, which created so many variants, almost one for each submarket, multiple ones in the same country, each time split into mutliple variants between those based on ASCII, and those based on EBCDIC...) > > The truth is that publishers were forgotten, because their commercial market was much narrower: each publisher then used its own internal conventions. Even libaries used their own classifications. There was no attempt to unifify the needs for publishers (working at document level) and data processors (including OSes). This effort started only very late, when W3C finally started to work seriously on fixing HTML, and make it more or less interoperable with SGML (promoted by publishers). But at national level, there were still lot of other competing standards (let's remember teletext, including the Minitel terminal and Antiope for TV). People at home did not have access to any system capable of rendering proportionaly fonts. All early computers for personal use were based on fixed-width 8-bit fonts (including in Japan). China and Korea were still not technology advanced as they are today (there were some efforts but they were costly and there was little return at that time). > > The adoption of the UCS was extremely long, and it is still not competely finished even if now its support is mandatory in all new computiong standards and their revisions. The last segment where it still resists is the mobile phone industry (how can the SMS be so restricted and so much non-interoperable, and inefficient?) > > So French has a long tradition for its "fine", its support was demanded since long but constantly ignored by vendors making "the" standard. Publishers themselves resisted against the adoption of the web as a publishing platform: they prefered their legacy solutions as well, and did not care much about interoperability, so they did not pressure enough the standard makers to adopt the "fine". The same happened in US. 
There was no "commercial" incentive to adopt it and littel money coming from that sector (that has since suffered a lot from the loss of advertizing revenue, the competition of online publishers, the explosion of paper cost, but as well from the huge piracy level made on the Internet that reduced their sales and then their effective measured audience; the same is happening now on the TV and radio market; and on the Internet the adverizing market has been concentrated a lot and its revenues are less and less balanced; photographs and reporters have difficulties > now to live from their work). > > And there's little incentive now for creating quality products: so many products are developed and distributed very fast, and not enough people care about quality, or won't pay for it. The old good practives of typographs and publishers are most often ignored, they look "exotic" or "old-fashioned", and so many people say now these are "not needed" (just like they'll say that supporting multiple languages is not necessary) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 07:40:22 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 17 Jan 2019 14:40:22 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: If encoding italics means reencoding the normal linguistic usage, it is no ! We already have the nightmares caused by partial encoding of Latin and Greek (als a few Hebrew characters) for maths notations or IPA notations, but they are restricted to a well delimited scope of use and subset, and at least they have relevant scientific sources and auditors for what is needed in serious publications (Anyway these subsets may continue to evolve but very slowly). We could have exceptions added for chemical or electrical notations, if there are standard bodies supporting them. But for linguistic usage, there's no universal agreement and no single authority. Characters are added according to common use (by statistic survey, or because there are some national standard promoting them and sometimes making their use mandatory with defined meanings, sometimes legally binding). For everything else, languages are not constrained and users around the world invent their own letterforms, styles: there' no limit at all and if we start accepting such reencoding, the situation would in fact be worse in terms of interoperability ,because noone can support zillions variants if they are not explicitly encoded separately as surrounding styles, or scoping characters if needed (using contextual characters, possibly variant selectors if these variants are most often isolated). But italics encoded as varaint selectors would just pollute everything; and anyway "italic" is not a single universal convention and does not apply erqually to all scripts). The semantics attached to italic styles also varies from document to documents, and the sema semantics also have different typographic conventions depending on authors, and there's no agreed meaning bout the distinctions they encode. For this reason "italique/oblique/cursive/handwriting..." 
should remain in styles (note also that even the italic transform can be variable, it could also be later a subject of user preferences where people may want to adjust the degree or slanting, according to their reading preferences, or its orientation if they are left-handed to match how they write themselves, or if the writer is a native RTL writer; the context of use (in BiDi) may also adject this slanting orientation, e.g. inserting some Latin in Arabic could present the Latin italic letters slanted backward, to better match the slanting of Arabic itself and avoid collisions of Latin and Arabic glyphs at BiDi boundaries... One can still propose a contextual control character, but it would still be insufficient for correctly representing the many stylistic variants possible: we have better languages to do that now, and CSS (or even HTML) is better for it (including for accessibility requirements: note that there's no way to translate corretly these italics to Braille readers for example; Braille or audio readers attempt to infer an heuristic to reduce the number of contextual words or symbols they need to insert between each character, but using VSn characters would complicate that: they are already processing the standard HTML/CSS conventions to do that much more simply). direct native encoding of italic characters for lingusitic use would fail if it only covers English: it would worsen the language coverage if people are then said to remove the essential diacritics common in their language, only because of the partial coverage of their alphabet. I don't think this is worth the effort (and it would in fact cause lot of maintenance and would severely complicate the addition of new missing letters; and let's not forget the case of common ligatures, correct typograhpic features like kerning which would no longer be supported and would render ugly text if many new kerning pairs are missing in fonts, many fonts used today would no longer work properly, we would have a reduction of stylistic options and less fonts usable, and we would fall into the trap of proprietary solutions with a single provider; it would be too difficult or any font designer to start defining a usable font sellable on various market: these fonts would be reduced to niches, and would no longer find a way to be economically defined and maintained at reasonable cost. Consider the problems orthogonally: even if you use CSS/HTML styles in document encoding (rather than the plain text character encoding) you can also supply the additional semantics clearly in that document, and also encode the intent of the author, or supply enough info to permit alternate renderings (for accessibility, or for technical reasons such as small font sizes on devices will low resolution, or for people with limited vision capabilities). the same will apply to color (whose meaning is not clear, except in specific notations supported by wellknown authorities, or by a long tradition shared by many authors and kept in archives or important text corpus, such as litterature, legal, and publications that have fallen to the public domain after their ini?tial publisher disappeared and their proprietary assets were dissolved: the original documents remain as reliable sources sharable by many and which can guide the development of reuse using them as an established convention that many can now reuse without explaining them too much). 
we can repeat this argument to the other common styles : monospaced, bold, doublestruck, hollow, shadowed, 3D-like, underlining/striking/upperlining, generic subscripts and superscripts (I don't like the partial encoding of Latin letters in subscript/superscript working only for basic modern English, this is an abuse of what was defined mostly for jsut a few wellknown abbreviation or notations that have a long multilingual tradition): authors have much more freedom of creation using separate styles, encoding in an upper-layer protocol. However we can admit that for use in documents not intended to be rendered visually, but used technically, we would need some contextual control characters (just like those for BiDi when HTML/CSS is not usable): these are just needed for compatibility with technical contraints, provided that there's an application support for that and such application is not vendor-specific but sponsored by a wellknown standard (which should then be explicited in Unicode, probably by character properties, just like additional properties used for CJK characters specifying the dictionnary sources). That referenced standard should be open, readable at least by all (even if it is not republishable), and the standard body should have an open contact with the community, and regular meetings to solve incoming issues by defining some policies or the best practices, or the current "state of the art" (if research is still continuing), as well as some rules for making the transition and maintaining a good level of compatibility if this standard evolves or switches to another supported standard. Le jeu. 17 janv. 2019 ? 04:51, James Kass via Unicode a ?crit : > > Victor Gaultney wrote, > > > Treating italic like punctuation is a win for a lot of people: > > Italic Unicode encoding is a win for a lot of people regardless of > approach. Each of the listed wins remains essentially true whether > treated as punctuation, encoded atomically, or selected with VS. > > > My main point in suggesting that Unicode needs these characters is that > > italic has been used to indicate specific meaning - this text is somehow > > special - for over 400 years, and that content should be preserved in > plain > > text. > > ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf ) > > "Plain text must contain enough information to permit the text to be > rendered legibly, and nothing more." > > The argument is that italic information can be stripped yet still be > read. A persuasive argument towards encoding would need to negate that; > it would have to be shown that removing italic information results in a > loss of meaning. > > The decision makers at Unicode are familiar with italic use conventions > such as those shown in "The Chicago Manual of Style" (first published in > 1906). The question of plain-text italics has arisen before on this > list and has been quickly dismissed. > > Unicode began with the idea of standardizing existing code pages for the > exchange of computer text using a unique double-byte encoding rather > than relying on code page switching. Latin was "grandfathered" into the > standard. Nobody ever submitted a formal proposal for Basic Latin. > There was no outreach to establish contact with the user community -- > the actual people who used the script as opposed to the "computer nerds" > who grew up with ANSI limitations and subsequent ISO code pages. > Because that's how Unicode rolled back then. Unicode did what it was > supposed to do WRT Basic Latin. 
> > When someone points out that italics are used for disambiguation as well > as stress, the replies are consistent. > > "That's not what plain-text is for." "That's not how plain-text > works." "That's just styling and so should be done in rich-text." > "Since we do that in rich-text already, there's no reason to provide for > it in plain-text." "You can already hack it in plain-text by enclosing > the string with slashes." And so it goes. > > But if variant letter form information is stripped from a string like > "Jackie Brown", the primary indication that the string represents either > a person's name or a Tarantino flick title is also stripped. "Thorstein > Veblen" is either a dead economist or the name of a fictional yacht in > the Travis McGee series. And so forth. > > Computer text tradition aside, nobody seems to offer any legitimate > reason why such information isn't worthy of being preservable in > plain-text. Perhaps there isn't one. > > I'm not qualified to assess the impact of italic Unicode inclusion on > the rich-text world as mentioned by David Starner. Maybe another list > member will offer additional insight or a second opinion. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 07:57:23 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 14:57:23 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: <5908f7a7-864b-2902-e41f-d77f99b7bbaa@orange.fr>

On 17/01/2019 14:36, I wrote:
> […]
> The only thing that searches have brought up

It was actually the best thing. Here's an even more surprising hit:

    B. In the rules, allow these characters to bridge both alphabetic and numeric words, with:
       * Replace MidLetter by (MidLetter | MidNumLet)
       * Replace MidNum by (MidNum | MidNumLet)
    -------------------------
    4. In addition, the following are also sometimes used, or could be used, as numeric separators (we don't give much guidance as to the best choice in the standard):

       0020  SPACE
       00A0  NO-BREAK SPACE
       2007  FIGURE SPACE
       2008  PUNCTUATION SPACE
       2009  THIN SPACE
       202F  NARROW NO-BREAK SPACE

    If we had good reason to believe that if one of these only really occurred between digits in a single number, then we could add it. I don't have enough information to feel like a proposal for that is warranted, but others may. Short of that, we should at least document in the notes that some implementations may want to tailor MidNum to add some of these.

I fail to understand what hack is going on. Why didn't Unicode wish to sort out which one of these is the group separator?

1. SPACE: is breakable, hence exit.
2. NO-BREAK SPACE: is justifying, hence exit.
3. FIGURE SPACE: has the full width of a digit, too wide, hence exit.
4. PUNCTUATION SPACE: has been left breakable against all reason and evidence and consistency, hence exit?
5. THIN SPACE: is part of the breakable spaces series, hence exit.
6. NARROW NO-BREAK SPACE: is okay.

CLDR has been OK to fix this for French for release 34. 
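[As a concrete illustration of the tailoring suggested in that excerpt, here is a small Python sketch of a number matcher that accepts U+202F NARROW NO-BREAK SPACE as the group separator, the choice argued for just above. It is an assumption-laden example for illustration only, not CLDR or ICU code.]

    import re

    NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

    # A French-style number: groups of three digits separated by a narrow
    # no-break space, with an optional decimal part introduced by a comma.
    FRENCH_NUMBER = re.compile(rf"\d{{1,3}}(?:{NNBSP}\d{{3}})*(?:,\d+)?")

    def find_numbers(text: str) -> list:
        """Return the NNBSP-grouped numbers found in a plain-text string."""
        return FRENCH_NUMBER.findall(text)

    sample = "Prix\u202f: 1\u202f234\u202f567,89 EUR"
    print(find_numbers(sample))  # ['1\u202f234\u202f567,89']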
At present survey?35 all is questioned again, must be assessed, may impact implementations, while all other locales using space are still impacted by bad display using NO-BREAK SPACE. I know we have another public Mail List for that, but I feel it?s important to submit this to a larger community for consideration and eventually, for feedback. Thanks. Regards, Marcel P.S. For completeness: http://unicode.org/L2/L2007/07370-punct.html And also wrt my previous post: https://www.unicode.org/L2/L2007/07209-whistler-uax14.txt -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 11:35:49 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 18:35:49 +0100 Subject: NNBSP (was: A last missing link for interoperable representation) In-Reply-To: References: <5ea63064-56e1-6dd8-08d9-353ced19c698@gmail.com> <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: On 17/01/2019 12:21, Philippe Verdy via Unicode wrote: > > [quoted mail] > > But the French "espace fine ins?cable" was requested long long before Mongolian was discussed for encodinc in the UCS. The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. Tge early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea). > This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!) Thank you for this insight. It is a still untold part of the history of Unicode. It seems that there was little incentive to involve typographers because they have no computer science training, and because they were feared as trying to enforce requirements that Unicode were neither able nor willing to meet, such as distinct code points for italics, bold, small caps? Among the grievances, Unicode is blamed for confusing Greek psili and dasia with comma shapes, and for misinterpreting Latin letter forms such as the u with descender taken for a turned h, and double u mistaken for a turned m, errors that subsequently misled font designers to apply misplaced serifs. 
Things were done in a hassle and a hurry, under the Damokles sword of a hostile ISO messing and menacing to unleash an unusable standard if Unicode wasn?t quicker. > > If publishers had been involded, they would have revealed that they all needed various whitespaces for correct typography (i.e. layout). Typographs themselves did not care about whitespaces because they had no value for them (no glyph to sell). Nevertheless the whole range of traditional space forms was admitted, despite they were going to be of limited usability. And they were given properties. Or can?t the misdefinition of PUNCTUATION SPACE be backtracked to that era? > Adobe's publishing software were then completely proprietary (jsut like Microsoft and others like Lotus, WordPerfect...). Years ago I was working for the French press, and they absolutely required us to manage the [FINE] for use in newspapers, classified ads, articles, guides, phone books, dictionnaries. It was even mandatory to enter these [FINE] in the composed text and they trained their typists or ads sellers to use it (that character was not "sold" in classified ads, it was necessary for correct layout, notably in narrow columns, not using it confused the readers (notably for the ":" colon): it had to be non-breaking, non-expanding by justification, narrower than digits and even narrower than standard non-justified whitespace, and was consistently used as a decimal grouping separator. No doubt they were confident that when an UCS is set up, such an important character wouldn?t be skipped. So confident that they never guessed that they had a key role in reviewing, in providing feedback, in lobbying. Too bad that we?re still so few people today, corporate vetters included, despite much things are still going wrong. > > But at that time the most common OSes did not support it natively because there was no vendor charset supporting it (and in fact most OSes were still unable to render proportional fonts everywhere and were frequently limited to 8-bit encodings (DOS, Windows, Unix(es), and even Linux at its early start). Was there a lack of foresightedness? Turns out that today as those characters are needed, they aren?t ready. Not even the NNBSP. Perhaps it?s the poetic ?justice of time? that since Unicode is on, the Vietnamese are the foremost, and the French the hindmost. [I?m alluding to the early lobbying of Vietnam for a comprehensive set of precomposed letters, while French wasn?t even granted to come into the benefit of the NNBSP ? that according to PRI #308 [1] is today the only known use of NNBSP outside Mongolian ? and a handful ordinal indicators (possibly along with the rest of the alphabet, except q). [1] ?The only other widely noted use for U+202F NNBSP is for representation of the thin non-breaking space (/espace fine ins?cable/) regularly seen next to certain punctuation marks in French style typography.? > So intermediate solution was needed. Us chose not to use at all the non-breakable thin space because in English it was not needed for basic Latin, but also because of the huge prevalence of 7-bit ASCII for everything (but including its own national symbol for the "$", competing with other ISO 646 variants). There were tons of legacy applications developed ince decenials that did not support anything else and interoperability in US was available ony with ASCII, everything else was unreliable. 
Probably because it wouldn?t have made much sense as long as people are unwilling to key in anything more, due to the requirement of maintaining a duplicate Alt key. > > If you remember the early years when the Internet started to develop outside US, you remember the nightmare of non-interoperable 8-bit charsets and the famous "mojibake" we saw everywhere. We can still have mojibake in Windows terminal, at least on Windows 7, and when Latin-1 is coded in UTF-8 and rendered while CP1252 is default. > Then the competition between ISO and Unicode lasted too long. But it was considered "too late" for French to change anything (and Windows used in so many places by som many users promoted the use of the Windows-1252 charset (which had a few updates before it was frozen definitely: there was no place for NNBSP in it). In the wake it could have been relegated to history. What was the plot in keeping bothering end-users with an unusable legacy encoding? > Typographers and publishers were upset: to use the NNBSP they still needed to use proprietary *document* encodings. They still needed? Why didn?t they just refuse to buy it? That would have changed the vendors? minds, I guess. > The W3C did not help much too (it was long to finally adopt the UCS as a mandatory component for HTML, before that it depended only on the old IANA charset database promoting only the work of vendors and a few ISO standards). The W3C hasn?t even defined a named entity for &nnbsp;, like the have done for ‌ Who instructed them to obstruct? > > France itself wanted to keep its own national variant of ISO 646 (inherited from telegraphic systems), but it was finally abandoned: everybody was already using windows 1252 or ISO 8859-1 (even early Unix adopters which used a preliminary version made by Digital/DEC, then promoted by X11), or otherwise used Adobe proprietary encodings. Unix itself had no standard (so many different variants including with other OSes for industrial or accounting systems, made notably by IBM,, which created so many variants, almost one for each submarket, multiple ones in the same country, each time split into mutliple variants between those based on ASCII, and those based on EBCDIC...) Was that the era when the industry wasn?t ready for 16-bit computing? What a nightmare, indeed? But today the problem is that despite that?s all over and pass?, part of the industry seems to keep bullying the NNBSP as if they didn?t want French and other languages to use it right now. > > The truth is that publishers were forgotten, because their commercial market was much narrower: each publisher then used its own internal conventions. Even libaries used their own classifications. There was no attempt to unifify the needs for publishers (working at document level) and data processors (including OSes). This effort started only very late, when W3C finally started to work seriously on fixing HTML, and make it more or less interoperable with SGML (promoted by publishers). Forgetting the publishers is really bad. Now the point is that NNBSP is not only relevant to publishers, but to every single end-user trying to write in French. > But at national level, there were still lot of other competing standards (let's remember teletext, including the Minitel terminal and Antiope for TV). People at home did not have access to any system capable of rendering proportionaly fonts. All early computers for personal use were based on fixed-width 8-bit fonts (including in Japan). 
China and Korea were still not technology advanced as they are today (there were some efforts but they were costly and there was little return at that time). Proportional fonts at home started likely with the Macintosh, IIRC. > > The adoption of the UCS was extremely long, and it is still not competely finished even if now its support is mandatory in all new computiong standards and their revisions. The last segment where it still resists is the mobile phone industry (how can the SMS be so restricted and so much non-interoperable, and inefficient?) I thought that is a limitation proper to the type of cellphone I?m using. > > So French has a long tradition for its "fine", its support was demanded since long but constantly ignored by vendors making "the" standard. So here we have it. The need for NNBSP was ignored by UTC? I?m already fearing that UTC instructed CLDR TC to roll back the NNBSP instead of completing its implementation. Not every company has a principled house policy about doing no evil. All my suspicions about lawless lobbying and malicious marketing are hereby confirmed. That?s driving me mad. I need to stop posting to this list, and mind my business. > Publishers themselves resisted against the adoption of the web as a publishing platform: they prefered their legacy solutions as well, and did not care much about interoperability, so they did not pressure enough the standard makers to adopt the "fine". The same happened in US. There was no "commercial" incentive to adopt it and littel money coming from that sector (that has since suffered a lot from the loss of advertizing revenue, the competition of online publishers, the explosion of paper cost, but as well from the huge piracy level made on the Internet that reduced their sales and then their effective measured audience; the same is happening now on the TV and radio market; and on the Internet the adverizing market has been concentrated a lot and its revenues are less and less balanced; photographs and reporters have difficulties now to live from their work). > > And there's little incentive now for creating quality products: so many products are developed and distributed very fast, and not enough people care about quality, or won't pay for it. The old good practives of typographs and publishers are most often ignored, they look "exotic" or "old-fashioned", and so many people say now these are "not needed" (just like they'll say that supporting multiple languages is not necessary) If the users you?re referring to don?t deserve the right to type in their language?s interoperable representation, there?s no hope. You?re talking about a fringe that is generating part of the information feed on social media. The overwhelming majority of end-users are full of good will, and are very learned people. Like education is set up against illiteracy, fighting in-typography is a matter of training. There?s a mass of fine blogs out there. What may remain to do is only adding to it. Many thanks to Philippe Verdy for this valuable feedback. Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Jan 17 12:50:55 2019 From: unicode at unicode.org (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?= via Unicode) Date: Thu, 17 Jan 2019 19:50:55 +0100 Subject: wws dot org In-Reply-To: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> Message-ID: <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 13:06:48 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 17 Jan 2019 11:06:48 -0800 Subject: NNBSP In-Reply-To: References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 17 13:11:40 2019 From: unicode at unicode.org (=?utf-8?B?5qKB5rW3IExpYW5nIEhhaQ==?= via Unicode) Date: Thu, 17 Jan 2019 11:11:40 -0800 Subject: NNBSP In-Reply-To: <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> Message-ID: [Just a quick note to everyone that, I?ve just subscribed to this public list, and will look into this ongoing Mongolian-related discussion once I?ve mentally recovered from this week?s UTC stress. :)] Best, ?? Liang Hai https://lianghai.github.io > On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode wrote: > > On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote: >>> [quoted mail] >>> >>> But the French "espace fine ins?cable" was requested long long before Mongolian was discussed for encodinc in the UCS. The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. Tge early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered chrsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea). >>> This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). 
Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!) >> >> Thank you for this insight. It is a still untold part of the history of Unicode. > This historical summary does not square in key points with my own recollection (I was there). I would therefore not rely on it as if gospel truth. > > In particular, one of the key technologies that brought industry partners to cooperate around Unicode was font technology, in particular the development of the TrueType Standard. I find it not credible that no typographers were part of that project :). > > Covering existing character sets (National, International and Industry) was an (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion). > > The statement: "there was initially no desire to encode all the languages and scripts" is categorically false. > > (Incidentally, Unicode does not "encode languages" - no character encoding does). > > What has some resemblance of truth is that the understanding of how best to encode whitespace evolved over time. For a long time, there was a confusion whether spaces of different width were simply digital representations of various metal blanks used in hot metal typography to lay out text. As the placement of these was largely handled by the typesetter, not the author, it was felt that they would be better modeled by variable spacing applied mechanically during layout, such as applying indents or justification. > > Gradually it became better understood that there was a second use for these: there are situations where some elements of running text have a gap of a specific width between them, such as a figure space, which is better treated like a character under authors or numeric formatting control than something that gets automatically inserted during layout and rendering. > > Other spaces were found best modeled with a minimal width, subject to expansion during layout if needed. > > > > There is a wide range of typographical quality in printed publication. The late '70s and '80s saw many books published by direct photomechanical reproduction of typescripts. These represent perhaps the bottom end of the quality scale: they did not implement many fine typographical details and their prevalence among technical literature may have impeded the understanding of what character encoding support would be needed for true fine typography. At the same time, Donald Knuth was refining TeX to restore high quality digital typography, initially for mathematics. > > However, TeX did not have an underlying character encoding; it was using a completely different model mediating between source data and final output. (And it did not know anything about typography for other writing systems). > > Therefore, it is not surprising that it took a while and a few false starts to get the encoding model correct for space characters. > > Hopefully, well complete our understanding and resolve the remaining issues. > > A./ > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
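[As a small illustration of the point made above about the figure space being under the author's or numeric-formatting control rather than inserted by the layout engine, the sketch below pads amounts with U+2007 FIGURE SPACE so that a column of figures stays aligned even in proportional type. The helper name and the fixed column width are assumptions of this example.]

    FIGURE_SPACE = "\u2007"  # FIGURE SPACE: digit-wide and non-breaking

    def pad_amount(amount: str, width: int = 8) -> str:
        """Right-align a numeric string by padding with FIGURE SPACE, so a
        column of figures lines up even when set in a proportional font."""
        return FIGURE_SPACE * max(0, width - len(amount)) + amount

    for value in ("7", "42", "1250"):
        print(pad_amount(value))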
URL: From unicode at unicode.org Thu Jan 17 14:21:03 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 17 Jan 2019 20:21:03 +0000 Subject: NNBSP In-Reply-To: References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> Message-ID: <20190117202103.30162bf4@JRWUBU2> On Thu, 17 Jan 2019 18:35:49 +0100 Marcel Schneider via Unicode wrote: > Among the grievances, Unicode is blamed for confusing Greek psili and > dasia with comma shapes, and for misinterpreting Latin letter forms > such as the u with descender taken for a turned h, and double u > mistaken for a turned m, errors that subsequently misled font > designers to apply misplaced serifs. And I suppose that the influence was so great that it travelled back in time to 1976, affecting the typography of the Pelican book 'Phonetics' as reprinted in 1976. Those IPA characters originated in a tradition where new characters had been derived by rotating other characters so as to avoid having to have new type cut. Misplaced serifs appear to be original. Richard. From unicode at unicode.org Thu Jan 17 16:56:27 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 17 Jan 2019 23:56:27 +0100 Subject: NNBSP In-Reply-To: <20190117202103.30162bf4@JRWUBU2> References: <10221c85-e055-6fdf-fcdd-83d47a5877f7@it.aoyama.ac.jp> <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <20190117202103.30162bf4@JRWUBU2> Message-ID: <19e42ead-284a-6527-2546-9b5fe2baf9cd@orange.fr> On 17/01/2019 21:21, Richard Wordingham via Unicode wrote: > > On Thu, 17 Jan 2019 18:35:49 +0100 > Marcel Schneider via Unicode wrote: > > >> Among the grievances, Unicode is blamed for confusing Greek psili and >> dasia with comma shapes, and for misinterpreting Latin letter forms >> such as the u with descender taken for a turned h, and double u >> mistaken for a turned m, errors that subsequently misled font >> designers to apply misplaced serifs. > > And I suppose that the influence was so great that it travelled back in > time to 1976, affecting the typography of the Pelican book 'Phonetics' > as reprinted in 1976. > > Those IPA characters originated in a tradition where new characters had > been derived by rotating other characters so as to avoid having to have > new type cut. Misplaced serifs appear to be original. I merely reported what had been brought up by O. Randier [1]. Thanks for shedding the right light on these issues. The paper comes out diminished, definitely full of errors. This confirms a trend to criticize Unicode instead of cooperating. That would be enough of an explanation why UTC is not ready to make any gifts to French, neither up to now nor after my interoperable-representation-whining. The most that French could get was granted to the Canadian French and only thanks to Patrick Andries and to Martin J. D?rst. I?d like to thank these gentlemen and Ken Whistler who lent an ear and was ready to add a mention in UAX #14. 
That is probably the most that the French language can expect, because it ultimately may not deserve any more, due to French wrongdoing throughout history and fresh in memory after the terrorist attack against Greenpeace. The moral strength needed for a lobbying effort was gone, and the most that people could do was to be upset when NNBSP stayed missing, as Philippe Verdy reported, but not take any action. It wasn't until the Canadian French Patrick Andries asked for a small concession based on what falls off from Mongolian, ending up in General Punctuation due to the foresight of the UTC, that Unicode started supporting French, in a merciful gesture granted through the service door in the backyard. Now I'm likely to be scared into silence, deeply ashamed. (But I'm committed to keep on the job.)

Nothing happens, or does not happen, without a good reason. Finding out that reason is key to recovery. If we want to get what we need, we must do our homework first. Thanks for helping bring it to the point.

Kind regards,

Marcel

[1] http://www.cairn.info/article.php?ID_REVUE=DN&ID_NUMPUBLIE=DN_063&ID_ARTICLE=DN_063_0089&FRM=B#pa29

From unicode at unicode.org Thu Jan 17 17:44:50 2019 From: unicode at unicode.org (=?utf-8?B?IkouwqBTLiBDaG9pIg==?= via Unicode) Date: Thu, 17 Jan 2019 18:44:50 -0500 Subject: Loose character-name matching Message-ID: <60797095-B703-4770-8F85-F045DDED4431@icloud.com>

I'm implementing a Unicode names library. I'm confused about loose character-name matching, even after rereading The Unicode Standard § 4.8, UAX #34 § 4, and UAX #44 § 5.9.2, as well as [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt), [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and the [meeting in which those two items were resolved](https://www.unicode.org/L2/L2014/14026.htm).

In particular, I'm confused by the claim in The Unicode Standard § 4.8 saying, "Because Unicode character names do not contain any underscore ('_') characters, a common strategy is to replace any hyphen-minus or space in a character name by a single '_' when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages. Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching." I'm also confused by the relationship between UAX34-R3 and UAX44-LM2.

To make these issues concrete, let's say that my library provides a function called getCharacter that takes a name argument, tries to find a loosely matching character, and then returns it (or a null value if there is no currently loosely matching character). So then what should the following expressions return?

getCharacter("HANGUL-JUNGSEONG-O-E")
getCharacter("HANGUL_JUNGSEONG_O_E")
getCharacter("HANGUL_JUNGSEONG_O_E_")
getCharacter("HANGUL_JUNGSEONG_O__E")
getCharacter("HANGUL_JUNGSEONG_O_-E")
getCharacter("HANGUL JUNGSEONGCHARACTERO E")
getCharacter("HANGUL JUNGSEONG CHARACTER OE")
getCharacter("TIBETAN_LETTER_A")
getCharacter("TIBETAN_LETTER__A")
getCharacter("TIBETAN_LETTER _A")
getCharacter("TIBETAN_LETTER_-A")

Thanks,
J. S. Choi
-------------- next part --------------
An HTML attachment was scrubbed...
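For concreteness, a rough Python sketch of the folding that UAX44-LM2 seems to call for (an illustration of the rule as read above, not a conformance-tested implementation; the function names are invented, and the U+1180 HANGUL JUNGSEONG O-E exception is only flagged in a comment rather than handled):

    import unicodedata

    def loose_fold(name):
        # Fold per the spirit of UAX44-LM2: uppercase, drop whitespace and
        # underscores, and drop *medial* hyphens.  A hyphen counts as medial
        # only when the characters on both sides, in the original spelling,
        # are neither spaces/underscores nor string boundaries.
        # (UAX #44 carves out U+1180 HANGUL JUNGSEONG O-E as the one name
        # whose hyphen must be kept; this sketch does not special-case it.)
        s = name.upper()
        out = []
        for i, ch in enumerate(s):
            if ch in " _":
                continue
            if ch == "-":
                medial = (0 < i < len(s) - 1
                          and s[i - 1] not in " _"
                          and s[i + 1] not in " _")
                if medial:
                    continue
            out.append(ch)
        return "".join(out)

    def get_character(name):
        # Illustrative only: linear scan over all code points; a real
        # library would precompute an index of folded names.
        target = loose_fold(name)
        for cp in range(0x110000):
            try:
                if loose_fold(unicodedata.name(chr(cp))) == target:
                    return chr(cp)
            except ValueError:
                continue
        return None

    # loose_fold("TIBETAN LETTER A")  -> "TIBETANLETTERA"
    # loose_fold("TIBETAN LETTER -A") -> "TIBETANLETTER-A"
    # (the second hyphen is kept: it follows a space, so it is not medial)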
URL: From unicode at unicode.org Thu Jan 17 18:04:11 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 18 Jan 2019 00:04:11 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <50524911-3be8-e307-2249-c7a7eb47f6ca@gmail.com> For web searching, using the math-string ?????????????? ???????????? as the keywords finds John Maynard Keynes in web pages.? Tested this in both Google and DuckDuckGo.? Seems like search engines are accomodating actual user practices.? This suggests that social media data is possibly already being processed for the benefit of the users (and future historians) by software people who care about such things. From unicode at unicode.org Fri Jan 18 02:11:49 2019 From: unicode at unicode.org (Johannes Bergerhausen via Unicode) Date: Fri, 18 Jan 2019 09:11:49 +0100 Subject: wws dot org In-Reply-To: <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> Message-ID: <3277BE48-2A5A-4060-B6F5-1A0B3566CC78@bergerhausen.com> Thanks a lot for this input! We?ll check this with Deborah Anderson from SEI Berkeley. The update of the web site to Unicode 12.0 will be an opportunity to make some corrections. All the best, Johannes > Am 17.01.2019 um 19:50 schrieb Fr?d?ric Grosshans : > > Thanks for this nice website ! > > Some feedback: > > Given the number of scripts in this period, I think that splitting 10c-19c in two (or even three) would be a good idea > A finer unicode status would be nice > > Coptic is listed as European, while, I think it is Africac, (even if a member of the LGC (LAtin-Greek-Cyrillic) family since, to my knowledge, it has only be used in Africa for African llanguages (Coptic and Old Nubian). > Coptic still used for religious purpose today. Why to you write it dead in the 14th century ? > Khitan Small Script: According to Wikipedia, it ?was invented in about 924 or 925 CE?, not 920 (that is the date of the Khitan Large Script > Cyrillic I think its birth date is 890s, slightly more precice than the 10c you write > You include two well known Tolkienian scripts (Cirth and Tengwar), but you omit the third (first ?) one, the Sarati (see e.g. http://at.mansbjorkman.net/sarati.htm andhttps://en.wikipedia.org > On a side note, you the site considers visible speech as a living-script, which surprised be. This information is indeed in the Wikipedia infobox and implied by its ?HMA status? on the Berkeley SEI page, but the text of the wikipedia page says ?However, although heavily promoted [...] in 1880, after a period of a dozen years or so in which it was applied to the education of the deaf, Visible Speech was found to be more cumbersome [...] compared to other methods, and eventually faded from use.? > > My (cursory) research failed to show a more recent date than this for the system than this ?dosen of year or so [past 1880]? . Is there any indication of the system to be used later? (say, any date in the 20th century) > > > All the best, > > > Fr?d?ric > Le 15/01/2019 ? 19:22, Johannes Bergerhausen via Unicode a ?crit : >> Dear list, >> >> I am happy to report that www.worldswritingsystems.org is now online. >> >> The web site is a joint venture by >> >> ? Institut Designlabor Gutenberg (IDG), Mainz, Germany, >> ? Atelier National de Recherche Typographique (ANRT), Nancy, France and >> ? 
Script Encoding Initiative (SEI), Berkeley, USA. >> >> For every known script, we researched and designed a reference glyph. >> >> You can sort these 292 scripts by Time, Region, Name, Unicode version and Status. >> Exactly half of them (146) are already encoded in Unicode. >> >> Here you can find more about the project: >> www.youtube.com/watch?v=CHh2Ww_bdyQ >> >> And is a link to see the poster: >> https://shop.designinmainz.de/produkt/the-worlds-writing-systems-poster/ >> >> All the best, >> Johannes >> >> >> >> >> ? Prof. Bergerhausen >> Hochschule Mainz, School of Design, Germany >> www.designinmainz.de >> www.decodeunicode.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 06:56:05 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 18 Jan 2019 07:56:05 -0500 Subject: wws dot org In-Reply-To: <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> References: <8D2BAF76-515A-48AC-893D-779444A5A636@bergerhausen.com> <843f5142-13a3-a156-b180-4b1762fb7d3e@gmail.com> Message-ID: <568fbd33-e203-2459-f3db-d5d1d986f673@kli.org> On 1/17/19 1:50 PM, Fr?d?ric Grosshans via Unicode wrote: > > On a side note, you the site considers visible speech as a > living-script, which surprised be. This information is indeed in the > Wikipedia infobox and implied by its ?HMA status? on the Berkeley SEI > page, but the text of the wikipedia page says ?However, although > heavily promoted [...] in 1880, after a period of a dozen years or so > in which it was applied to the education of the deaf, Visible Speech > was found to be more cumbersome [...] compared to other methods,and > eventually faded from use.? > > My (cursory) research failed to show a more recent date than this for > the system than this ?dosen of year or so [past 1880]? . Is there any > indication of the system to be used later? (say, any date in the 20th > century) > I just got email a few days ago from someone who wants to use it on an album cover... But on the whole I think you are correct; I have not seen much use or even study of it (outside of my own and a very few others) in recent times.? And I *still* have to submit a proposal for it to be included in Unicode. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 09:27:17 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 18 Jan 2019 16:27:17 +0100 Subject: NNBSP In-Reply-To: References: <001e01d4a99a$8d3c1610$a7b44230$@xencraft.com> <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> Message-ID: <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> On 17/01/2019 20:11, ?? Liang Hai via Unicode wrote: > [Just a quick note to everyone that, I?ve just subscribed to this public list, and will look into this ongoing Mongolian-related discussion once I?ve mentally recovered from this week?s UTC stress. :)] Welcome to Unicode Public. Hopefully this discussion helps sort things out so that we?ll know both what to do wrt Mongolian and what to do wrt French. 
On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode wrote:

> On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:
>> [On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:]
>>>
>>> [quoted mail]
>>>
>>> But the French "espace fine insécable" was requested long long before Mongolian was discussed for encoding in the UCS. The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. The early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea).
>>> This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!)
>> Thank you for this insight. It is a still untold part of the history of Unicode.
> This historical summary does *not* square in key points with my own recollection (I was there). I would therefore not rely on it as if gospel truth.
>
> In particular, one of the key technologies that _brought industry partners to cooperate around Unicode_ was font technology, in particular the development of the /TrueType/ standard. I find it not credible that no typographers were part of that project :).
>
It is probably part of the (unintentional) fake blames spread by the cited author's paper. My apologies for not sufficiently assessing the reliability of my sources. I'd already identified a number of errors but wasn't savvy enough to see the other one reported by Richard Wordingham. Now the paper ends up as a mere libel. It doesn't mention the lack of NNBSP; instead it piles up a bunch of gratuitous calumnies. Should that be the prevailing mood of average French professionals with respect to Unicode (indeed Patrick Andries is the only French tech writer on Unicode I found whose work is acclaimed; the others are either disliked or silent, or libellers), then I understand only better why a significant majority of UTC is hating French.

Francophobia is also palpable in Canada, beyond any technical reasons, especially in the IT industry. Hence the position of UTC is far from isolated. If ethical and personal considerations inflect decision-making, they should consistently be an integral part of discussions here. In that vein, I'd mention that at the time Unicode was developed, there was a global hatred against France, which originated in French colonial and foreign politics since WWII, and was revived a few years ago by the French government sinking the 𝑅𝑎𝑖𝑛𝑏𝑜𝑤 𝑊𝑎𝑟𝑟𝑖𝑜𝑟 and killing the crew's photographer, in the port of Auckland. That crime triggered a peak of anger.
> > Covering existing character sets (National, International and Industry) was _an_ (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion).
>
I'd take this as a touchstone to infer that there were actual data files including standard typographic spaces as encoded in U+2000..U+2006, and electronic table layout using these: "U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period." Is that correct?

> > The statement: "there was initially no desire to encode all the languages and scripts" is categorically false.
>
Though Unicode was designed as being limited to 65,000 characters, and it was stated that historic scripts were out of scope: only living scripts should be encoded, for interchange.

> > (Incidentally, Unicode does not "encode languages" - no character encoding does).
>
In an often used sense every "language" has its "alphabet", although one does not currently refer to Latin as multiple scripts.

> > What has some resemblance of truth is that the understanding of how best to encode whitespace evolved over time. For a long time, there was a confusion whether spaces of different width were simply digital representations of various metal blanks used in hot metal typography to lay out text. As the placement of these was largely handled by the typesetter, not the author, it was felt that they would be better modeled by variable spacing applied mechanically during layout, such as applying indents or justification.
>
Indeed it is stated that the multiple typographic spaces that made it into the Standard were not used in electronic typesetting and layout.

> > Gradually it became better understood that there was a second use for these: there are situations where some elements of running text have a gap of a specific width between them, such as a figure space, which is better treated like a character under author or numeric-formatting control than something that gets automatically inserted during layout and rendering.
>
There seems to be a confusion about the figure space. What is this space really for?

* The Unicode Standard hints that it was used to fill up empty positions in numeric tables.
* The Unicode Line Break Algorithm UAX #14 understands that it is the group separator, although as such it is neither SI- nor ISO 80000-conformant, nor is it implemented in CLDR. (Fortunately it is not, given it isn't SI/ISO compliant, but it would have been a better pick than NBSP, because unlike NBSP, it is not justifying.)

As you were there, did you see or hear how it happened that, well, FIGURE SPACE (U+2007) was declared non-breakable, and how it happened that at the same time, PUNCTUATION SPACE (U+2008) was not declared non-breakable? Hint: Was it understood (certainly it was) that a non-breakable PUNCTUATION SPACE would have been the "espace fine insécable" (narrow no-break space) that the French users of character sets were languishing after?

> Other spaces were found best modeled with a minimal width, subject to expansion during layout if needed.
>
> There is a wide range of typographical quality in printed publication. The late '70s and '80s saw many books published by direct photomechanical reproduction of typescripts.
These represent perhaps the bottom end of the quality scale: they did not implement many fine typographical details and their prevalence among technical literature may have impeded the understanding of what character encoding support would be needed for true fine typography.
>
By that time, electronic typewriters became widespread, featuring interchangeable fonts (on type wheels), proportional advance width (for use with appropriate fonts), and bold weight (by double-typing with a tiny offset). Additionally, some models had an input buffer with a linear LCD display, mitigating the expense of correction ribbon as typewriters became more and more popular. With ordinary typewriter spacing, the narrow space was not in demand, but with proportional advance width that could have changed. Do you remember the ratio of fixed width to proportional width in the photomechanically reproduced printed matter you are referring to? How were typewriters with proportional width shaping the perception of typography in general, and of whitespace in particular, among the authors of Unicode?

*Fine typography:* There is a current misunderstanding of "fine typography" with respect to the NARROW NO-BREAK SPACE. The use of this character **is not** part of fine typography. It is simply part of the ordinary digital representation of the French language. To declare NNBSP as belonging to "fine typography" is to make it optional. In French and in languages grouping digits with spaces, *NNBSP* is not optional, it *is mandatory*. In the current state of Unicode, NNBSP is the only usable space for the purpose of grouping digits and of spacing off French punctuation (except some old-style French layout of the colon). That space would be *PSP* (PUNCTUATION SPACE) **if** Unicode had made it non-breakable. In that case, the *MONGOLIAN SPACE (MSP) would eventually have been encoded, or rather the *MONGOLIAN SUFFIX CONNECTOR (MSC), for the purpose of particular shaping. If the *MONGOLIAN SPACE had actually been encoded, it would be tailorable ad libitum, and Unicode could change its properties as desired (referring to a proposed change of General category of NNBSP from Zs to Cf, and/or of line-breaking class from GL to BB, IIRC).
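To make the "ordinary digital representation" point concrete, here is a small Python sketch of the spacing in question, inserting a narrow no-break space before the French high punctuation marks and inside guillemets (house rules vary; this is only an illustration, not a normative rule set, and the full NBSP before the colon reflects just one common convention):

    import re

    NNBSP = "\u202F"   # NARROW NO-BREAK SPACE
    NBSP = "\u00A0"    # NO-BREAK SPACE

    def space_french_punctuation(text):
        # narrow no-break space before ; ! ? and a closing guillemet,
        # and after an opening guillemet
        text = re.sub(r"\s*([;!?»])", NNBSP + r"\1", text)
        text = re.sub(r"(«)\s*", r"\1" + NNBSP, text)
        # the colon is often set off with a full, justifying no-break space
        text = re.sub(r"\s*(:)", NBSP + r"\1", text)
        return text

    print(space_french_punctuation("Il a dit « Bonjour ! » ; vraiment ?"))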
> > Hopefully, well complete our understanding and resolve the remaining issues. > > A./ > That is a great promise. Hopefully you are being backed by UTC in making it! Best regards, Marcel P. S.: The name of the Greenpeace flagship has been typeset in italics thanks to Andrew West?s online utility, [1] in respectfulness towards the organization, and with implicit reference to parent and sibling threads. //Please don?t interpret this gesture as backing demands for Unicode representation of italics.// We?re (at least I?m) actually trying to understand more in detail why UTC is struggling against NNBSP as a space (thinking at changing its Gc to Cf), while at encoding time, UTC prompted Mongolian OPs to refrain from requesting a dedicated Mongolian Space rather than shifting the new space into General Punctuation for other scripts? joint convenience. Admittedly, French has been the only script to make extensive use of it [2] ? a highly partial impression given many many other locales are using a space to group digits, and that space is then mandatorily NNBSP; anything else being highly unprofessional. So we?ll look even harder at the new TUS text wrt NNBSP in Mongolian, that Richard Wordingham draw our attention to, and we?d like to understand the role of UTC acting in favor or against NNBSP, possibly with various antagonistic components within UTC. [1] http://www.babelstone.co.uk/Unicode/text.html [2] http://www.unicode.org/review/pri308/pri308-background.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 09:51:18 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 18 Jan 2019 10:51:18 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote: > > Encoding 'begin italic' and 'end italic' would introduce difficulties > when partial strings are moved, etc. But that's no different than with > current punctuation. If you select the second half of a string that > includes an end quote character you end up with a mismatched pair, > with the same problems of interpretation as selecting the second half > of a string including an 'end italic' character. Apps have to deal > with it, and do, as in code editors. > It kinda IS different.? If you paste in half a string, you get a mismatched or unmatched paren or quote or something.? A typo, but a transient one.? It looks bad where it is, but everything else is unaffected.? It's no worse than hitting an extra key by mistake. If you paste in a "begin italic" and miss the "end italic", though, then *all* your text from that point on is affected!? (Or maybe "all until a newline" or some other stopgap ending, but that's just damage-control, not damage-prevention.)? Suddenly, letters and symbols five words/lines/paragraphs/pages look different, the pagination is all altered (by far more than merely a single extra punctuation mark, since italic fonts generally are narrower than roman).? It's a disaster. No.? This kind of statefulness really is beyond what Unicode is designed to cope with.? Bidi controls are (almost?) the sole exception, and even they cause their share of headaches.? Encoding separate _text_ italics/bold is IMO also a disastrous idea, but I'm not putting out reasons for that now.? 
The only really feasible suggestion I've heard is using a VS in some fashion. (Maybe let it affect whole words instead of individual characters?? Makes for fewer noisy VSs, but introduces a whole other host of limitations (how to italicize part of a word, how to italicize non-letters...) and is also just damage-control, though stronger.) > Apps (and font makers) can also choose how to deal with presenting > strings of text that are marked as italic. They can choose to present > visual symbols to indicate begin/end, such as /this/. Or they can > present it using the italic variant of the font, if available. > At which point, you have invented markdown.? Instead of making Unicode declare it, just push for vendors everywhere to recognize /such notation/ as italics (OK, I know, you want dedicated characters for it which can't be confused for anything else.) > - Those who develop plain text apps (social media in particular) don't > have to build in a whole markup/markdown layer into their apps > With the complexity of writing an social media app, a markup layer is really the least of the concerns when it comes to simplifying. > > - Misuse of math chars for pseudo-italic would likely disappear > > - The text runs between markers remain intact, so they need no special > treatment in searching, selecting, etc. > > - It finally, and conclusively, would end the decades of the mess in > HTML that surrounds and . > Adding _another_ solution to something will *never* "conclusively end" anything.? On a good day, you can hope it will swamp the others, but they'll remain at least in legacy.? More likely, it will just add one more way to be confused and another side to the mess.? (People have pointed out here about the difficulties of distinguishing or not-distinguishing between HTML-level and putative plain-text italics.? And yes, that is an issue, and one that already exists with styling that can change case and such.? As with anything, the question is not whether there are going to be problems, but how those problems weigh against potential benefits.? That's an open question.) > My main point in suggesting that Unicode needs these characters is > that italic has been used to indicate specific meaning - this text is > somehow special - for over 400 years, and that content should be > preserved in plain text. > There is something to this: people have been *emphasizing* text in some fashion or another for ages.? There is room to call this plain text. ~mark From unicode at unicode.org Fri Jan 18 09:58:59 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 18 Jan 2019 10:58:59 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> Message-ID: <59a8901e-3f7e-25c6-96da-b3d290df96e4@kli.org> On 1/16/19 7:16 AM, Andrew Cunningham via Unicode wrote: > HI Victor, an off list reply. The contents are just random thoughts > sparked by an interesting conversation. > > On Wed, 16 Jan 2019 at 22:44, Victor Gaultney via Unicode > > wrote: > > > - It finally, and conclusively, would end the decades of the mess > in HTML that surrounds and . > > > I am not sure that would fix the issue, more likely compound the issue > making it even more blurry what the semantic purpose is. HTML5 make > both and semantic ... and by the definition the style of the > elements is not necessarily italic. 
for instance would be script > dependant, may be partially script dependant when another > appropriate semantic tag is missing. A character/encoding level > distinction is just going to compound the mess. A good point, too.? While italics are being used sort of as an example, what the "evidence" really is for (and by evidence I mean what I alluded to at the end of my last post, over centuries of writing) is that people like to *emphasize* things from time to time.? It's really more the semantic side of "this text should be read louder."? So not so much "italic marker" but "emphasis marker." But... that ignores some other points made here, about specific meanings attached to italics (or underlining, in some settings), like distinguishing book or movie titles (or vessel names) from common or proper nouns.? Is it better to lump those with emphasis as "italic", or better to distinguish them semantically, as "emphasis marker" vs "title marker"?? And if we did the latter, would ordinary folks know or care to make that distinction?? I tend to doubt it. > My main point in suggesting that Unicode needs these characters is > that italic has been used to indicate specific meaning - this text > is somehow special - for over 400 years, and that content should > be preserved in plain text. > > > Underlying, bold text, interletter spacing, colour change, font style > change all are used to apply meaning in various ways. Not sure why > italic is special in this sense. Additionally without encoding the > meaning of italic, all you know is that it is italic, not what > convention of semantic meaning lies behind it. Um... yeah.? That's what I meant, also. > > And I am curious on your thoughts, if we distinguish italic in > Unicode, encode some way of spacifying italic text, wouldn't it make > more sense to do away with italic fonts all together? and just roll > the italic glyphs into the regular font? Eh.? Fonts are not really relevant to this.? Unicode already has more characters than you can put into a single font.? It's just as sensible, still, to have italic fonts and switch to them, just like you have to switch to your Thai font when you hit Thai text that your default font doesn't support.? (However, this knocks out the simplicity of using OpenType to handle it, as has been suggested.) ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 10:12:58 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 18 Jan 2019 11:12:58 -0500 Subject: Encoding italic In-Reply-To: <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <59de396f-8955-cbb2-c501-1eff0065f8ec@it.aoyama.ac.jp> Message-ID: <85665441-b4cf-d911-c2c4-98b994546267@kli.org> On 1/17/19 1:27 AM, Martin J. D?rst via Unicode wrote: > > This lead to the layering we have now: Case distinctions at the > character level, but style distinctions at the rich text level. Any good > technology has layers, and it makes a lot of sense to keep established > layers unless some serious problem is discovered. The fact that Twitter > (currently) doesn't allow styled text and that there is a small number > of people who (mis)use Math alphabets for writing italics,... on Twitter > doesn't look like a serious problem to me. How small a number?? How big?? I don't know either.? 
To mention Second Life again, which is pretty strongly defensible as a plain-text environment (with some exceptions, as for hyperlinks), I note that the viewers for it (and the servers?) don't seem to support Unicode characters outside of the BMP.? Which leads the flip-side of the "gappy" mathematical alphabets: you can say SOME things in italic or fraktur or double-struck... but only if they have the correct few letters that happen to be in the BMP already. Obviously, this can and should be blamed on incomplete Unicode support by the software vendors, but it still matters in the same way that "incomplete" markup support (i.e. none) matters to Twitter users: people make do with what they have, and will (mis)use even the few characters they can, though that leads to odd situations (see earlier list of display names.) ~mark From unicode at unicode.org Fri Jan 18 10:44:00 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Fri, 18 Jan 2019 16:44:00 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> Message-ID: <3f2ae1a9.c1d3.16861d90bc2.Webtop.229@btinternet.com> Mark E. Shoulson wrote: > ?, since italic fonts generally are narrower than roman). I remember reading years ago that that was why italic type was invented in the first place in the fifteenth century, so that more text could be got into small format books that could conveniently be carried around. That is, used for all of the text of a book. So not invented for expressing emphasis. The only modern use of all italics text that I can remember seeing in printed books is when poems are typeset in italics. William Overington Friday 18 January 2019 From unicode at unicode.org Fri Jan 18 12:02:45 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 18 Jan 2019 10:02:45 -0800 Subject: NNBSP In-Reply-To: <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> References: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> Message-ID: <6acfa6d0-3f10-db86-eb63-50160c6276b7@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 12:20:22 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 18 Jan 2019 10:20:22 -0800 Subject: NNBSP In-Reply-To: <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> References: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> Message-ID: <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Jan 18 13:09:48 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 18 Jan 2019 11:09:48 -0800 Subject: NNBSP In-Reply-To: <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> References: <6f53152f-3edc-2ad4-b1c9-1767279dc4ae@gmail.com> <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> Message-ID: <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 13:18:10 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 18 Jan 2019 11:18:10 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> Message-ID: <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 13:33:14 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 18 Jan 2019 20:33:14 +0100 Subject: NNBSP In-Reply-To: <6acfa6d0-3f10-db86-eb63-50160c6276b7@ix.netcom.com> References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <6acfa6d0-3f10-db86-eb63-50160c6276b7@ix.netcom.com> Message-ID: <623def57-deca-e316-47de-a303484faa72@orange.fr> On 18/01/2019 19:02, Asmus Freytag via Unicode wrote: > On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote: >> ....I understand only better why a significant majority of UTC is hating French. >> >> Francophobia is also palpable in Canada, beyond any technical reasons, especially in the IT industry. Hence the position of UTC is far from isolated. If ethic and personal considerations inflect decision-making, they should consistently be an integral part of discussions here. In that vein, I?d mention that by the time when Unicode was developed, there was a global hatred against France, that originated in French colonial and foreign politics since WWII, and was revived a few years ago by the French government sinking ????????????????????????????? and killing the crew?s photographer, in the port of Auckland. That crime triggered a peak of anger. > > Again, my recollections do *not support* any issues of _Francophobia_. > > The Unicode Technical committee has always had French people on board, from the beginning, and I have witnessed no issues where they took up a different technical position based on language. Quite the opposite, the UTC generally appreciates when someone can provide native insights into the requirements for supporting a given language. How best to realize these requirements then becomes a joint effort. > > If anything, the Unicode Consortium saw itself from the beginning in contrast to an IT culture for which internationalization at times was still something of an afterthought. > > Given all that, I find your suggestions and? 
implications deeply hurtful and hope you will find a way to avoid a repetition in the future. > > May I suggest that trying to rake over the past and apportion blame is generally less productive than _moving forward _and addressing the outstanding problems. > It is my last-resort track that I?m deeply convinced of. But I?m thankfully eased by not needing to discuss it here further. To point a well-founded behavior is not to blame. You?ll note that I carefully founded how UTC was right in doing so if they did. I wasn?t aware that I was hurtful. You tell me, so I apologize. Please note, though, based on my past e?mail, that I see UTC as a compound of multiple, sometimes antagonistic tendencies. Just an example to help understand what I mean: When Karl Pentzlin proposed to encode a missing French abbreviation indicator, a typographer was directed to argue (on behalf of his employer IIUC) that this would be a case of encoding all scripts in bold and italic. The OP protested that it wasn?t, but he was overheard. That example raises much concern, the more as we were told on this List that decision makers in UTC are refusing to join in open and public discussions here, are only ?duelling ballot comments.? Now since regardless of being right in doing so, they did not at all, I?m plunged again into disarray. May I quote Germaine Tillion, a French ethnologue: It?s important to understand what happens to us; to understand is to exist. ? Originally, ?to exist? meant ?to stand out.? That is still somewhat implied in the strong sense of ?to exist.? Understanding does also help to overcome. That?s why I wrote one e?mail before: Nothing happens, or does not happen, without a good reason. Finding out what reason is key to recoverage. If we want to get what we need, we must do our homework first. Thanks for helping bring it to the point. Kind regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 14:41:38 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 18 Jan 2019 21:41:38 +0100 Subject: NNBSP In-Reply-To: <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> Message-ID: <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> On 18/01/2019 19:20, Asmus Freytag via Unicode wrote: > On 1/18/2019 7:27 AM, Marcel Schneider via Unicode wrote: >>> >>> Covering existing character sets (National, International and Industry) was _an_ (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion). >>> >> I?d take this as a touchstone to infer that there were actual data files including standard typographic spaces as encoded in U+2000..U+2006, and electronic table layout using these: ?U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period.? >> Is that correct? 
> May I remind you that the beginnings of Unicode predate the development of the world wide web. By 1993 the web had developed to where it was possible to easily access material written in different scripts and languages, and by today it is certainly possible to "sample" material to check for character usage.
>
> When Unicode was first developed, it was best to work from the definition of character sets and to assume that anything encoded in a given set was also used somewhere. Several corporations had assembled supersets of character sets that their products were supporting. The most extensive was a collection from IBM. (I'm blanking out on the name for this.)
>
> These collections, which often covered international standard character sets as well, were some of the prime inputs into the early drafts of Unicode. With the merger with ISO 10646 some characters from that effort, but not in the early Unicode drafts, were also added.
>
> The code points from U+2000..U+2008 are part of that early collection.
>
> Note that prior to Unicode, no character set standard described in detail how characters were to be used (with the exception, perhaps, of control functions). Mostly, it was assumed that users knew what these characters were and the function of the character set was just to give a passive enumeration.
>
> Unicode's character property model changed all that - but that meant that properties for all of the characters had to be determined long after they were first encoded in the original sources, and with only scant hints of the identity of what these were intended to be. (Often, the only hint was a character name and a rather poor bitmapped image.)
>
> If you want to know the "legacy" behavior for these characters, it is more useful, therefore, to see how they have been supported in existing software, and how they have been used in documents since then. That gives you a baseline for understanding whether any change or clarification of the properties of one of these code points will break "existing practice".
>
> Breaking existing practice should be a dealbreaker, no matter how well-intentioned a change is. The only exception is where existing implementations are de facto useless, because of glaring inconsistencies or other issues. In such exceptional cases, deprecating some interpretations of a character may be a net win.
>
> However, if there's a consensus interpretation of a given character, then you can't just go in and change it, even if it would make that character work "better" for a given circumstance: you simply don't know (unless you research widely) how people have used that character in documents that work for them. Breaking those documents retroactively is not acceptable.
>
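The kind of survey suggested above is easy enough to approximate; a minimal sketch in Python (the file list and the selection of characters are only an example, not an established methodology):

    import sys
    from collections import Counter

    # Count occurrences of the fixed-width and no-break spaces discussed in
    # this thread across a set of text files, as a first look at "existing
    # practice" before proposing any property change.
    SPACES = ["\u00A0", "\u2007", "\u2008", "\u2009", "\u200A", "\u202F"]

    def survey(paths):
        counts = Counter()
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            for sp in SPACES:
                counts["U+%04X" % ord(sp)] += text.count(sp)
        return counts

    if __name__ == "__main__":
        for cp, n in sorted(survey(sys.argv[1:]).items()):
            print(cp, n)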
In version 11.0 the erroneous part is still uncorrected: ?If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1?.?? Note that TUS has typeset this with the precomposed U+00BE, not with plain digits and fraction slash. If U+2008 PUNCTUATION SPACE is used as intended, changing its line break property from A to GL does not break any implementation nor document. As of possible misuse of the character in ways other than intended, generally there is no point in using as breakable space a space that is actually just a thin variant of U+2007 FIGURE SPACE. Hence the question, again: Why was PUNCTUATION SPACE not declared as non-breakable? Marcel That sample also raises concern, as it showcases how much is done or not done, as appropriate, to keep NNBSP off the usage in Latin script. To what avail? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 15:03:55 2019 From: unicode at unicode.org (Shawn Steele via Unicode) Date: Fri, 18 Jan 2019 21:03:55 +0000 Subject: NNBSP In-Reply-To: <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> Message-ID: I've been lurking on this thread a little. This discussion has gone ?all over the place?, however I?d like to point out that part of the reason NBSP has been used for thousands separators is because that it exists in all of those legacy codepages that were mentioned predating Unicode. Whether or not NNBSP provides a better typographical experience, there are a lot of legacy applications, and even web services, that depend on legacy codepages. NNBSP may be best for layout, but I doubt that making it work perfectly for thousand separators is going to be some sort of magic bullet that solves any problems that NBSP provides. If folks started always using NNBSP, there are a lot of legacy applications that are going to start giving you ? in the middle of your numbers.? Here?s a partial ?dir > out.txt? after changing my number thousands separator to NNBSP in French on Windows (for example). 13/01/2019 09:48 15?360 AcXtrnal.dll 13/01/2019 09:46 54?784 AdaptiveCards.dll 13/01/2019 09:46 67?584 AddressParser.dll 13/01/2019 09:47 24?064 adhapi.dll 13/01/2019 09:47 97?792 adhsvc.dll 10/04/2013 08:32 154?624 AdjustCalendarDate.exe 10/04/2013 08:32 1?190?912 AdjustCalendarDate.pdb 13/01/2019 10:47 534?016 AdmTmpl.dll 13/01/2019 09:48 58?368 adprovider.dll 13/01/2019 10:47 136?704 adrclient.dll 13/01/2019 09:48 248?832 adsldp.dll 13/01/2019 09:46 251?392 adsldpc.dll 13/01/2019 09:48 101?376 adsmsext.dll 13/01/2019 09:48 350?208 adsnt.dll 13/01/2019 09:46 849?920 adtschema.dll 13/01/2019 09:45 146?944 AdvancedEmojiDS.dll There are lots of web services that still don?t expect UTF-8 (I know, bad on them), and many legacy applications that don?t have proper UTF-8 or Unicode support (I know, they should be updated). It doesn?t seem to me that changing French thousands separator to NNBSP solves all of the perceived problems. 
-Shawn ???? ????? http://blogs.msdn.com/shawnste -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 18 16:05:21 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 18 Jan 2019 23:05:21 +0100 Subject: NNBSP In-Reply-To: <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> Message-ID: On 18/01/2019 20:09, Asmus Freytag via Unicode wrote: > > Marcel, > > about your many detailed *technical* questions about the history of character properties, I am afraid I have no specific recollection. > Other List Members are welcome to join in, many of whom are aware of how things happened. My questions are meant to be rather simple. Summing up the premium ones: 1. Why does UTC ignore the need of a non-breakable thin space? 2. Why did UTC not declare PUNCTUATION SPACE non-breakable? A less important information would be how extensively typewriters with proportional advance width were used to write books ready for print. Another question you do answer below: > French is not the only language that uses a space to group figures. In fact, I grew up with thousands separators being spaces, but in much of the existing publications or documents there was certainly a full (ordinary) space being used. Not surprisingly, because in those years documents were typewritten and even many books were simply reproduced from typescript. > > When it comes to figures, there are two different types of spaces. > > One is a space that has the same width a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up. > That is exactly how I understood hot-metal typesetting of tables. What surprises me is why computerized layout does work the same way instead of using tabulations and appropriate tab stops (left, right, centered, decimal [with all decimal separators lining up vertically). > > In lists like that, you can get away with not using a narrow thousands separator, because the overall context of the list indicates which digits belong together and form a number. Having a narrow space may still look nicer, but complicates the space fill between the symbol and the digits. > It does not, provided that all numbers have thousands separators, even if filling with spaces. It looks nicer because it?s more legible. > > Now for numbers in running text using an ordinary space has multiple drawbacks. It's definitely less readable and, in digital representation, if you use 0020 you don't communicate that this is part of a single number that's best not broken across lines. > Right. > > The problem Unicode had is that it did not properly understand which of the two types of "numeric" spaces was represented by "figure space". (I remember that we had discussions on that during the early years, but that they were not really resolved and that we moved on to other issues, of which many were demanding attention). 
> You were discussing whether the thousands separator should have the width of a digit or the width of a period? Consistently with many other choices, the solution would have been to encode them both as non-breakable, the more as both were at hand, leaving the choice to the end-user. Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow? ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The allaged reason is security. Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure out, not in book print > > If you want to do the right thing you need: > > (1) have a solution that works as intended for ALL language using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time. > That is how CLDR works. But as soon as that was set up, I started lobbying for support of all relevant locales at once: https://unicode.org/cldr/trac/ticket/11423 https://unicode.org/pipermail/cldr-users/2018-September/000842.html https://unicode.org/pipermail/cldr-users/2018-September/000843.html and https://unicode.org/cldr/trac/ticket/11423#comment:2 > Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differently acceptability of fallback renderings...). > No I don?t but people may wish to read German Wikipedia: https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen Shared in ticket #11423: https://unicode.org/cldr/trac/ticket/11423#comment:15 > (2) have a solution that works for lining figures as well as separators. > > (3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly that it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts, you can hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography. > There is no such problem except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves the NNBSP for what it is consistently used outside Mongolian: a non-breakable thin space, kind of a belated avatar of what PUNCTUATION SPACE should have been since the beginning. > > Perhaps you see why this issue has languished for so long: getting it right is not a simple matter. > Still it is as simple as not skipping PUNCTUATION SPACE when FIGURE SPACE was made non-breakable. Now we ended up with a mutated Mongolian Space that does not work properly for Mongolian, but does for French and other Latin script using languages. It would even more if TUS was blunter, urging all foundries to update their whole catalogue soon. Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 16:25:06 2019
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 18 Jan 2019 23:25:06 +0100
Subject: NNBSP
In-Reply-To: 
References: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr>
Message-ID: 

On 18/01/2019 22:03, Shawn Steele via Unicode wrote:
>
> I've been lurking on this thread a little.
>
> This discussion has gone "all over the place", however I'd like to point out that part of the reason NBSP has been used for thousands separators is because it exists in all of those legacy codepages that were mentioned predating Unicode.
>
> Whether or not NNBSP provides a better typographical experience, there are a lot of legacy applications, and even web services, that depend on legacy codepages. NNBSP may be best for layout, but I doubt that making it work perfectly for thousands separators is going to be some sort of magic bullet that solves the problems that NBSP presents.
>
> If folks started always using NNBSP, there are a lot of legacy applications that are going to start giving you ? in the middle of your numbers.
>
> Here's a partial "dir > out.txt" after changing my number thousands separator to NNBSP in French on Windows (for example).
>
> 13/01/2019  09:48            15?360 AcXtrnal.dll
> 13/01/2019  09:46            54?784 AdaptiveCards.dll
> 13/01/2019  09:46            67?584 AddressParser.dll
> 13/01/2019  09:47            24?064 adhapi.dll
> 13/01/2019  09:47            97?792 adhsvc.dll
> 10/04/2013  08:32           154?624 AdjustCalendarDate.exe
> 10/04/2013  08:32         1?190?912 AdjustCalendarDate.pdb
> 13/01/2019  10:47           534?016 AdmTmpl.dll
> 13/01/2019  09:48            58?368 adprovider.dll
> 13/01/2019  10:47           136?704 adrclient.dll
> 13/01/2019  09:48           248?832 adsldp.dll
> 13/01/2019  09:46           251?392 adsldpc.dll
> 13/01/2019  09:48           101?376 adsmsext.dll
> 13/01/2019  09:48           350?208 adsnt.dll
> 13/01/2019  09:46           849?920 adtschema.dll
> 13/01/2019  09:45           146?944 AdvancedEmojiDS.dll
>
> There are lots of web services that still don't expect UTF-8 (I know, bad on them), and many legacy applications that don't have proper UTF-8 or Unicode support (I know, they should be updated). It doesn't seem to me that changing the French thousands separator to NNBSP solves all of the perceived problems.
>
Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP. What are all these expected to do while localized with scripts outside Windows code pages?

Also, when you need those apps, just tailor your French accordingly. That should not impact all other users out there interested in a civilized layout, which we cannot get with NBSP (since NBSP is a justifying space, numbers are torn apart in justified layout), nor with FIGURE SPACE as recommended in UAX #14, because it's too wide and has no other benefit. BTW, FIGURE SPACE would end up as the same question mark in the Windows terminal, I guess, based on the above.

As long as Segoe UI has NNBSP support, no worries; that's what CLDR data is for. Any legacy program can always use downgraded data; you can even replace NBSP if the expected output is plain ASCII.
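As a rough sketch of such a downgrade (the file paths are purely illustrative, and whether to fall back to NBSP or to a plain ASCII space is the consumer's choice, not anything CLDR prescribes):

    # Sketch of "downgrading" locale data for legacy consumers: replace
    # U+202F NARROW NO-BREAK SPACE by U+00A0 NO-BREAK SPACE, or by a plain
    # ASCII space if the target really cannot go beyond ASCII.
    from pathlib import Path

    NNBSP = "\u202F"
    NBSP = "\u00A0"

    def downgrade(text, ascii_only=False):
        return text.replace(NNBSP, " " if ascii_only else NBSP)

    src = Path("common/main/fr.xml")   # a local copy of the CLDR file
    dst = Path("compat/fr.xml")        # the downgraded copy
    if src.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(downgrade(src.read_text(encoding="utf-8")), encoding="utf-8")

    # Why the downgrade matters at all: a legacy code page has no mapping for
    # NNBSP, so it degrades to '?', exactly as in the directory listing above.
    print("15\u202F360".encode("cp1252", errors="replace"))   # b'15?360'
    print("15\u00A0360".encode("cp1252", errors="replace"))   # b'15\xa0360'

Running the replacement once over a copy of the data keeps the original intact for Unicode-capable consumers.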
Downgrading is straightforward; the reverse is not true, and that is why vetters are working so hard during CLDR surveys. CLDR data is meant to be high-end; that is the only useful goal. Again, downgrading is easy: just run a tool on the data and the job is done. You'll end up with two libraries instead of one, but at least you're able to provide a good UX in environments supporting any UTF.

Best,

Marcel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 16:46:54 2019
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Fri, 18 Jan 2019 22:46:54 +0000
Subject: NNBSP
In-Reply-To: 
References: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr>
Message-ID: 

>> Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP.

I believe you'll find that there are some French banks and other institutions that depend on such obsolete applications (unfortunately).

Additionally, I believe you'll find that there are many scenarios where older applications and newer applications need to exchange data. Either across the network, the web, or even on the same machine. One app expecting NNBSP and another expecting NBSP on the same machine will likely lead to confusion. This could be something a "new" app running with the latest & greatest locale data and trying to import the legacy data users had saved on that app. Or exchanging data with an application using the system settings which are perhaps older.

>> Also when you need those apps, just tailor your French accordingly.

Having the user attempt to "correct" their settings may not be sufficient to resolve these discrepancies because not all applications or frameworks properly consider the user overrides on all platforms.

>> That should not impact all other users out there interested in a civilized layout.

I'm not sure that the choice of the word "civilized" adds value to the conversation. We have pretty much zero feedback that the OS's French formatting is "uncivilized" or that the NNBSP is required for correct support.

>> As long as Segoe UI has NNBSP support, no worries, that's what CLDR data is for.

For compatibility, I'd actually much prefer that CLDR have an alt "best practice" field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the "best practice" alternative data. As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a "legacy" alt form in a few years.

Normally I'm all for having the "best" data in CLDR, and there are many locales that have data with limited support for whatever reasons. U+00A0 is pretty exceptional in my view though; developers have been hard-coding dependencies on that value for ½ a century without even realizing there might be other types of non-breaking spaces. Sure, that's not really the best practice, particularly in modern computing, but I suspect you'll still find it taught in CS classes with little regard to things like NNBSP.

-Shawn
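One way to blunt the NBSP-versus-NNBSP confusion on the reading side, whatever the stored data happens to contain, is to accept any plausible group separator when parsing numbers back in. A rough sketch, not anything CLDR or ICU prescribes:

    import re

    # Accept any of the space-like characters seen in practice as group
    # separators: SPACE, NO-BREAK SPACE, THIN SPACE, NARROW NO-BREAK SPACE
    # and FIGURE SPACE. Purely a coping strategy for mixed legacy/modern data.
    GROUP_SEPARATORS = "\u0020\u00A0\u2009\u202F\u2007"

    def parse_grouped_int(text):
        cleaned = re.sub("[" + GROUP_SEPARATORS + "]", "", text.strip())
        if not cleaned.isdigit():
            raise ValueError("not a plain grouped integer: %r" % text)
        return int(cleaned)

    assert parse_grouped_int("1\u202F190\u2009912") == 1190912   # NNBSP + THIN SPACE
    assert parse_grouped_int("1\u00A0190 912") == 1190912        # NBSP + SPACE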
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 18:05:31 2019
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 19 Jan 2019 01:05:31 +0100
Subject: NNBSP
In-Reply-To: 
References: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr>
Message-ID: <9babf7ec-af68-f0f5-c04d-1cfaafdad8ac@orange.fr>

On 18/01/2019 23:46, Shawn Steele wrote:
>
> >> Keeping these applications outdated has no other benefit than providing a handy lobbying tool against support of NNBSP.
>
> I believe you'll find that there are some French banks and other institutions that depend on such obsolete applications (unfortunately).
>
If they are obsolete apps, they don't use CLDR / ICU, as these are designed for up-to-date and fully localized apps. So one hassle is off the table.

> Additionally, I believe you'll find that there are many scenarios where older applications and newer applications need to exchange data. Either across the network, the web, or even on the same machine. One app expecting NNBSP and another expecting NBSP on the same machine will likely lead to confusion.
>
I didn't look into these data interchanges, but I suspect they won't use any thousands separator at all to interchange data. The group separator is only for display and print, and there you may wish to use a compat library for obsolete apps, and a newest library for apps with Unicode support. If an app is that obsolete, it will keep working without new data from ICU anyway.

> This could be something a "new" app running with the latest & greatest locale data and trying to import the legacy data users had saved on that app. Or exchanging data with an application using the system settings which are perhaps older.
>
Again, I don't believe that apps are storing numbers with thousands separators in them. Not even spreadsheet software does that. I say "not even" because these are high-end apps where the latest locale data is expected.

Sorry, you did skip this one:

>> What are all these expected to do while localized with scripts outside Windows code pages?

Indeed that is the paradox: Tirhuta users are entitled to correct display with the newest data, while Latin users are bothered indefinitely with old data and legacy display.

> >> Also when you need those apps, just tailor your French accordingly.
>
> Having the user attempt to "correct" their settings may not be sufficient to resolve these discrepancies because not all applications or frameworks properly consider the user overrides on all platforms.
>
Not the user. I'm addressing your concerns as coming from the developer side. I meant you should use the data as appropriate, and if a character is beyond support, just replace it for convenience.

> >> That should not impact all other users out there interested in a civilized layout.
>
> I'm not sure that the choice of the word "civilized" adds value to the conversation.
>
That is to express in a mouthful of English what user feedback is, or can be, even if not all the time. Users are complaining about quotation marks spaced off too far when typeset with NBSP, like Word does. It's really ugly, they say. NBSP is a character with a precise usage; it's not a one-size-fits-all. BTW, as you are in the job, why does Word not provide an option with a checkbox letting the user set the space as desired? NBSP or NNBSP.

>
> We have pretty much zero feedback that the OS's French formatting is "uncivilized" or that the NNBSP is required for correct support.
>
That is, at some point users stop submitting feedback when they see how little use it is to spend time posting it. From the pretty much zero you may wish to pick the one or two you get, guessing that for each one you get there are one thousand other users out there having the same feedback but not submitting it. One thousand or one million, it's hard to be precise...

> >> As long as Segoe UI has NNBSP support, no worries, that's what CLDR data is for.
>
> For compatibility, I'd actually much prefer that CLDR have an alt "best practice" field that maintained the existing U+00A0 behavior for compatibility, yet allowed applications wanting the newer typographic experience to opt in to the "best practice" alternative data. As applications became used to the idea of an alternative for U+00A0, then maybe that could be flip-flopped and U+00A0 put into a "legacy" alt form in a few years.
>
You don't need that field in CLDR. Here's how it works: take the locale data, search-and-replace all NNBSP with NBSP, and there's the library you'll use. Because NNBSP is not only in the group separator. I'd suggest downloading common/main/fr.xml and checking all instances of NNBSP. The legacy apps you're referring to don't use that data for sure. That data is for fine high-end apps and for the user interfaces of Windows and any other OS. If you want your employer to be well served, you'd rather prefer the correct data, not legacy fallbacks.

> Normally I'm all for having the "best" data in CLDR, and there are many locales that have data with limited support for whatever reasons. U+00A0 is pretty exceptional in my view though; developers have been hard-coding dependencies on that value for ½ a century without even realizing there might be other types of non-breaking spaces. Sure, that's not really the best practice, particularly in modern computing, but I suspect you'll still find it taught in CS classes with little regard to things like NNBSP.
>
There have been threads about Unicode in CS curricula. I don't believe that teachers would be doing any good to their students by training them to ignore Unicode. Such teachers would be irresponsible, failing to prepare their students for real life. But I won't base any utterances on mere suspicions. BTW, Latin-1 did not exist 50 years ago. As a rough guess it came up in the early eighties, and NBSP with it, but I may be wrong. The point in sticking with old charsets is, again, to deny Unicode support to one third of mankind. I don't think that this is doing any good.

Marcel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 18:21:12 2019
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Sat, 19 Jan 2019 00:21:12 +0000
Subject: NNBSP
In-Reply-To: <9babf7ec-af68-f0f5-c04d-1cfaafdad8ac@orange.fr>
References: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> <9babf7ec-af68-f0f5-c04d-1cfaafdad8ac@orange.fr>
Message-ID: 

>> If they are obsolete apps, they don't use CLDR / ICU, as these are designed for up-to-date and fully localized apps. So one hassle is off the table.

Windows uses CLDR/ICU. Obsolete apps run on Windows.
That statement is a little narrow-minded.

>> I didn't look into these data interchanges but I suspect they won't use any thousands separator at all to interchange data.

Nope

>> The group separator is only for display and print

Yup, and people do the wrong thing so often that I even blogged about it. https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/

>> Sorry you did skip this one:

Oops, I did mean to respond to that one and accidentally skipped it.

>> What are all these expected to do while localized with scripts outside Windows code pages?

(We call those "unicode-only" locales FWIW)

The users that are not supported by legacy apps can't use those apps (obviously). And folks are strongly encouraged to write apps (and protocols) that Use Unicode (I've blogged about that too). However, the fact that an app may run very poorly in Cherokee or whatever doesn't mean that there aren't a bunch of French enterprises that depend on that app for their day-to-day business. In order for the "unicode-only" locale users to use those apps, the app would need to be updated, or another app with the appropriate functionality would need to be selected. However, that still doesn't impact the current French users that are "ok" with their current non-Unicode app. Yes, I would encourage them to move to Unicode; however, they tend to not want to invest in migration when they don't see an urgent need.

Since Windows depends on CLDR and ICU data, updates to that data mean that those customers can experience pain when trying to upgrade to newer versions of Windows. We get those support calls; they don't tend to pester CLDR. Which is why I suggested an "opt-in" alt form that apps wanting "civilized" behavior could opt into (at least for long enough that enough badly behaved apps would be updated to warrant moving that to the default).

The data for locales like French tends to have been very stable for decades. Changes to data for major locales like that are more disruptive than to newer emerging markets where the data is undergoing more churn.

-Shawn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 18:53:16 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 19 Jan 2019 00:53:16 +0000
Subject: Loose character-name matching
In-Reply-To: <60797095-B703-4770-8F85-F045DDED4431@icloud.com>
References: <60797095-B703-4770-8F85-F045DDED4431@icloud.com>
Message-ID: <20190119005316.7fbb0469@JRWUBU2>

On Thu, 17 Jan 2019 18:44:50 -0500
"J. S. Choi" via Unicode wrote:

> I'm implementing a Unicode names library. I'm confused about loose character-name matching, even after rereading The Unicode Standard § 4.8, UAX #34 § 4, #44 § 5.9.2, as well as [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt), [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and the [meeting in which those two items were resolved](https://www.unicode.org/L2/L2014/14026.htm).
>
> In particular, I'm confused by the claim in The Unicode Standard § 4.8 saying, "Because Unicode character names do not contain any underscore ('_') characters, a common strategy is to replace any hyphen-minus or space in a character name by a single '_' when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages.
> Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching."

Unfortunately, the loose matching rules don't distinguish '__' and '_'. Note that '__' is sometimes forbidden in identifiers.

> I'm also confused by the relationship between UAX34-R3 and UAX44-LM2.
>
> To make these issues concrete, let's say that my library provides a function called getCharacter that takes a name argument, tries to find a loosely matching character, and then returns it (or a null value if there is no currently loosely matching character). So then what should the following expressions return?

Loose matching of names may be looser than prescribed; it shall not be stricter.

> getCharacter("HANGUL-JUNGSEONG-O-E")
U+1180 HANGUL JUNGSEONG O-E, or just possibly null.

> getCharacter("HANGUL_JUNGSEONG_O_E")
U+116C HANGUL JUNGSEONG OE*

> getCharacter("HANGUL_JUNGSEONG_O_E_")
U+116C

> getCharacter("HANGUL_JUNGSEONG_O__E")
U+116C

> getCharacter("HANGUL_JUNGSEONG_O_-E")
U+1180

> getCharacter("HANGUL JUNGSEONGCHARACTERO E")
null or U+116C - up to you. The sequence 'CHARACTER' shall not distinguish names, but loose matching is not required to know this fact.

> getCharacter("HANGUL JUNGSEONG CHARACTER OE")
null or U+116C - up to you.

> getCharacter("TIBETAN_LETTER_A")
U+0F68 TIBETAN LETTER A

> getCharacter("TIBETAN_LETTER__A")
U+0F68 TIBETAN LETTER A**

> getCharacter("TIBETAN_LETTER _A")
U+0F68

> getCharacter("TIBETAN_LETTER_-A")
U+0F60 TIBETAN LETTER -A

*This is unfortunate, as the usual symbolic name for U+1180 would be HANGUL_JUNGSEONG_O_E.

**This is also unfortunate, as the usual symbolic name for U+0F60 would be TIBETAN_LETTER__A.

The key problem here is that the hyphen after a space is required in names as understood by the name property. The hyphen is also required in "HANGUL JUNGSEONG O-E".

The simple tactic is:

1) Canonicalise, by stripping out spaces, underscores and medial hyphens and lowercasing. (It's probably better to fold the character U+0131 LATIN SMALL LETTER DOTLESS I to 'i'.)
2) Look the result up.
3) If you get the result U+116C but the input matches ".*[oO]-[eE][_- ]*$", convert to U+1180.
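In code, that tactic might look something like the rough sketch below; the INDEX dict is a toy stand-in for a real name index, and the details of fold() are one reading of the steps above, not anything normative:

    import re

    def fold(name):
        """Step 1: lowercase, drop spaces, underscores and *medial* hyphens
        (a hyphen directly between two letters/digits); also fold U+0131
        dotless i to plain i."""
        s = name.replace("\u0131", "i")
        s = re.sub(r"(?<=[0-9A-Za-z])-(?=[0-9A-Za-z])", "", s)   # medial hyphens
        return re.sub(r"[ _]", "", s).lower()

    # Toy stand-in for a full name index, keyed by folded name. U+1180
    # HANGUL JUNGSEONG O-E keeps its hyphen (the one UAX44-LM2 exception),
    # while the non-medial hyphen of TIBETAN LETTER -A survives folding.
    INDEX = {
        "hanguljungseongoe":  "\u116C",   # HANGUL JUNGSEONG OE
        "hanguljungseongo-e": "\u1180",   # HANGUL JUNGSEONG O-E (special case)
        "tibetanlettera":     "\u0F68",   # TIBETAN LETTER A
        "tibetanletter-a":    "\u0F60",   # TIBETAN LETTER -A
    }

    def get_character(name):
        cp = INDEX.get(fold(name))                        # steps 1 and 2
        if cp == "\u116C" and re.search(r"[oO]-[eE][_\- ]*$", name):
            cp = "\u1180"                                 # step 3
        return cp

    assert get_character("HANGUL_JUNGSEONG_O_E") == "\u116C"
    assert get_character("HANGUL_JUNGSEONG_O_-E") == "\u1180"
    assert get_character("HANGUL-JUNGSEONG-O-E") == "\u1180"
    assert get_character("TIBETAN_LETTER_-A") == "\u0F60"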
Symbolic identifiers in programs need not match the name; one may choose to depend on the compiler or interpreter to catch duplicates; some will, some won't. Replacing '-' by '_' to convert a name to an identifier loses the distinction between a hyphen and an arbitrarily inserted space.

Richard.

From unicode at unicode.org Fri Jan 18 18:55:07 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 18 Jan 2019 16:55:07 -0800
Subject: NNBSP
In-Reply-To: 
References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com>
Message-ID: <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 18:58:05 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 18 Jan 2019 16:58:05 -0800
Subject: NNBSP
In-Reply-To: 
References: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Fri Jan 18 19:49:09 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 19 Jan 2019 01:49:09 +0000
Subject: NNBSP
In-Reply-To: <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com>
References: <20190112132221.7497fdea@JRWUBU2> <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com>
Message-ID: <20190119014909.4b093988@JRWUBU2>

On Fri, 18 Jan 2019 10:20:22 -0800
Asmus Freytag via Unicode wrote:

> However, if there's a consensus interpretation of a given character then you can't just go in and change it, even if it would make that character work "better" for a given circumstance: you simply don't know (unless you research widely) how people have used that character in documents that work for them. Breaking those documents retroactively is not acceptable.

Unless the UCD contains a contrary definition only usable where the character wouldn't normally be used, in which case it is fine to try to kick the character's users in the teeth. I am referring to the belief that ZWSP separated words, whereas the UCD only defined it as a lay-out control. That outlawed belief has recently been very helpful to me in using (as opposed to testing) a nod-Lana spell-checker on Firefox.

Richard.

From unicode at unicode.org Sat Jan 19 01:34:13 2019
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 19 Jan 2019 08:34:13 +0100
Subject: NNBSP
In-Reply-To: <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com>
References: <2041ec23-95e8-e5c4-5702-0ad4f6a54f0b@gmail.com> <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com>
Message-ID: 

On 19/01/2019 01:55, Asmus Freytag via Unicode wrote:
> On 1/18/2019 2:05 PM, Marcel Schneider via Unicode wrote:
>> On 18/01/2019 20:09, Asmus Freytag via Unicode wrote:
>>>
>>> Marcel,
>>>
>>> about your many detailed *technical* questions about the history of character properties, I am afraid I have no specific recollection.
>>>
>> Other List Members are welcome to join in, many of whom are aware of how things happened. My questions are meant to be rather simple. Summing up the main ones:
>>
>> 1. Why does UTC ignore the need of a non-breakable thin space?
>> 2. Why did UTC not declare PUNCTUATION SPACE non-breakable?
>>
>> A less important piece of information would be how extensively typewriters with proportional advance width were used to write books ready for print.
>> >> Another question you do answer below: >> >>> French is not the only language that uses a space to group figures. In fact, I grew up with thousands separators being spaces, but in much of the existing publications or documents there was certainly a full (ordinary) space being used. Not surprisingly, because in those years documents were typewritten and even many books were simply reproduced from typescript. >>> >>> When it comes to figures, there are two different types of spaces. >>> >>> One is a space that has the same width a digit and is used in the layout of lists. For example, if you have a leading currency symbol, you may want to have that lined up on the left and leave the digits representing the amounts "ragged". You would fill the intervening spaces with this "lining" space character and everything lines up. >>> >> That is exactly how I understood hot-metal typesetting of tables. What surprises me is why computerized layout does work the same way instead of using tabulations and appropriate tab stops (left, right, centered, decimal [with all decimal separators lining up vertically). > > ==> At the time Unicode was first created (and definitely before that, during the time of non-universal character sets) many applications existed that used a "typewriter model" and worked by space fill rather than decimal-point tabulation. > If you are talking about applications, as opposed to typesetting tables for book printing, then I?d suggest that the fixed-width display of tables could be done much like still today?s source code layout, where normal space is used for that purpose. In this use case, line wrap is typically turned off. That could make non-breakable spaces sort of pointless (but I?m aware of your point below), except if people are expected to re-use the data in other environments. In that case, best practice is to use NNBSP as thousands separator while displaying it like other monospace characters. That?s at least how today?s monospace fonts work (provided they?re used in environments actually supporting Unicode, which may not happen with applications running in terminal). > > From today's perspective that older model is inflexible and not the best approach, but it is impossible to say how long this legacy approach hung on in some places and how much data might exist that relied on certain long-standing behaviors of these space characters. > My position since some time is that legacy apps should use legacy libraries. But I?ll come back on this when responding to Shawn Steele. > > For a good solution, you always need to understand > > (1) the requirement of your "index" case (French, in this case) > That?s okay. > > (2) how it relates to similar requirements in (all!) other languages / scripts > That?s rather up to CLDR as I suggested, given it has the means to submit a point to all vetters. See again below (in the part that you?ve cut off without consideration). > > (3) how it relates to actual legacy practice > That?s Shawn Steele?s point (see next reply). > > (3a) what will suddenly no longer work if you change the properties on some character > > (3b) what older data will no longer work if the effective behavior of newer applications changes > I?ll already note that this needs to be aware of actual use cases and/or to delve into the OSes, that is far beyond what I can currently do, both wrt time and wrt resources. The vetter?s role is to inform CLDR with correct data from their locale. 
CLDR is then welcome to sort things out and to get in touch with the industry, which the CLDR TC is actually doing. But that has no impact on the data submitted at survey time. Changing votes to say "OK, let the group separator be NBSP as long as..." would be a lie.

>>> In lists like that, you can get away with not using a narrow thousands separator, because the overall context of the list indicates which digits belong together and form a number. Having a narrow space may still look nicer, but complicates the space fill between the symbol and the digits.
>>>
>> It does not, provided that all numbers have thousands separators, even if filling with spaces. It looks nicer because it's more legible.
>>>
>>> Now for numbers in running text using an ordinary space has multiple drawbacks. It's definitely less readable and, in digital representation, if you use 0020 you don't communicate that this is part of a single number that's best not broken across lines.
>>>
>> Right.
>>>
>>> The problem Unicode had is that it did not properly understand which of the two types of "numeric" spaces was represented by "figure space". (I remember that we had discussions on that during the early years, but that they were not really resolved and that we moved on to other issues, of which many were demanding attention).
>>>
>> You were discussing whether the thousands separator should have the width of a digit or the width of a period? Consistently with many other choices, the solution would have been to encode them both as non-breakable, all the more as both were at hand, leaving the choice to the end-user.
>
> ==> Right, but remember, we started off encoding a set of spaces that existed before Unicode (in some other character sets) and implicitly made the assumption that those were the correct set (just like we took punctuation from ASCII and similar sources and only added to it later, when we understood that they were missing things --- generally always added, generally did not redefine behavior or shape of existing code points).
>
Now I understand that what UAX #14 calls "the preferred space for use in numbers" is actually preferred in the table layout you are referring to, because it is easier to code when only the empty decimal separator position uses PUNCTUATION SPACE, while grouping is performed with FIGURE SPACE. That raises two questions, one of which has often been asked in this thread:

1. How is FIGURE SPACE supposed to be supported in legacy environments? (UAX #14 mentions both its line breaking behavior and its width, but makes no concessions for legacy apps?)
2. Why was PUNCTUATION SPACE not declared non-breakable? (If it had been, it could have been re-purposed to space off French punctuation since the beginning of Unicode, and French users would never have had a reason to be upset by the lack of a narrow non-breaking space.)

>> Current practice in electronic publishing was to use a non-breakable thin space, Philippe Verdy reports. Did that information come in somehow?
>
> ==> probably not in the early days. Y
>
Perhaps it was ignored from the beginning on, like Philippe Verdy reports that UTC ignored later demands, getting users upset. That leaves us with the question why it did so, downstream of your statement that it was not what I ended up suspecting. Does "Y" stand for the peace symbol?

>> ISO 31-0 was published in 1992, perhaps too late for Unicode. It is normally understood that the thousands separator should not have the width of a digit. The alleged reason is security.
Though on a typewriter, as you state, there is scarcely any other option. By that time, all computerized text was fixed width, Philippe Verdy reports. On-screen, I figure out, not in book print > > ==> much book printing was also done by photomechanically reproducing typescript at that time. Not everybody wanted to pay typesetters and digital typesetting wasn't as advanced. I actually did use a digital phototypesetter of the period a few years before I joined Unicode, so I know. It was more powerful than a typewriter, but not as powerful as TeX or later the Adobe products. > > For one, you didn't typeset a page, only a column of text, and it required manual paste-up etc. > Did you also see typewriters with proportional advance width (and interchangeable type wheels)? That was the high end on the typewriter market. (Already mentioned these typewriters in a previous e?mail.) Books typeset this way could use bold and (less easy) italic spans. > >>> If you want to do the right thing you need: >>> >>> (1) have a solution that works as intended for ALL language using some form of blank as a thousands separator - solving only the French issue is not enough. We should not do this a language at a time. >>> >> That is how CLDR works. > > CLDR data is by definition per-language. Except for inheritance, languages are independent. > > There are no "French" characters. When you encode characters, at best, some code points may be script-specific. For punctuation and spaces not even that may be the case. Therefore, as long as you try to solve this as if it *only* was a French problem, you are not doing proper character encoding. > Again, I did not do that (and BTW CLDR is not doing ?character encoding?). Actually, to be able to post that blame you needed to cut off all the URLs I provided you with. These links are documenting that i did not ?try to solve this as if it only was a French problem[.]? Here they are again, this time with copy-pasted snippets below. I wrote: ?But as soon as that was set up, I started lobbying for support of all relevant locales at once:? https://unicode.org/cldr/trac/ticket/11423 https://unicode.org/pipermail/cldr-users/2018-September/000842.html * ?To be cost-effective, locales using space as numbers group separator should migrate at once from the wrong U+00A0 to the correct U+202F. I didn?t aim at making French stand out, but at correcting an error in CLDR. Having even the Canadian French sublocale stick with the wrong value makes no sense and is mainly due to opaque inheritance relationships and to severe constraints on vetters applying for fr-FR and subsequently reduced to look on helpless from the sidelines when sublocales are not getting fixed.? * ?After having painstakingly catched up support of some narrow fixed-width no-break space (U+202F). the industry is now ready to migrate from U+00A0 to U+202F. Doing it in a single rush is way more cost-effective than migrating one locale this time, another locale next time, a handful locales the time after, possibly splitting them up in sublocales with different migration schedules. I really believed that now Unicode proves ready to adopt the real group separator in French, all relevant locales would be consistently pushed for correcting that value in release 34. The v34 alpha overview makes clear they are not. ? http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration I aimed at correcting an error in CLDR, not at making French stand out. 
Having many locales and sublocales stick with the wrong value makes no sense any more. ?https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d The only effect is implementers skipping migration for fr-FR while waiting for the others to catch up, then doing it for all at once. There seems to be a misunderstanding: The*locale setting *is whether to use period, comma, space, apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic. Whether "space" is NO-BREAK SPACE or NARROW NO-BREAK SPACE is *not a locale setting,* but it?s all about Unicode *design* and Unicode *implementation.* I really thought that that was clear and that there?s no need to heavily insist on the ST "French" forum. When referring to the "French thousands separator" I only meant that unlike comma- or period-using locales, the French locale uses space and that the group separator space should be the correct one. That did *not* mean that French should use *another* space than the other locales using space.? https://unicode.org/pipermail/cldr-users/2018-September/000843.html and https://unicode.org/cldr/trac/ticket/11423#comment:2 * ?I've to confess that I did focus on French and only applied for fr-FR, but there was a lot of work, see ? http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth waiting for very few vetters. Nevertheless I also cared for English (see various tickets), and also posted on CLDR-users in a belated P.S. that fr-CA hadn?t caught up the group separator correction yet: ?https://unicode.org/pipermail/cldr-users/2018-August/000825.html Also I?m sorry for failing to provide appropriate feedback after beta release and to post upstream messages urging to make sure all locales using space for group separator be kept in synchrony. I think the point about not splitting up all the data into locales is a very good one. There should be a common pool so that all locales using Arabic script have automatically group separator set to ARABIC THOUSANDS SEPARATOR (provided it actually fits all), and those locales using space should only need to specify "space" to automatically get the correct one, ie NARROW NO-BREAK SPACE as soon as Unicode is ready to give it currency in that role.? Do these recommendations meet your requirements and sound okay to you? >>> >>> Do you have colleagues in Germany and other countries that can confirm whether their practice matches the French usage in all details, or whether there are differences? (Including differently acceptability of fallback renderings...). >>> >> No I don?t but people may wish to read German Wikipedia: >> >> https://de.wikipedia.org/wiki/Zifferngruppierung#Mit_dem_Tausendertrennzeichen >> >> Shared in ticket #11423: >> https://unicode.org/cldr/trac/ticket/11423#comment:15 > > > ==> for your proposal to be effective, you need to reach out. > Basically we vetters are just reporting the locale date. Beyond that, I?ve already conceded a huge effort in reporting bugs in English data and in communicating on lists and fora, including German (since the current survey that has a very limited scope). I have limited time and resources. Normally reaching out to all relevant locales is what CLDR can do best, by posting guidelines. by e-mailing (on behalf of CLDR administrator and/or on the public CLDR-users Mail List), and by prioritizing the items on the vetters? dashboards. If I can do something else, I?m ready but people should not abuse since I?ve many other tasks I won?t be going to deprioritize any longer. 
At some point I?ll just start reporting to end-users that we?ve strived to get locale data in synch, but that CLDR ended up rolling back our efforts, alleging other priorities. If that is what you wish, I?d say that there?s no problem for me except that I strongly dislike documenting an ugly mess. > >> >>> (2) have a solution that works for lining figures as well as separators. >>> >>> (3) have a solution that understands ALL uses of spaces that are narrower than normal space. Once a character exists in Unicode, people will use it on the basis of "closest fit" to make it do (approximately) what they want. Your proposal needs to address any issues that would be caused by reinterpreting a character more narrowly that it has been used. Only by comprehensively identifying ALL uses of comparable spaces in various languages and scripts, you can hope to develop a solution that doesn't simply break all non-French text in favor of supporting French typography. >>> >> There is no such problem except that NNBSP has never worked properly in Mongolian. It was an encoding error, and that is the reason why to date, all font developers unanimously request the Mongolian Suffix Connector. That leaves the NNBSP for what it is consistently used outside Mongolian: a non-breakable thin space, kind of a belated avatar >> of what PUNCTUATION SPACE should have been since the beginning. > > ==> I mentioned before that if something is universally "broken" it can sometimes be resurrected, because even if you change its behavior retroactively, it will not change something that ever worked correctly. (But you need to be sure that nobody repurposed the NNBSP for something useful that is different from what you intend to use it for, otherwise you can't change anything about it). > You may wish to look up Unicode?s own PRI#308 background page, where they already hinted they?ve made sure it isn?t. http://www.unicode.org/review/pri308/pri308-background.html https://www.unicode.org/review/pri308/ https://www.unicode.org/review/pri308/feedback.html > If, however, you are merely adding a use for some existing character that does not affect its properties, that is usually not as much of a problem - as long as we can have some confidence that both usages will continue to be possible. > Actually, again, there is a problem with NNBSP in Mongolian. Richard Wordingham reported at thread launch that Unicode have started tweaking that space in a way that makes it unfit for French. Now since you are aware that this operating mode is wrong, I?d suggest that you reach back to them providing feedback about inappropriateness of last changes. Other people (including me) may do that as well, but I see better chances for your recommendations to get implemented. I say that because lastly I strongly recommended in several pieces of feedback that the math symbols should not be bidi-mirrored on a tilde?reversed-tilde basis, because mirroring these compromises legibility of the tilde symbol in low-end environments relying on glyph-exchange-bidi-mirroring for best-fit display, but UTC took no action, and off-list I was taught that UTC is not interested. Nothing else than that, in private mail. UTC are just not interested, without providing any technical reasons. 
Perhaps you better understand now why I posted what I suspected to be the reason why UTC is not interested, or was not interested, in supporting a narrow non-breaking space unless Mongolian was encoded and needed the same for the purpose of appending suffixes (as opposed to separating vowels, which is performed by a similar space with another shaping behavior, and proper to Mongolian). A hypothesis that you firmly dissipated in the wake, but without answering my question about */why UTC was ignoring the demand for a narrow non-breaking space, delaying support for French and heavily impacting French implementations still today/* due to less font support than if that space were in Unicode from version?1.1 on. > >>> Perhaps you see why this issue has languished for so long: getting it right is not a simple matter. >>> >> Still it is as simple as not skipping PUNCTUATION SPACE when FIGURE SPACE was made non-breakable. Now we ended up with a mutated Mongolian Space that does not work properly for Mongolian, but does for French and other Latin script using languages. It would even more if TUS was blunter, urging all foundries to update their whole catalogue soon. > > ==> You realize that I'm giving you general advice here, not something utterly specific to NNBSP - I don't have the inputs and background to know whether your approach is feasible or perhaps the best possible? > It is not ?my approach?. Other List Members may wish to help you answer my questions. > > As for PUNCTUATION SPACE - some of the spaces have acquired usage in math (as part of the added math support in Unicode 3.2). We need to be sure that the assumptions about these that may have been made in math typesetting? are not invalidated. > That adds to the reasons why I?m asking why PUNCTUATION SPACE was not made non-breakable when FIGURE SPACE was. The math usage has probably originated in repurposing that space on the basis of it?s line breaking behavior. I don?t suggest to make it non-breakable now. That deal was broken and will remain broken. Now we must live with NNBSP and get more font support, while trying to stop Unicode from making a mess of it that neither helps Mongolian nor French nor all (other) locales grouping digits with a narrow space. > > Not sure offhand whether UTR#25 captures all of that, but if you ever feel like proposing a property change you MUST research that first (with the current maintainers of that UTR or other experts). > I have NOT proposed any property change, and PUNCTUATION SPACE or "2008" are NOT found in UTR #25 (Unicode Support for Mathematics). > > This is the way Unicode is different from CLDR. > Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 02:42:55 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 19 Jan 2019 00:42:55 -0800 Subject: NNBSP In-Reply-To: References: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com> Message-ID: <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Jan 19 02:58:27 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 19 Jan 2019 09:58:27 +0100 Subject: NNBSP In-Reply-To: References: <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <7d22eaca-2394-2ba0-233f-f831682ee92a@ix.netcom.com> <5edb058a-1808-3586-1cb6-4720706fd914@orange.fr> <9babf7ec-af68-f0f5-c04d-1cfaafdad8ac@orange.fr> Message-ID: <25b6b9c4-d994-599c-e798-a3798e04b1f1@orange.fr> On 19/01/2019 01:21, Shawn Steele wrote: > > *>> *If they are obsolete apps, they don?t use CLDR / ICU, as these are designed for up-to-date and fully localized apps. So one hassle is off the table. > > Windows uses CLDR/ICU.? Obsolete apps run on Windows.? That statement is a little narrowminded. > > >> I didn?t look into these date interchanges but I suspect they won?t use any thousands separator at all to interchange data. > > Nope > > >> The group separator is only for display and print > > Yup, and people do the wrong thing so often that I even blogged about it. https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/ > Thanks for sharing. As it happens, I like most the first reason you provide: * ?The most obvious reason is that there is a bug in the data and we had to make a change. (Believe it or not we make mistakes ;-))? In this case our users (and yours too) want culturally correct data, so we have to fix the bug even if it breaks existing applications.? No comment :) > >> Sorry you did skip this one: > > Oops, I did mean to respond to that one and accidentally skipped it. > No problem. > > >> What are all these expected to do while localized with scripts outside Windows code pages? > > (We call those ?unicode-only? locales FWIW) > Noted. > > The users that are not supported by legacy apps can?t use those apps (obviously).? And folks are strongly encouraged to write apps (and protocols) that Use Unicode (I?ve blogged about that too). > Like here: https://blogs.msdn.microsoft.com/shawnste/2009/06/01/writing-fields-of-data-to-an-encoded-file/ You?re showcasing that despite ?The moral here is ?Use Unicode??? some people are still not using it. The stuff gets even weirder as you state that code pages and Unicode are not 1:1, contradicting the Unicode design principle of roundtrip compatibility. The point in not using Unicode, and likewise in not using verbose formats, is limited hardware resources. Often new implementations are built on top of old machines and programs, for example in the energy and shipping industies. This poses a security threat, ending up in power outages and logistic breakdowns. That is making our democracies vulnerable. Hence maintaining obsolete systems does not pay back. We?re all better off when recycling all the old hardware and investing in latest technologies, implementing Unicode by the way. What you are advocating in this thread seems like a non-starter. > However, the fact that an app may run very poorly in Cherokee or whatever doesn?t mean that there aren?t a bunch of French enterprises that depend on that app for their day-to-day business. > They?re ill-advised in doing so (see above). > > In order for the ?unicode-only? locale users to use those apps, the app would need to be updated, or another app with the appropriate functionality would need to be selected. > To be ?selected?, not developed and built. The job is already done. 
What are people waiting for? > > However, that still doesn?t impact the current French users that are ?ok? with their current non-Unicode app.? Yes, I would encourage them to move to Unicode, however they tend to not want to invest in migration when they don?t see an urgent need. > They may not see it because they?re lacking appropriate training in cyber security. You seem to be backing that unresponsive behavior. I can?t see that you may be doing any good by doing so, and I?d strongly advise you to reach out to your customers, or check the issue with your managers. We?re in a time where companies are still making huge benefits, and it is unclear where all that money goes once paid out to shareholders. The money is there, you only need to market the security. That job would better use your time than tampering with legacy apps. > > Since Windows depends on CLDR and ICU data, updates to that data means that those customers can experience pain when trying to upgrade to newer versions of Windows.? We get those support calls, they don?t tend to pester CLDR. > Am I pestering CLDR? Keeping CLDR in synch is just the right way to go. Since we?re on it: Do you have any hints about why some powerful UTC members seem to hate NNBSP in French? I?m mainly talking about French punctuation spacing here. > > Which is why I suggested an ?opt-in? alt form that apps wanting ?civilized? behavior could opt-into (at least for long enough that enough badly behaved apps would be updated to warrant moving that to the default.) > Asmus Freytag?s proposal seems better: ?having information on "common fallbacks" would be useful. If formatting numbers, I may be free to pick the "best", but when parsing for numbers I may want to know what deviations from "best" practice I can expect.? Because if you let your customers ?opt in? instead of urging them to update, some will never opt in, given they?re not even ready to care about cyber security. > > The data for locales like French tends to have been very stable for decades.? Changes to data for major locales like that are more disruptive than to newer emerging markets where the data is undergoing more churn. > Happy for them. Ironically the old wealthy markets are digging the trap they?ll be caught in, instead of investing in cybersecurity. Best wishes, Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 03:51:46 2019 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 19 Jan 2019 10:51:46 +0100 Subject: NNBSP In-Reply-To: <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> References: <81f7b6cd-3a50-b876-e3d6-1f0965adfa83@gmail.com> <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com> <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> Message-ID: <2ff452c5-a7c1-f6d0-74ea-c08252470cad@orange.fr> On 19/01/2019 09:42, Asmus Freytag via Unicode wrote: > [?] > > For one, many worthwhile additions / changes to Unicode depend on getting written up in proposal form and then championed by dedicated people willing to see through the process. Usually, Unicode has so many proposals to pick from that at each point there are more than can be immediately accommodated. There's no automatic response to even issues that are "known" to many people. 
> > "Demands" don't mean a thing, formal proposals, presented and then refined based on feedback from the committee is what puts issues on the track of being resolved. > That is also what I suspected, that the French were not eager enough to get French supported, as opposed to the Vietnamese who lobbied long before the era of proposals and UTC meetings. Please,/where can we find the proposals for FIGURE SPACE to become non-breakable, and for PUNCTUATION SPACE to stay or become breakable?/ (That is not a rhetoric question. The ideal answer is a URL. Also, that is not about pre-Unicode documentation, but about the action that Unicode took in that era.) > > [?] > > Yes, I definitely used an IBM Selectric for many years with interchangeable type wheels, but I don't remember using proportional spacing, although I've seen it in the kinds of "typescript" books I mentioned. Some had that crude approximation of typesetting. > Thanks for reporting. > > When Unicode came out, that was no longer the state of the art as TeX and laser printers weren't limited that way. > > However, the character sets from which Unicode was assembled (or which it had to match, effectively) were designed earlier - during those times. And we inherited some things (that needed to be supported so round-trip mapping of data was possible) but that weren't as well documented in their particulars. > > I'm sure we'll eventually deprecate some and clean up others, like the Mongolian encoding (which also included some stuff that was encoded with an understanding that turned out less solid in retrospect than we had thought at the time). > > Something the UTC tries very hard to avoid, but nobody is perfect. It's best therefore to try not to ascribe non-technical motives to any action or inaction of the UTC. What outsiders see is rarely what actually went down, > That is because the meeting minutes would gain in being more explicit. > > and the real reasons for things tend to be much less interesting from an interpersonal? or intercultural perspective. > I don?t care about ?interesting? reasons. I?d just appreciate to know the truth. > > So best avoid that kind of topic altogether and never use it as basis for unfounded recriminations. > When you ask for knowing the foundations and that knowledge is persistently refused, you end up believing that those foundations just can?t be told. Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere, where it actually belongs to. I?d kindly request not to be considered a hypocrite that in reality keeps blaming the UTC. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 05:53:01 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 19 Jan 2019 11:53:01 +0000 Subject: NNBSP In-Reply-To: <2ff452c5-a7c1-f6d0-74ea-c08252470cad@orange.fr> References: <2c8e064a-97f9-8fa1-a9ad-29e133e0b10d@orange.fr> <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com> <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> <2ff452c5-a7c1-f6d0-74ea-c08252470cad@orange.fr> Message-ID: Marcel Schneider wrote, > When you ask for knowing the foundations and that knowledge is persistently refused, > you end up believing that those foundations just can?t be told. 
> > Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere, where it > actually belongs to. Why not think of it as a learning curve?? Early concepts and priorities were made from a lower position on that curve.? We can learn from the past and apply those lessons to the future, but a post-mortem seldom benefits the cadaver. Minutiae about decisions made long ago probably exist, but may be presently poorly indexed/organized and difficult to search/access. As the collection of encoding history becomes more sophisticated and the searching technology becomes more civilized, it may become easier to glean information from the archives. (OT - A little humor, perhaps... On the topic of Francophobia, it is true that some of us do not like dead generalissimos.? But most of us adore the French for reasons beyond Brigitte Bardot and bon-bons.? Cuisine, fries, dip, toast, curls, culture, kissing, and tarts, for instance.? Not to mention cognac and champagne!) From unicode at unicode.org Sat Jan 19 12:19:35 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Sat, 19 Jan 2019 18:19:35 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: Asmus Freytag wote: > This is an effort that's out of scope for Unicode to implement, or, I > should say, if the Consortium were to take it on, it would be a > separate technical standard from The Unicode Standard. I note what you say, but what concerns me is that there seem to be an increasing number of matters where things are being done and neither The Unicode Standard nor ISO/IEC 10646 include them but they are in side-documents just at the Unicode website. My understanding is that in some countries they will only use ISO/IEC 19646 and not relate (is that the word?) to Unicode. There are already issues over emoji ZWJ sequences that produce new meanings such as man ZWJ rocket producing the new meaning of astronaut and the 'base character plus tag characters' sequences to indicate a Welsh flag and a Scottish flag and if something is now done for italics (depending upon what it is that is done) the divergence between the two 'groups of documents' widens even if at a precise 'definition of scope' meaning ISO/IEC and The Unicode Standard do not diverge. > PS: I really hate the creeping expansion of pseudo-encoding via VS > characters. Well, a variation sequence character is being used for requesting emoji display (is that a control code?), so it seems there is no lack of precedent to use one for italics. It seems that someone only has to say 'out of scope' and then that is the veto for any consideration of a new idea for ISO/IEC 10646 or The Unicode Standard. There seems to be no way for a request to the committee to consider a widening of the scope to even be put before the committee if such a request is from someone outside the inner circle. > The only worse thing is adding novel control functions. For example? Would you be including things like changing the colour of the jacket that an emojiperson is wearing? 
It seems to me that it would be useful to have some codes that are ordinary characters in some contexts yet are control codes in others, for example for drawing simple line graphic diagrams within a document, such that they are just ordinary characters in a text document but, say, draw an image when included within a PDF (Portable Document Format) document. Their use would be optional, so that people who did not want to use them could just ignore them, and applications that did not use them as control codes could just display a glyph for each character. Yet there could be great possibilities for them if the chance to get them into ISO/IEC 10646 and The Unicode Standard were possible. William Overington Saturday 19 January 2019 From unicode at unicode.org Sat Jan 19 14:34:48 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 19 Jan 2019 20:34:48 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> On 2019-01-19 6:19 PM, wjgo_10009 at btinternet.com wrote: > It seems to me that it would be useful to have some codes that are > ordinary characters in some contexts yet are control codes in others, ... Italics aren't a novel concept. The approach for encoding new characters is that conventions for them exist and that people *are* exchanging them, people have exchanged them in the past, or that people demonstrably *need* to exchange them. Excluding emoji, any suggestion or proposal whose premise is "It seems to me that it would be useful if characters supporting ..." is doomed to be deemed out of scope for the standard. From unicode at unicode.org Sat Jan 19 15:17:34 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 19 Jan 2019 13:17:34 -0800 Subject: Encoding italic In-Reply-To: <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> Message-ID: <2dfeeebe-86b5-aeed-1b6e-3588ef2d654b@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 15:24:26 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 19 Jan 2019 13:24:26 -0800 Subject: NNBSP In-Reply-To: References: <20190116205305.213b335d@JRWUBU2> <71c41481-ff63-9dfa-45a2-8dfea581b936@ix.netcom.com> <20eb457f-9f23-6018-d1d0-44ae3fb7206f@orange.fr> <2c2bce74-02ee-aa9e-d17e-f3dba37c7425@ix.netcom.com> <67910e81-215c-209f-49ce-1b7422ed218d@ix.netcom.com> <0f2461c9-376a-1161-7c57-9b08ef5d2478@ix.netcom.com> <2ff452c5-a7c1-f6d0-74ea-c08252470cad@orange.fr> Message-ID: An HTML attachment was scrubbed...
URL: From unicode at unicode.org Sat Jan 19 19:18:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 20 Jan 2019 01:18:19 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> Message-ID: <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> Victor Gaultney wrote, > If however, we say that this "does not adequately consider the harm done > to the text-processing model that underlies Unicode", then that exposes a > weakness in that model. That may be a weakness that we have to accept for > a variety of reasons (technical difficulty, burden on developers, UI impact, > cost, maturity). Unicode's character encoding principles and underlying text-processing model remain robust.? They are the foundation of modern computer text processing.? The goal of ???? ???????? ??????????? needs to accommodate the best expectations of the end users and the fact that the consistent approach of the model eases the software people's burdens by ensuring that effective programming solutions to support one subset or range of characters can be applied to the other subsets of the Unicode repertoire.? And that those solutions can be shared with other developers in a standard fashion. Assigning properties to characters gives any conformant application clear instructions as to what exactly is expected as the app encounters each character in a string.? In simpler times, the only expectation was that the application would splat a glyph onto a screen (and/or sheet of paper) and store a binary string for later retrieval.? We've moved forward. 'Unicode encodes characters, not glyphs' is a core principle. There's a legitimate concern whenever anyone is perceived as heading into the general direction of turning the character encoding into a glyph registry, as it suggests a possible step backwards and might lead to a slippery slope.? For example, if italics are encoded, why not fraktur and Gaelic?? The notion that any given system can't be improved is static.? ("System" refers to Unicode's repertoire and coverage rather than its core principles.? Core principles are rock solid by nature.) ? /ne plus ultra/ ? "Conversely, significant differences in writing style for the same script may be reflected in the bibliographical classification?for example, Fraktur or Gaelic styles for the Latin script. Such stylistic distinctions are ignored in the Unicode Standard, which treats them as presentation styles of the Latin script."? Ken Whistler, http://unicode.org/reports/tr24/ ? "Static" can be interpreted as either virtually catatonic or radio noise.? Either is applicable here. From unicode at unicode.org Sat Jan 19 19:30:37 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Sun, 20 Jan 2019 02:30:37 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> Message-ID: (I have skipped some messages in this thread, so maybe the following has been pointed out already. Apologies for this message if so.) You will not like this... But... 
There is already a standardised, "character level" (well, it is from a character standard, though a more modern view would be that it is a higher level protocol) way of specifying italics (and bold, and underline, and more): \u001b[3mbla bla bla\u001b[0m Terminal emulators implement some such escape sequences. The terminal emulators I use support bold (1 after the [) but not italic (3). Every time you use the "man" command in a Linux/Unix/similar terminal you "use" the escape sequences for bold and underline... Other terminal-based programs often use bold as well as colour esc-sequences for emphasis as well as for warning/error messages, and other "hints" of various kinds. For xterm, see: https://www.xfree86.org/4.8.0/ctlseqs.html. So I don't see these esc-sequences becoming obsolete any time soon. But I don't foresee them being supported outside of terminal emulators either... (Though for style esc-sequences it would certainly be possible. And a "smart" cut-and-paste operation could auto-insert an esc-sequence that sets the style after the paste to the one before the paste...) Had HTML (somehow, magically) been invented before terminals, maybe terminals (terminal emulators) would have used some kind of "mini-HTML" instead. But things are like they are on that point. /Kent Karlsson PS The cut-and-paste I used here converts (imperfectly: bold is lost and a spurious ! inserted) to HTML (surely going through some internal attribute-based representation, the HTML being generated when I press send):

man(1)                                                          man(1)
NAME
       man - format and display the on-line manual pages
SYNOPSIS
       man [-acdfFhkKtwW] [--path] [-m system] [-p string] [-C config_file]
       [-M pathlist] [-P pager] [-B browser] [-H htmlpager] [-S section_list]
       [section] name ...

Den 2019-01-18 20:18, skrev "Asmus Freytag via Unicode" : > > > I would fully agree and I think Mark puts it really well in the message below > why some of the proposals brandished here are no longer plain text but > "not-so-plain" text. > > > I think we are better served with a solution that provides some form of > "light" rich text, for basic emphasis in short messages. The proper way for > this would be some form of MarkDown standard shared across vendors, and > perhaps implemented in a way that users don't necessarily need to type > anything special, but that, if exported to "true" plain text, it turns into > the source format for the "light" rich text. > > > This is an effort that's out of scope for Unicode to implement, or, I should > say, if the Consortium were to take it on, it would be a separate technical > standard from The Unicode Standard. > > > > A./ > > > PS: I really hate the creeping expansion of pseudo-encoding via VS characters. > The only worse thing is adding novel control functions. > > > > > > On 1/18/2019 7:51 AM, Mark E. Shoulson via Unicode wrote: > > >> On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote: >> >>> >>> Encoding 'begin italic' and 'end italic' would introduce difficulties when >>> partial strings are moved, etc. But that's no different than with current >>> punctuation. If you select the second half of a string that includes an end >>> quote character you end up with a mismatched pair, with the same problems of >>> interpretation as selecting the second half of a string including an 'end >>> italic' character. Apps have to deal with it, and do, as in code editors. >>> >>> >> It kinda IS different. If you paste in half a string, you get a mismatched >> or unmatched paren or quote or something. A typo, but a transient one.
It >> looks bad where it is, but everything else is unaffected. It's no worse than >> hitting an extra key by mistake. If you paste in a "begin italic" and miss >> the "end italic", though, then *all* your text from that point on is >> affected! (Or maybe "all until a newline" or some other stopgap ending, but >> that's just damage-control, not damage-prevention.) Suddenly, letters and >> symbols five words/lines/paragraphs/pages away look different, the pagination is >> all altered (by far more than merely a single extra punctuation mark, since >> italic fonts generally are narrower than roman). It's a disaster. >> >> No. This kind of statefulness really is beyond what Unicode is designed to >> cope with. Bidi controls are (almost?) the sole exception, and even they >> cause their share of headaches. Encoding separate _text_ italics/bold is IMO >> also a disastrous idea, but I'm not putting out reasons for that now. The >> only really feasible suggestion I've heard is using a VS in some fashion. >> (Maybe let it affect whole words instead of individual characters? Makes for >> fewer noisy VSs, but introduces a whole other host of limitations (how to >> italicize part of a word, how to italicize non-letters...) and is also just >> damage-control, though stronger.) >> >> >>> Apps (and font makers) can also choose how to deal with presenting strings >>> of text that are marked as italic. They can choose to present visual symbols >>> to indicate begin/end, such as /this/. Or they can present it using the >>> italic variant of the font, if available. >>> >>> >> At which point, you have invented markdown. Instead of making Unicode >> declare it, just push for vendors everywhere to recognize /such notation/ as >> italics (OK, I know, you want dedicated characters for it which can't be >> confused for anything else.) >> >> >> >>> - Those who develop plain text apps (social media in particular) don't have >>> to build in a whole markup/markdown layer into their apps >>> >>> >> With the complexity of writing a social media app, a markup layer is really >> the least of the concerns when it comes to simplifying. >> >>> >>> - Misuse of math chars for pseudo-italic would likely disappear >>> >>> - The text runs between markers remain intact, so they need no special >>> treatment in searching, selecting, etc. >>> >>> - It finally, and conclusively, would end the decades of the mess in HTML >>> that surrounds <i> and <em>. >>> >>> >> Adding _another_ solution to something will *never* "conclusively end" >> anything. On a good day, you can hope it will swamp the others, but they'll >> remain at least in legacy. More likely, it will just add one more way to be >> confused and another side to the mess. (People have pointed out here about >> the difficulties of distinguishing or not-distinguishing between HTML-level >> and putative plain-text italics. And yes, that is an issue, and one that >> already exists with styling that can change case and such. As with anything, >> the question is not whether there are going to be problems, but how those >> problems weigh against potential benefits. That's an open question.) >> >> >>> My main point in suggesting that Unicode needs these characters is that >>> italic has been used to indicate specific meaning - this text is somehow >>> special - for over 400 years, and that content should be preserved in plain >>> text. >>> >>> >> There is something to this: people have been *emphasizing* text in some >> fashion or another for ages.
There is room to call this plain text. >> >> ~mark >> >> >> > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 19 21:14:21 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 20 Jan 2019 03:14:21 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> Message-ID: <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> (In the event that a persuasive proposal presentation prompts the possibility of italics encoding...) Possible approaches include: 1 - Liberating the italics from the Members Only Math Club ...which has been an ongoing practice since they were encoded.? It already works, but the set is incomplete and the (mal)practice is frowned upon.? Many of the older "shortcomings" of the set can now be overcome with combining diacritics.? These italics decompose to ASCII. 2 - Character level Variation selectors work with today's tech.? Default ignorable property suggests that apps that don't want to deal with them won't.? Many see VS as pseudo-encoding.? Stripping VS leaves ASCII behind. 3 - Open/Close punctuation treatment Stateful.? Works on ranges.? Not currently supported in plain-text. Could be supported in applications which can take a text string URL and make it a clickable link.? Default appearance in nonsupporting apps may resemble existing plain-text italic kludges such as slashes.? The ASCII is already in the character string. 4 - Leave it alone This approach requires no new characters and represents the default condition.? ASCII. - Number 1 would require that anything not already covered would have to be eventually proposed and accepted, 2 would require no new characters at all, and 3 would require two control characters for starters. As "food for thought" questions, if a persuasive case is presented for encoding italics, and excluding 4, which approach would have the least impact on the rich-text world?? Which would have the least impact on existing plain-text technology?? Which would be least likely to conflict with Unicode principles/encoding model? From unicode at unicode.org Sat Jan 19 23:30:39 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 20 Jan 2019 05:30:39 +0000 Subject: Encoding italic In-Reply-To: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> Message-ID: <20190120053039.5d98f9a7@JRWUBU2> On Fri, 18 Jan 2019 10:51:18 -0500 "Mark E. Shoulson via Unicode" wrote: > On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote: > > > > Encoding 'begin italic' and 'end italic' would introduce > > difficulties when partial strings are moved, etc. But that's no > > different than with current punctuation. If you select the second > > half of a string that includes an end quote character you end up > > with a mismatched pair, with the same problems of interpretation as > > selecting the second half of a string including an 'end italic' > > character. Apps have to deal with it, and do, as in code editors. > > > It kinda IS different.? 
If you paste in half a string, you get a > mismatched or unmatched paren or quote or something.? A typo, but a > transient one.? It looks bad where it is, but everything else is > unaffected.? It's no worse than hitting an extra key by mistake. If > you paste in a "begin italic" and miss the "end italic", though, then > *all* your text from that point on is affected!? (Or maybe "all until > a newline" or some other stopgap ending, but that's just > damage-control, not damage-prevention.)? Suddenly, letters and > symbols five words/lines/paragraphs/pages look different, the > pagination is all altered (by far more than merely a single extra > punctuation mark, since italic fonts generally are narrower than > roman).? It's a disaster. The problem is worst when you have a small amount of italicisable text scattered within unitalicisable text. Unlike the case with bidi controls, the text usually remains intelligible with some work, and one can generally see where the missing italic should go. However, damage-limitation is desirable - I would suggest cancelling effects at the end of paragraph, as with bidi controls. On the other hand, the corresponding stateful ISCII character settings (for font effects and script) are ended at the end of line, which might be a finer concept. There are several stateful control characters for Arabic, mostly affecting numbers. However, as far as I can see, their effect is limited to one word (typically a string of digits). That seems too limited for italics, though it would be reasonable for switching between Antiqua and black letter. One minor problem with the stateful encoding, which seems to be in the original spirit of ISO 10646, is that redundant instances of the italic controls would build up in heavily edited text. I see that effect with ZWSP when I don't have a display mode that shows it. One solution would be for tricks such as "start italic" having a visible glyph in italic mode when the contrast between italic and non-italic mode is displayed. I don't believe italicity should be nested. However, such a build-up is a very minor problem. Richard. From unicode at unicode.org Sat Jan 19 23:49:04 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 20 Jan 2019 05:49:04 +0000 Subject: Encoding italic In-Reply-To: <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: <20190120054904.0e587666@JRWUBU2> On Sun, 20 Jan 2019 03:14:21 +0000 James Kass via Unicode wrote: > (In the event that a persuasive proposal presentation prompts the > possibility of italics encoding...) The use of italic script isn't just restricted to the Latin script, which includes base characters not supported by the mathematical sets for variables. It isn't hard to find their sober use in Thai - I found it in the first Thai magazine I flipped, where it was being used for quotations and names of publication, both Thai and English-language titles. > Possible approaches include: > > 1 - Liberating the italics from the Members Only Math Club Doesn't help with Thai. > 2 - Character level Works with Thai. > 3 - Open/Close punctuation treatment Works with Thai. > 4 - Leave it alone No change. Richard. 
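As a concrete illustration of why option 1 in the list Richard is replying to (the mathematical alphanumerics) only helps Latin and Greek, here is a minimal Python sketch. It is an illustration added for clarity, not anything proposed in the thread: it maps Basic Latin letters onto the Mathematical Italic block, and anything outside A-Z/a-z, such as Thai, simply passes through unchanged, which is exactly the limitation noted above.

    def to_math_italic(text):
        # Map A-Z and a-z onto the Mathematical Italic letters starting at
        # U+1D434 (capitals) and U+1D44E (small letters).
        out = []
        for ch in text:
            if 'A' <= ch <= 'Z':
                out.append(chr(0x1D434 + ord(ch) - ord('A')))
            elif 'a' <= ch <= 'z':
                # U+1D455 is a reserved "hole"; italic small h is U+210E PLANCK CONSTANT.
                out.append('\u210E' if ch == 'h' else chr(0x1D44E + ord(ch) - ord('a')))
            else:
                out.append(ch)  # Thai, Cyrillic, digits, punctuation: left as-is
        return ''.join(out)

    print(to_math_italic("emphasis here"))   # 𝑒𝑚𝑝ℎ𝑎𝑠𝑖𝑠 ℎ𝑒𝑟𝑒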
From unicode at unicode.org Sun Jan 20 04:35:19 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Sun, 20 Jan 2019 10:35:19 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: On Sun, 20 Jan 2019 at 03:16, James Kass via Unicode wrote: > > Possible approaches include: > > 3 - Open/Close punctuation treatment > Stateful. Works on ranges. Not currently supported in plain-text. > Could be supported in applications which can take a text string URL and > make it a clickable link. Default appearance in nonsupporting apps may > resemble existing plain-text italic kludges such as slashes. The ASCII > is already in the character string. A possibility that I don't think has been mentioned so far would be to use the existing tag characters (E0020..E007F). These are no longer deprecated, and as they are used in emoji flag tag sequences, software already needs to support them, and they should just be ignored by software that does not support them. The advantages are that no new characters need to be encoded, and they are flexible so that tag sequences for start/end of italic, bold, fraktur, double-struck, script, sans-serif styles could be defined. For example, start and end of italic styling could be defined as the tag sequences <i> and </i> (E003C E0069 E003E and E003C E002F E0069 E003E). Andrew From unicode at unicode.org Sun Jan 20 16:13:08 2019 From: unicode at unicode.org (=?utf-8?B?IkouwqBTLiBDaG9pIg==?= via Unicode) Date: Sun, 20 Jan 2019 17:13:08 -0500 Subject: Loose character-name matching In-Reply-To: <20190119005316.7fbb0469@JRWUBU2> References: <60797095-B703-4770-8F85-F045DDED4431@icloud.com> <20190119005316.7fbb0469@JRWUBU2> Message-ID: Thanks for the reply. These answers make sense. However, I am still confused by that passage from the Standard in § 4.8. To review, it says: "Because Unicode character names do not contain any underscore ("_") characters, a common strategy is to replace any hyphen-minus or space in a character name by a single "_" when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages. Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching." How is this system supposed to encode names with non-medial hyphens (or U+116C's name)? Many (most?) programming languages disallow both spaces and hyphens in identifiers. For instance, among the most-popular programming languages as ranked by TIOBE, *none* of them allow hyphens in identifiers as far as I can tell, and many of them (e.g., C, Python, MATLAB) do not allow *any* other ASCII identifier characters, including the dollar sign $. Does this mean that it would be impossible to create valid identifiers in these popular programming languages for characters with non-medial hyphens (or U+1180 HANGUL JUNGSEONG O-E), contrary to the Standard's claim in § 4.8? One system of making valid identifiers in those languages is to make the underscore equivalent to hyphen-minus and then use camel case on space-separated words.
For instance: hangulJungseongOE for U+116C HANGUL JUNGSEONG OE, hangulJungseongO_E for U+1180 HANGUL JUNGSEONG O-E, tibetanLetterA for U+0F68 TIBETAN LETTER A, tibetanLetter_A for U+0F60 TIBETAN LETTER -A. A second albeit clunky method is to make the double underscore equivalent to a space then hyphen-minus (or vice versa) and then use single underscores on space-separated words. For instance: Hangul_Jungseong_OE for U+116C HANGUL JUNGSEONG OE, Hangul_Jungseong_O__E for U+1180 HANGUL JUNGSEONG O-E, Tibetan_Letter_A for U+0F68 TIBETAN LETTER A, Tibetan_Letter__A for U+0F60 TIBETAN LETTER -A. Lastly, if the programming language allows the dollar sign $ to be in identifiers, as several such as Java and JavaScript do, then the dollar sign could be used instead of the underscore: hangulJungseongOE for U+116C HANGUL JUNGSEONG OE, hangulJungseongO$E for U+1180 HANGUL JUNGSEONG O-E, tibetanLetterA for U+0F68 TIBETAN LETTER A, tibetanLetter$A for U+0F60 TIBETAN LETTER -A. Or: Hangul_Jungseong_OE for U+116C HANGUL JUNGSEONG OE, Hangul_Jungseong_O$E for U+1180 HANGUL JUNGSEONG O-E, Tibetan_Letter_A for U+0F68 TIBETAN LETTER A, Tibetan_Letter_$A for U+0F60 TIBETAN LETTER -A. Unfortunately, the first and second systems are not compatible with loose matching as prescribed by UAX44-LM2, so I daresay that they are not what the Standard's claim in § 4.8 has in mind. (The second system also assumes that there are no two characters whose names differ only by switching the positions of a space and an adjacent hyphen, which cannot be guaranteed forever without a stability policy.) But the third system is not possible in numerous popular programming languages (C, Python, etc.). How is the Standard's system in § 4.8 supposed to encode names with non-medial hyphens (or U+116C's name)? Oh, wait, I get it. This system is not supposed to necessarily be compatible with standard loose matching. I had the impression that they were supposed to be compatible, but rereading the original paragraph shows that it doesn't actually mention loose matching, which is explained elsewhere in the chapter. That's unfortunate. Thanks again for your help. > On Jan 18, 2019, at 7:53 PM, Richard Wordingham via Unicode wrote: > > On Thu, 17 Jan 2019 18:44:50 -0500 > "J. S. Choi" via Unicode wrote: > >> I'm implementing a Unicode names library. I'm confused about loose >> character-name matching, even after rereading The Unicode Standard § >> 4.8, UAX #34 § 4, #44 § 5.9.2 -- as well as >> [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt >> ), >> [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035 >> ), and >> the [meeting in which those two items were >> resolved](https://www.unicode.org/L2/L2014/14026.htm >> ). >> >> In particular, I'm confused by the claim in The Unicode Standard § >> 4.8 saying, "Because Unicode character names do not contain any >> underscore ("_") characters, a common strategy is to replace any >> hyphen-minus or space in a character name by a single "_" when >> constructing a formal identifier from a character name. This strategy >> automatically results in a syntactically correct identifier in most >> formal languages. Furthermore, such identifiers are guaranteed to be >> unique, because of the special rules for character name matching." > > Unfortunately, the loose matching rules don't distinguish '__' and > '_'. Note that '__' is sometimes forbidden in identifiers. > >> I'm also confused by the relationship between UAX34-R3 and UAX44-LM2. >> >> To make these issues concrete, let's say that my library provides a >> function called getCharacter that takes a name argument, tries to >> find a loosely matching character, and then returns it (or a null >> value if there is no currently loosely matching character). So then >> what should the following expressions return? >> > Loose matching of names may be looser than prescribed; it shall not be > stricter. > >> getCharacter("HANGUL-JUNGSEONG-O-E") > U+1180 HANGUL JUNGSEONG O-E, or just possibly null. > >> getCharacter("HANGUL_JUNGSEONG_O_E") > U+116C HANGUL JUNGSEONG OE*
>> >> To make these issues concrete, let?s say that my library provides a >> function called getCharacter that takes a name argument, tries to >> find a loosely matching character, and then returns it (or a null >> value if there is no currently loosely matching character). So then >> what should the following expressions return? >> > Loose matching of names may be looser than prescribed; it shall not be > stricter. > >> getCharacter(?HANGUL-JUNGSEONG-O-E?) > U+1180 HANGUL JUNGSEONG O-E, or just possibly null. > >> getCharacter(?HANGUL_JUNGSEONG_O_E?) > U+116C HANGUL JUNGSEONG OE* > >> getCharacter(?HANGUL_JUNGSEONG_O_E_?) > U+116C > >> getCharacter(?HANGUL_JUNGSEONG_O__E?) > U+116C > >> getCharacter(?HANGUL_JUNGSEONG_O_-E?) > U+1180 > >> getCharacter(?HANGUL JUNGSEONGCHARACTERO E?) > null or U+116C - up to you. The sequence 'CHARACTER' shall not > distinguish names, but loose matching is not required to know this fact. > >> getCharacter(?HANGUL JUNGSEONG CHARACTER OE?) > null or U+116C - up to you. > >> getCharacter(?TIBETAN_LETTER_A?) > U+0F68 TIBETAN LETTER A > >> getCharacter(?TIBETAN_LETTER__A?) > U+0F68 TIBETAN LETTER A** > >> getCharacter(?TIBETAN_LETTER _A?) > U+0F68 > >> getCharacter(?TIBETAN_LETTER_-A?) > U+0F60 TIBETAN LETTER -A > > *This is unfortunate, as the usual symbolic name for U+1180 would be > HANGUL_JUNGSEONG_O_E. > > **This is also unfortunate, as the usual symbolic > name for U+0F60 would be TIBETAN_LETTER__A. > > The key problem here is that the hyphen after a space is required in > names as understood by the name property. The hyphen is also required > in "HANGUL JUNGSEONG O-E". The simple tactic is: > > 1) Canonicalise, by stripping out spaces, underscores and medial > hyphens and lowercasing. (It's probably better to fold the character > U+0131 LATIN SMALL LETTER I' to 'i'.) > > 2) Look the result up. > > 3) If you get the result U+116C but the input matches > ".*[oO]-[eE][_- ]*$", convert to U+1180. > > Symbolic identifiers in programs need not match the name; one may > choose to depend on the compiler or interpreter to catch duplicates; > some will, some won't. Replacing '-' by '_' to convert a name to an > identifier looses the distinction between a hyphen and an arbitrarily > inserted space, > > Richard. > From unicode at unicode.org Sun Jan 20 16:49:23 2019 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Sun, 20 Jan 2019 14:49:23 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: I think the real solution is for Twitter to just implement basic styling and make this a moot point. On Sun, Jan 20, 2019 at 2:37 AM Andrew West via Unicode wrote: > On Sun, 20 Jan 2019 at 03:16, James Kass via Unicode > wrote: > > > > Possible approaches include: > > > > 3 - Open/Close punctuation treatment > > Stateful. Works on ranges. Not currently supported in plain-text. > > Could be supported in applications which can take a text string URL and > > make it a clickable link. Default appearance in nonsupporting apps may > > resemble existing plain-text italic kludges such as slashes. The ASCII > > is already in the character string. 
> > A possibility that I don't think has been mentioned so far would be to > use the existing tag characters (E0020..E007F). These are no longer > deprecated, and as they are used in emoji flag tag sequences, software > already needs to support them, and they should just be ignored by > software that does not support them. The advantages are that no new > characters need to be encoded, and they are flexible so that tag > sequences for start/end of italic, bold, fraktur, double-struck, > script, sans-serif styles could be defined. For example start and end > of italic styling could be defined as the tag sequences and > (E003C E0069 E003E and E003C E002F E0069 E003E). > > Andrew > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 20 16:55:34 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 20 Jan 2019 22:55:34 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> On 2019-01-20 10:49 PM, Garth Wallace wrote: > I think the real solution is for Twitter to just implement basic > styling and make this a moot point. At which time it would only become a moot point for Twitter users.? There's also Facebook and other on-line groups.? Plus scholars and linguists.? And interoperability. From unicode at unicode.org Sun Jan 20 18:52:28 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sun, 20 Jan 2019 19:52:28 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: On 1/19/19 10:14 PM, James Kass via Unicode wrote: > > (In the event that a persuasive proposal presentation prompts the > possibility of italics encoding...) > Possible approaches include: > > 1 - Liberating the italics from the Members Only Math Club > ...which has been an ongoing practice since they were encoded.? It > already works, but the set is incomplete and the (mal)practice is > frowned upon.? Many of the older "shortcomings" of the set can now be > overcome with combining diacritics.? These italics decompose to ASCII. Provides italics the same way that ASCII provides letters.? You can use them with any alphabet you want, as long as it's Latin.? (Or Greek, true).? Essentially requires doubling of huge chunks of the Unicode repetoire. > 2 - Character level > Variation selectors work with today's tech.? Default ignorable > property suggests that apps that don't want to deal with them won't.? > Many see VS as pseudo-encoding.? Stripping VS leaves ASCII behind. This, or something like this, is IMO the only possibility that has any chance at all. > > As "food for thought" questions, if a persuasive case is presented for > encoding italics, and excluding 4, which approach would have the least > impact on the rich-text world?? 
Which would have the least impact on > existing plain-text technology?? Which would be least likely to > conflict with Unicode principles/encoding model? #2. ~mark From unicode at unicode.org Sun Jan 20 20:38:09 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Sun, 20 Jan 2019 18:38:09 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> Message-ID: On Sun, Jan 20, 2019 at 2:57 PM James Kass via Unicode wrote: > At which time it would only become a moot point for Twitter users. > There's also Facebook and other on-line groups. Plus scholars and > linguists. And interoperability. > How do you envision this working? In practice, English is still often limited to ASCII, because smart quotes and dashes aren't on the top-level of the keyboard, nor are accented characters. Adding italics to Unicode isn't going to change much if input tools don't support it, and keyboards aren't likely to change. Twitter and Facebook aren't going to change much if the apps and webpages don't provide a tool to mark italics. I don't see scholars and linguists demanding this. Scholars use markup languages that can annotate the details they need annotated, far more than just italics. Various dialects of SGML, XML and TeX do the job, not plain text. You've yet to demonstrate that interoperability is an actual problem. Modern operating systems have ways of copying rich text including italics around. Maybe it would have been better to have standardized rich text, either in Unicode or in a standard layer above Unicode, back in 1991. But that train has left; you're just going to complicate systems that currently handle and exchange rich text including italics. To expand on what Mark E. Shoulson said, to add new italics characters, you're going to need to not only copy all of Latin, but also Cyrillic (and reopen the whole Macedonian italics argument, where ?, ?, ?, ?, and ? are all different in italics from in Russian). But also, Chinese is sometimes put in italics (cf. http://multilingualtypesetting.co.uk/blog/chinese-italics-oblique-fonts/ ) even if that horrifies many people. That page argues for, among other solutions, using what's effectively bold instead of italics. So we're talking about reencoding all of Chinese at least once (for emphasis) or twice (for italics and bold). That's a clear no-go. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 20 23:42:31 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 20 Jan 2019 21:42:31 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sun Jan 20 23:49:13 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 20 Jan 2019 21:49:13 -0800 Subject: Encoding italic (was: A last missing link) In-Reply-To: <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> Message-ID: <67e80c6e-c1e9-66c2-ef44-290a05e1bffd@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 21 01:51:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 21 Jan 2019 07:51:19 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> Message-ID: <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Responding to David Starner, It?s true that most users can?t be troubled to take the extra time needed to insert any kind of special characters which aren?t covered by the keyboard.? Even the enthusiasts among us seldom take the trouble to include ?proper? quotes and apostrophes in e-mails ? even for posting to specialized lists such as this one where other members might notice and appreciate the extra effort involved.? Even though /we/ know how to do it and have software installed to help us do it. It?s also true that standard U.S. keyboards and drivers aren?t very helpful with diacritics.? Yet when we reply to list colleagues with surnames such as ?D?rst? or ?Bie??, we usually manage to get it right.? Sure, the ?reply? feature puts the surname into the response for us and the e-mail software adds the properly spelled names into our address books automatically.? But when we cite those colleagues in a post replying to some other list member, we typically take the time and trouble to write their names correctly.? Not only because we /can/, but because we /should/. > How do you envision this working? Splendidly!? (smile)? Social platforms, plain-text editors, and other applications do enhance their interfaces based on user demand from time to time.? User demand, at least on Twitter, seems established.? As pointed out previously in this discussion, that demand doesn?t seem to result in much ?Chicago style? text (although I have personally observed some) and may only be a passing fad /for Twitter users/.? When corporate interests aren't interested, third-party developers develop tools. > You've yet to demonstrate that interoperability is an actual problem. Copy/pasting from a web page into a plain-text editor removes any italics content, which is currently expected behavior.? Opinions differ as to whether that represents mere format removal or a loss of meaning.? Those who consider it as a loss of meaning would perceive a problem with interoperability. Consider superscript/subscript digits as a similar styling issue. The Wikipedia page for Romanization of Chinese includes information about the Wade-Giles system?s tone marks, which are superscripted digits. 
https://en.wikipedia.org/wiki/Romanization_of_Chinese Copy/pasting an example from the page into plain-text results in ?ma1, ma2, ma3, ma4?, although the web page displays the letters as italic and the digits as (italic) superscripts.? IMO, that?s simply wrong with respect to the superscript digits and suboptimal with respect to the italic letters. > To expand on what Mark E. Shoulson said, to add new italics characters, > you're going to need to not only copy all of Latin, but also Cyrillic ... I quite agree that expanding atomic italic encoding is off the table at this point.? (And that italicized CJK ideographs are daft.) From unicode at unicode.org Mon Jan 21 02:29:24 2019 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Mon, 21 Jan 2019 08:29:24 +0000 (GMT) Subject: Encoding italic References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: On 2019-01-21, James Kass via Unicode wrote: > Consider superscript/subscript digits as a similar styling issue. The > Wikipedia page for Romanization of Chinese includes information about > the Wade-Giles system?s tone marks, which are superscripted digits. > > https://en.wikipedia.org/wiki/Romanization_of_Chinese > > Copy/pasting an example from the page into plain-text results in ?ma1, > ma2, ma3, ma4?, although the web page displays the letters as italic and > the digits as (italic) superscripts.? IMO, that?s simply wrong with > respect to the superscript digits and suboptimal with respect to the > italic letters. Wade-Giles (which should be written with an en-dash, not a hyphen, if we're going to be fussy - as indeed Wikipedia is) is obsolete, but one could say the same about pinyin. However, printed pinyin with tones almost invariably uses the combining diacritics; in email where most people can't be bothered to write diacritics, tone numbers are written just as you have written above, with a following ascii digit. (With the proviso that Chinese speakers don't usually write tones at all when they write in pinyin.) They're often written like that even in web pages, where superscripts would be easy - see Victor Mair's frequent Language Log posts about Chinese writing and printing. This seems significantly less wrong to me that writing H2SO4 for H2SO4 which is also common in plain text... -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
From unicode at unicode.org Mon Jan 21 02:29:42 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Mon, 21 Jan 2019 00:29:42 -0800 Subject: Encoding italic In-Reply-To: <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: On Sun, Jan 20, 2019 at 11:53 PM James Kass via Unicode wrote: > Even though /we/ know how to do > it and have software installed to help us do it. You're emailing from Gmail, which has support for italics in email. The world has, in general, solved this problem. > > How do you envision this working? > > Splendidly! (smile) Social platforms, plain-text editors, and other > applications do enhance their interfaces based on user demand from time > to time. User demand, at least on Twitter, seems established. Then it would take six months, tops, for Twitter to produce and release a rich-text interface for Twitter. Far less time than waiting for Unicode to get around to it. > When corporate > interests aren't interested, third-party developers develop tools. Where are these tools? As I said, third-party developers could develop tools to convert a _underscore_ or /slash/ style italics to real italics and back without waiting on Twitter or Unicode. > Copy/pasting from a web page into a plain-text editor removes any > italics content, which is currently expected behavior. Opinions differ > as to whether that represents mere format removal or a loss of meaning. > Those who consider it as a loss of meaning would perceive a problem with > interoperability. Copy/pasting from a web page into a plain-text editor removes any pictures and destuctures tables, which definitely loses meaning. It also removes strike-out markup, which can have an even more dramatic effect on meaning than removing italics. As you pointed out below, it removes superscripts and subscripts; unless you wish to press for automatic conversion of those to Unicode, that's going to continue happening. It drops bold and font changes, and any number of other things that can carry meaning. > Copy/pasting an example from the page into plain-text results in ?ma1, > ma2, ma3, ma4?, although the web page displays the letters as italic and > the digits as (italic) superscripts. IMO, that?s simply wrong with > respect to the superscript digits and suboptimal with respect to the > italic letters. The superscripts show a problem with multiple encoding; even if you think they should be Unicode superscripts, and they look like Unicode superscripts, they might be HTML superscripts. Same thing would happen with italics if they were encoded in Unicode. -- Kie ekzistas vivo, ekzistas espero. 
From unicode at unicode.org Mon Jan 21 04:29:11 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 21 Jan 2019 10:29:11 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: <4922cf75-c0f2-7458-81ad-1e421a155c90@gmail.com> David Starner wrote, > You're emailing from Gmail, which has support for italics in email. But I compose e-mails in BabelPad, which has support for far more than italics in HTML mail.? And I'm using Mozilla Thunderbird to send and receive text e-mail via the Gmail account. And if I wanted to /display/ italics in a web page, I would create the source file in a plain-text editor.? (HTML mark-up is fairly easy to type with the ASCII keyboard.) If I compose a text file in BabelPad, it can be opened in many rich-text applications and the information survives intact.? Unless I am foolish enough to edit the file in the rich-text application and file-save it.? Because that mungs the plain-text file, and it can no longer be retrieved by the plain-text editor which created it. >> ...third-party... > > Where are these tools? BabelPad is an outstanding example.? Earlier in this discussion a web search found at least a handful of third-party tools devoted to liberating the math-alphas for Twitter users. > The superscripts show a problem with multiple encoding; even if you > think they should be Unicode superscripts, and they look like Unicode > superscripts, they might be HTML superscripts. Same thing would happen > with italics if they were encoded in Unicode. Hmmm.? Rich-text styled italics might be copied into other rich-text applications, but they cannot be copied into plain-text apps.? If Unicode-enabled italics existed, plain-text italics could be copy/pasted into either rich-text or plain-text applications and survive intact.? So Unicode-enabled italics would be interoperable. Anyone concerned about interoperability would be well advised to go with plain-text.? I am, so I do.? When I can. Kie eksistas fumo, tie eksistas fajro. From unicode at unicode.org Mon Jan 21 14:31:46 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 21 Jan 2019 13:31:46 -0700 Subject: Encoding italic Message-ID: <20190121133146.665a7a7059d7ee80bb4d670165c8327d.bc773d11ee.wbe@email03.godaddy.com> James Kass wrote: > Even the enthusiasts among us seldom take the trouble to include > ?proper? quotes and apostrophes in e-mails ? even for posting to > specialized lists such as this one where other members might notice > and appreciate the extra effort involved. Well, definitely not to this list, since the digest will clobber such characters (quod vide). 
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Jan 21 14:46:56 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 21 Jan 2019 13:46:56 -0700 Subject: Encoding italic (was: A last missing link) Message-ID: <20190121134656.665a7a7059d7ee80bb4d670165c8327d.7d41c065d6.wbe@email03.godaddy.com> Kent Karlsson wrote: > There is already a standardised, "character level" (well, it is from > a character standard, though a more modern view would be that it is > a higher level protocol) way of specifying italics (and bold, and > underline, and more): > > \u001b[3mbla bla bla\u001b[0m > > Terminal emulators implement some such escape sequences. And indeed, the forthcoming Unicode Technical Note we are going to be writing to supplement the introduction of the characters in L2/19-025, whether next year or later, will recommend ISO 6429 sequences like this to implement features like background and foreground colors, inverse video, and more, which are not available as plain-text characters. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Jan 22 00:40:52 2019 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Tue, 22 Jan 2019 07:40:52 +0100 Subject: Encoding italic In-Reply-To: References: <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: <20190122064052.dh2ofinavzflrx2f@angband.pl> On Mon, Jan 21, 2019 at 12:29:42AM -0800, David Starner via Unicode wrote: > On Sun, Jan 20, 2019 at 11:53 PM James Kass via Unicode > wrote: > > Even though /we/ know how to do > > it and have software installed to help us do it. > > You're emailing from Gmail, which has support for italics in email. ... and how exactly can they send italics in an e-mail? All they can do is to bundle a web page as an attachment, which some clients display instead of the main text. The e-mail's body text supports anything Unicode does, including ???????????? and even ?????? ????????????, but, remarkably, not italic umlauted characters, thai nor han. > > Splendidly! (smile) Social platforms, plain-text editors, and other > > applications do enhance their interfaces based on user demand from time > > to time. User demand, at least on Twitter, seems established. > > Then it would take six months, tops, for Twitter to produce and > release a rich-text interface for Twitter. Far less time than waiting > for Unicode to get around to it. Similar to many mail clients, Twitter does have a rich-text interface. It will present that rich-text as a link -- it will even has specific support to reduce the full URL to conserve the character count. But the primary interface is plain text, which unlike anything "rich" is interoperable with pretty much anything. > > Copy/pasting from a web page into a plain-text editor removes any > > italics content, which is currently expected behavior. Opinions differ > > as to whether that represents mere format removal or a loss of meaning. > > Those who consider it as a loss of meaning would perceive a problem with > > interoperability. > > Copy/pasting from a web page into a plain-text editor removes any > pictures and destuctures tables, which definitely loses meaning. 
> > It also removes strike-out markup, which can have an even more > dramatic effect on meaning than removing italics. As you pointed out > below, it removes superscripts and subscripts; unless you wish to > press for automatic conversion of those to Unicode, that's going to > continue happening. It drops bold and font changes, and any number of > other things that can carry meaning. Ie, any non-standard additions. There's a common base that's supposed to be interoperable, developed by a certain consortium -- and that base is pretty much guaranteed to work everywhere. Even if a specific display engine can't display some fancier elements, at least the underlying transport will transfer the text unmolested. There still are some issues here and there (like eg. people rejecting UCS2/UTF-16 on Windows which Microsoft insisted on, thus UTF-8 as system encoding is a new thing there and AFAIK even not the default yet AFAIK) -- but pretty much we're there. Last holdouts of ancient encodings are dying fast. There's a need to agree on a boundary between "this is what all means of interchange are supposed to support" and "fancy client-specific markup", and Unicode served at defining the former admirably. Meow! -- ??????? ??????? Remember, the S in "IoT" stands for Security, while P stands ??????? for Privacy. ??????? From unicode at unicode.org Tue Jan 22 11:52:36 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Tue, 22 Jan 2019 17:52:36 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <65dfde71.ee61.16876ac0481.Webtop.70@btinternet.com> References: <20190121134656.665a7a7059d7ee80bb4d670165c8327d.7d41c065d6.wbe@email03.godaddy.com> <2072693815.420691.1548178041006.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <65dfde71.ee61.16876ac0481.Webtop.70@btinternet.com> Message-ID: <4b230aec.ee71.16876b147ff.Webtop.70@btinternet.com> Doug Ewell wrote: > And indeed, the forthcoming Unicode Technical Note we are going to be writing to supplement the introduction of the characters in L2/19-025, whether next year or later, will recommend ISO 6429 sequences like this to implement features like background and foreground colors, inverse video, and more, which are not available as plain-text characters. Back in the late 1980s I had the opportunity for some time, from time to time, to use a colour terminal that was attached to a mainframe computer as if it were just another basic terminal attached to a mainframe. So it could be used just as a basic terminal attached to a mainframe, and it was often used in that manner. Yet it also responded to Escape sequences which enabled it to do colour graphics, with, as best I remember now, commands to choose a colour and draw lines and so on. I note with interest Doug's suggestion to use Escape routines. However, these days systems tend to be more complicated at the underlying platform level and there is often communication between systems and so on and I wonder whether using Escape codes as such might be prone to strange problems in some circumstances before getting to the emulator software. With various platforms in common use I am wondering whether there might be problems in some cases. Maybe there is no issue and everything would be fine, yet I opine that that possibility of problems need to be looked at. 
I wonder if a new character, say U+FFF6, in the Specials section, could be defined that could be regarded as just an ordinary printing character in many circumstances yet as having exactly the same meaning as the Escape character in some circumstances, such as in an emulator. If that were done then the desired result could be achieved in a carefully structured manner rather than risk clashes over effectively sometimes trying to use the Escape character in two ways at the same time, perhaps with one of the ways being deep in the operating system and one in the terminal emulator with the way deep in the operating system usually winning. William Overington Tuesday 22 January 2019 From unicode at unicode.org Tue Jan 22 17:26:09 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Wed, 23 Jan 2019 00:26:09 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: <20190121134656.665a7a7059d7ee80bb4d670165c8327d.7d41c065d6.wbe@email03.godaddy.com> Message-ID: Ok. One thing to note is that escape sequences (including control sequences, for those who care to distinguish those) probably should be "default ignorable" for display. Requiring, or even recommending, them to be default ignorable for other processing (like sorting, searching, and other things) may be a tall order. So, for display, (maximal) substrings that match: \u001B[\u0020-\002F]*[\u0030-\007E]| (\u001B'['|\009B)[\u0030-\003F]*[\u0020-\002F]*[\u0040-\007E] should be default ignorable (i.e. invisible, but a "show invisibles" mode would show them; not interpreted ones should be kept, even if interpreted ones need not, just (re)generated on save). That is as far as Unicode should go. Some may be interpreted, this thread focuses on italic, but also bold and underlined. There is a whole bunch of "style" control sequences (those that have "m" at the end of the sequence) specified, and terminal emulators implement several of them, but not all. As for editing, if "style" control sequences ? la ISO 6429 were to be supported in text editors, I would NOT expect users to type in those escape/control sequences in any way, but use "ctrl/command-i" (etc.) or menu commands as editors do now, and the representation as esc-sequences be kept under wraps (and maybe only present in files, not in the internal representation during editing), and not seen unless one starts to analyse the byte sequences in files. So, even if you don't like this esc-sequence business: 1) It would not be seen by most users, mostly by programmers (the same goes for other ways of representing this, be it HTML, .doc, or whatever. 2) It is already standardised, and one can make (a slightly inaccurate) argument that it is "plain text". What one would need to do is: 1) Prioritise which "style" control sequences should be interpreted (rather than be ignored). 2) Lobby to "plain" text editor makers to support those styles, representing them (in files) as standard control sequences. A selection of already standardised style codes (i.e., for control sequences that end in ?m?): 0 default rendition (implementation-defined) 1 bold (2 lean) 22 normal intensity (neither bold nor lean) 3 italicized 23 not italicized (i.e. 
upright) 4 singly underlined (21 doubly underlined) 24 not underlined (neither singly nor doubly) (9 crossed-out (strikethrough)) (29 not crossed out) If you really want to go for colour as well (RGB values in 0?255) (colour is popular in terminal emulators...): (30-37 foreground: black, red, green, yellow, blue, magenta, cyan, white) 38 foreground colour as RGB. Next arguments 2;r;g;b 39 default foreground colour (implementation-defined) (40-47 background: black, red, green, yellow, blue, magenta, cyan, white) 48 background colour as RGB. Next arguments 2;r;g;b 49 default background colour (implementation-defined) There are some more (including some that assume a small font palette, for changing font). But far enough for now. Maybe too far already. But do not allow interpreting multiple style attribute codes in one control sequence; quite unnecessary. /Kent K Den 2019-01-21 21:46, skrev "Doug Ewell via Unicode" : > Kent Karlsson wrote: > >> There is already a standardised, "character level" (well, it is from >> a character standard, though a more modern view would be that it is >> a higher level protocol) way of specifying italics (and bold, and >> underline, and more): >> >> \u001b[3mbla bla bla\u001b[0m >> >> Terminal emulators implement some such escape sequences. > > And indeed, the forthcoming Unicode Technical Note we are going to be > writing to supplement the introduction of the characters in L2/19-025, > whether next year or later, will recommend ISO 6429 sequences like this > to implement features like background and foreground colors, inverse > video, and more, which are not available as plain-text characters. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > From unicode at unicode.org Tue Jan 22 18:16:40 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 23 Jan 2019 00:16:40 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> Message-ID: <20190123001640.39964074@JRWUBU2> On Mon, 21 Jan 2019 00:29:42 -0800 David Starner via Unicode wrote: > The superscripts show a problem with multiple encoding; even if you > think they should be Unicode superscripts, and they look like Unicode > superscripts, they might be HTML superscripts. Same thing would happen > with italics if they were encoded in Unicode. But if one strips the mark-up out, and searching is then based on the collation elements of the text, then this is not a problem. Mathematical and ASCII capitals differ only at the identity level. Searching on the basis of codepoint sequences would come unstuck with scriptio continua scripts - WJ and ZWSP can be optionally inserted to improve line-breaking, and even to overcome spell-checkers. Richard. 
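To make Kent Karlsson's suggestion concrete, here is a minimal Python sketch, not taken from any message in this thread: it emits a few of the standardised SGR parameters he lists (3/23 for italic, 1/22 for bold, 4/24 for underline, 0 for reset) as ISO 6429 control sequences, and strips such sequences using a pattern along the lines of the one in his message, so that a display target that does not interpret them can treat them as default ignorable. The helper names are invented for this sketch, and the control-sequence alternative is placed first so that the longer match wins under Python's first-match alternation.

import re

ESC = "\x1b"
CSI = ESC + "["      # 7-bit Control Sequence Introducer; U+009B is the 8-bit form

# A few of the standardised SGR (Select Graphic Rendition) parameters:
SGR = {"reset": 0, "bold": 1, "italic": 3, "underline": 4,
       "not_bold": 22, "not_italic": 23, "not_underlined": 24}

def sgr(*params):
    # Build a control sequence such as ESC [ 3 m ("\x1b[3m") for italic.
    return CSI + ";".join(str(p) for p in params) + "m"

def italic(text):
    # "\x1b[3mbla bla bla\x1b[0m", as in the example quoted by Doug Ewell above.
    return sgr(SGR["italic"]) + text + sgr(SGR["reset"])

# Reconstruction of the pattern from Kent Karlsson's message, assuming the
# intended \u00XX escapes: a control sequence (ESC '[' or U+009B, parameter
# bytes, intermediate bytes, final byte), or a plain escape sequence
# (ESC, intermediate bytes, final byte).
IGNORABLE = re.compile(
    "(?:\u001b\\[|\u009b)[\u0030-\u003f]*[\u0020-\u002f]*[\u0040-\u007e]"
    "|\u001b[\u0020-\u002f]*[\u0030-\u007e]"
)

def strip_controls(text):
    # What a renderer that does not interpret the sequences would display.
    return IGNORABLE.sub("", text)

styled = italic("bla bla bla")
print(repr(styled))            # '\x1b[3mbla bla bla\x1b[0m'
print(strip_controls(styled))  # bla bla bla
# RGB colour is also standardised: sgr(38, 2, 255, 0, 0) gives a red foreground.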
From unicode at unicode.org Tue Jan 22 21:43:29 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 23 Jan 2019 03:43:29 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: Nobody has really addressed Andrew West's suggestion about using the tag characters. It seems conformant, unobtrusive, requiring no official sanction, and could be supported by third-partiers in the absence of corporate interest if deemed desirable. One argument against it might be:? Whoa, that's just HTML.? Why not just use HTML?? SMH One argument for it might be:? Whoa, that's just HTML!? Most everybody already knows about HTML, so a simple subset of HTML would be recognizable. After revisiting the concept, it does seem elegant and workable. It would provide support for elements of writing in plain-text for anyone desiring it, enabling essential (or frivolous) preservation of editorial/authorial intentions in plain-text. Am I missing something?? (Please be kind if replying.) On 2019-01-20 10:35 AM, Andrew West wrote: > A possibility that I don't think has been mentioned so far would be to > use the existing tag characters (E0020..E007F). These are no longer > deprecated, and as they are used in emoji flag tag sequences, software > already needs to support them, and they should just be ignored by > software that does not support them. The advantages are that no new > characters need to be encoded, and they are flexible so that tag > sequences for start/end of italic, bold, fraktur, double-struck, > script, sans-serif styles could be defined. For example start and end > of italic styling could be defined as the tag sequences and > (E003C E0069 E003E and E003C E002F E0069 E003E). > > Andrew From unicode at unicode.org Tue Jan 22 20:24:59 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 22 Jan 2019 18:24:59 -0800 Subject: Encoding italic In-Reply-To: <20190123001640.39964074@JRWUBU2> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <4019ad4a-978d-a1ab-e0d0-73c8e9c2b4ef@gmail.com> <2048d64e-d754-3640-79c6-f992971bf8e9@gmail.com> <20190123001640.39964074@JRWUBU2> Message-ID: On Tue, Jan 22, 2019 at 4:18 PM Richard Wordingham via Unicode wrote: > On Mon, 21 Jan 2019 00:29:42 -0800 > David Starner via Unicode wrote: > > > The superscripts show a problem with multiple encoding; even if you > > think they should be Unicode superscripts, and they look like Unicode > > superscripts, they might be HTML superscripts. Same thing would happen > > with italics if they were encoded in Unicode. > > But if one strips the mark-up out, and searching is then based on > the collation elements of the text, then this is not a problem. > Mathematical and ASCII capitals differ only at the identity level. Searching is not the only problem. Copying the data will reveal the same problem. 
Not only that, there was a previous argument that searching with Unicode italics would let you find titles of books and such separately from other usage of the phrase. That's not going to work if they're based on the collation elements and ignore the italics. Which also brings up the question of, if this is so important, why can't we search for italicized data in web pages right now? For anyone interacting with a web-browser that folds searching, this will change nothing, until if and when italics-sensitive searching is made available by the web-browser, which is not depending on Unicode supporting italics. There are programs that extract titles from text files; I suspect the programmers are most happy working with text formats that mark up titles as titles, not italics. In systems that just mark up italics, translating whatever form of italics marking is used is much easier than separating italicized titles from other forms of italics. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Wed Jan 23 20:07:48 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 23 Jan 2019 21:07:48 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> On 1/19/19 1:19 PM, wjgo_10009 at btinternet.com via Unicode wrote: > > Well, a variation sequence character is being used for requesting > emoji display (is that a control code?), so it seems there is no lack > of precedent to use one for italics. It seems that someone only has to > say 'out of scope' and then that is the veto for any consideration of > a new idea for ISO/IEC 10646 or The Unicode Standard. There seems to > be no way for a request to the committee to consider a widening of the > scope to even be put before the committee if such a request is from > someone outside the inner circle. You make it sound like there's been invented some magical incantation that *anyone* can use to quash all discussion on a particular (your) topic.? It doesn't just take someone saying "out of scope."? It also has to *be* out of scope!? If someone chants the incantation, but I can persuasively argue that no, it IS in scope, then the spell fails.? Requesting the scope of Unicode be widened is not like other discussions being had here, so it makes sense that it should be treated differently, if treated at all. There were discussions and agreements made as to the scope of Unicode, long ago.? And just like you can't petition to change a character name, no matter how wrong it is, asking the Unicode consortium to redefine itself on your say-so is not going to be taken seriously either.? Out of scope means just that: it isn't something we're discussing.? Discussing how to change the scope so that whatever-it-is IS in scope is a very large undertaking, and would need a tremendous groundswell of support from all the major stakeholders in Unicode, so you should probably start there.? Get Microsoft and Google and various national bodies on your side, not just to say "um, ok, maybe," but to actively argue with you that the scope needs to be changed.? Or that there needs to be, as Asmus says, another, supplemental standard.? 
Raise popular support, write petitions, get signatures, all that fun stuff. "But so many of the people I would want to talk to about this are right here on this list!" you say?? Be that as it may, it doesn't mean the list has to grant you a platform.? Change the world on your own dime. > > It seems to me that it would be useful to have some codes that .... See, once you start a proposal like that, you're already looking down the wrong end of the Unicode scope.? This is exactly what Asmus (I think) said in a quote I can't seem to find, repeating it for the n+1st time: Unicode isn't here to encode cool new ideas that would be cool and new.? It's here for writing what people already do.? You want a standard that does something else?? That's another thing.? It's as appropriate to demand that Unicode support these things as it would be to go to OSHA or the Bureau of Weights and Measures or the Acad?mie Fran?aise and tell them you want some new letters... ~mark From unicode at unicode.org Wed Jan 23 20:08:31 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 23 Jan 2019 21:08:31 -0500 Subject: Encoding italic In-Reply-To: <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <19faf5d9-bfea-b732-f940-b937c852d5d8@gmail.com> Message-ID: On 1/19/19 3:34 PM, James Kass via Unicode wrote: > > On 2019-01-19 6:19 PM, wjgo_10009 at btinternet.com wrote: > > > It seems to me that it would be useful to have some codes that are > > ordinary characters in some contexts yet are control codes in > others, ... > > Italics aren't a novel concept.? The approach for encoding new > characters is that? conventions for them exist and that people *are* > exchanging them, people have exchanged them in the past, or that > people demonstrably *need* to exchange them. > > Excluding emoji, any suggestion or proposal whose premise is "It seems > to me that? it would be useful if characters supporting that>..." is doomed to be deemed out of scope for the standard. This was the quote I had been looking for, sorry James and Asmus.? It isn't the first time it's been pointed out here. ~mark From unicode at unicode.org Wed Jan 23 20:21:39 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 23 Jan 2019 21:21:39 -0500 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: Message-ID: <42afafc1-a0ab-e1f7-5954-371f174603d1@kli.org> On 1/22/19 6:26 PM, Kent Karlsson via Unicode wrote: > Ok. One thing to note is that escape sequences (including control sequences, > for those who care to distinguish those) probably should be "default > ignorable" for display. Requiring, or even recommending, them to be default > ignorable for other processing (like sorting, searching, and other things) > may be a tall order. So, for display, (maximal) substrings that match: > > \u001B[\u0020-\002F]*[\u0030-\007E]| > (\u001B'['|\009B)[\u0030-\003F]*[\u0020-\002F]*[\u0040-\007E] > > should be default ignorable (i.e. invisible, but a "show invisibles" mode > would show them; not interpreted ones should be kept, even if interpreted > ones need not, just (re)generated on save). That is as far as Unicode > should go. 
So it isn't just "these characters should be default ignorable", but "this regular expression is default ignorable."? This gets back to "things that span more than a character" again, only this time the "span" isn't the text being styled, it's the annotation to style it.? The "bash" shell has special escape-sequences (\[ and \]) to use in defining its prompt that tell the system that the text enclosed by them is not rendered and should not be counted when it comes to doing cursor-control and line-editing stuff (so you put them around, yep, the escape sequences for coloring or boldfacing or whatever that you want in your prompt). That would seem to be at least simpler than a big ol' regexp, but really not that much of an improvement.? It also goes to show how things like this require all kinds of special handling, even/especially in a "simple" shell prompt (which could make a strong case for being "plain text", though, yes, terminal escape codes are a thing.) ~mark From unicode at unicode.org Wed Jan 23 20:32:55 2019 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 23 Jan 2019 21:32:55 -0500 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: There is something deliciously simple, elegant... and kinda... rebellious? about doing this.? And it wouldn't even be in purview of Unicode.? "Yep, my HTML-renderer treats characters E0020..E007F just exactly the same 0020..007F, 'cept that it won't render 'em."? And you can send HTML text that looks for all the world like plain text to any normal Unicode-conformant viewer.? Now, the security issues of being able to write "invisible" JavaScript, or rather, Yet Another way you need to look at and reveal possible code, are a headache for someone else.? Viewed like this, you might do better taking this suggestion to W3C and having them amend the HTML/XML specs so that E0020..E007F are non-rendering synonyms for 0020..007F.? It wouldn't be a Unicode thing anymore, just changing the definition of HTML.? (I'm not saying it would be a GOOD idea, mind you.) ~mark On 1/22/19 10:43 PM, James Kass via Unicode wrote: > > Nobody has really addressed Andrew West's suggestion about using the > tag characters. > > It seems conformant, unobtrusive, requiring no official sanction, and > could be supported by third-partiers in the absence of corporate > interest if deemed desirable. > > One argument against it might be:? Whoa, that's just HTML.? Why not > just use HTML?? SMH > > One argument for it might be:? Whoa, that's just HTML!? Most everybody > already knows about HTML, so a simple subset of HTML would be > recognizable. > > After revisiting the concept, it does seem elegant and workable. It > would provide support for elements of writing in plain-text for anyone > desiring it, enabling essential (or frivolous) preservation of > editorial/authorial intentions in plain-text. > > Am I missing something?? (Please be kind if replying.) > > On 2019-01-20 10:35 AM, Andrew West wrote: > >> A possibility that I don't think has been mentioned so far would be to >> use the existing tag characters (E0020..E007F). 
These are no longer >> deprecated, and as they are used in emoji flag tag sequences, software >> already needs to support them, and they should just be ignored by >> software that does not support them. The advantages are that no new >> characters need to be encoded, and they are flexible so that tag >> sequences for start/end of italic, bold, fraktur, double-struck, >> script, sans-serif styles could be defined. For example start and end >> of italic styling could be defined as the tag sequences and >> (E003C E0069 E003E and E003C E002F E0069 E003E). >> >> Andrew From unicode at unicode.org Thu Jan 24 05:50:49 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Thu, 24 Jan 2019 11:50:49 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: On Thu, 24 Jan 2019 at 02:10, Mark E. Shoulson via Unicode wrote: > > Unicode isn't here to encode cool new ideas that would be cool and > new. It's here for writing what people already do. http://www.unicode.org/L2/L2018/18141r2-emoji-colors.pdf "Add 14 colored emoji characters for decorative and/or descriptive uses. These may be used to indicate that an emoji has a different color." No evidence has been provided that anybody is currently using colored blobs for this purpose (in fact emoji users have explicitly rejected this method for indicating emoji color: http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an assertion that it would be a good idea if emoji users could add a colored swatch to an existing emoji to indicate what color they want it to represent (note that the colored characters do not change the color of the emoji they are attached to [before or after, depending upon whether you are speaking French or English dialect of emoji], they are just intended as a visual indication of what colour you wish the emoji was). This proposal to add 14 additional colored circles, squares and hearts is a perfect example of a cool new idea for something that the authors think would be really useful, but for which there is no evidence of existing use. The UTC should have rejected it as out of scope, but we all know that rules and procedures do not apply to the Emoji Subcommittee, so in fact this cool new idea will be included in Unicode 12 in March. Andrew From unicode at unicode.org Thu Jan 24 07:56:53 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 24 Jan 2019 13:56:53 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: Andrew West wrote, > ... 
> http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an > assertion that it would be a good idea if emoji users could add a > colored swatch to an existing emoji to indicate what color they want > it to represent (note that the colored characters do not change the > color of the emoji they are attached to [before or after, depending > upon whether you are speaking French or English dialect of emoji], > they are just intended as a visual indication of what colour you wish > the emoji was). In order to simplify emoji processing, these should be stored in the data stream in logical order.? Whether these cool new characters become reordrant color blobs or not would depend upon language.? So, what we'd need is some way of indicating language in plain-text. Some kind of tagging mechanism. FAICT, the emoji repertoire is vendor-driven, just as the pre-Unicode emoji sets were vendor driven.? Pre-Unicode, if a vendor came up with cool ideas for new emoji they added new characters to the PUA.? Now that emoji are standardized, when vendors come up with new ideas they put them in the emoji ranges in order to preserve the standardization factor and ensure interoperability.? (That's probably over-simplified and there are bound to be other factors involved.) We should no more expect the conventional Unicode character encoding model to apply to emoji than we should expect the old-fashioned text ranges to become vendor-driven. From unicode at unicode.org Thu Jan 24 08:49:59 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Thu, 24 Jan 2019 14:49:59 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: On Thu, 24 Jan 2019 at 13:59, James Kass via Unicode wrote: > > FAICT, the emoji repertoire is vendor-driven, just as the pre-Unicode > emoji sets were vendor driven. Pre-Unicode, if a vendor came up with > cool ideas for new emoji they added new characters to the PUA. Now that > emoji are standardized, when vendors come up with new ideas they put > them in the emoji ranges in order to preserve the standardization factor > and ensure interoperability. (That's probably over-simplified and there > are bound to be other factors involved.) I do not believe that recent (post-6.0) emoji additions are vendor-driven. There is no formal vendor representation on the ESC, and most ESC members do not work for vendors. Current emoji additions are driven by ordinary users, who are actively encouraged by the UTC to propose novel characters for encoding: http://blog.unicode.org/2018/04/submissions-open-for-2020-emoji.html http://blog.unicode.org/2016/09/emoji-deadline.html The vendors happily lap up whatever emojis the UTC throws at them, but they seem to have little interest in taking control of the emoji process. > We should no more expect the conventional Unicode character encoding > model to apply to emoji than we should expect the old-fashioned text > ranges to become vendor-driven. Why should we not expect the conventional Unicode character encoding mode to apply to emoji? We were told time and time again when emoji were first proposed that they were required for encoding for interoperability with Japanese telecoms whose usage had spilled over to the internet. 
At that time there was no suggestion that encoding emoji was anything other than a one-off solution to a specific problem with PUA usage by different vendors, and I at least had no idea that emoji encoding would become a constant stream with an annual quota of 60+ fast-tracked user-suggested novelties. Maybe that was the hidden agenda, and I was just na?ve. The ESC and UTC do an appallingly bad job at regulating emoji, and I would like to see the Emoji Subcommittee disbanded, and decisions on new emoji taken away from the UTC, and handed over to a consortium or committee of vendors who would be given a dedicated vendor-use emoji plane to play with (kinda like a PUA plane with pre-assigned characters with algorithmic names [VENDOR-ASSIGNED EMOJI XXXXX] which the vendors can then associate with glyphs as they see fit; and as emoji seem to evolve over time they would be free to modify and reassign glyphs as they like because the Unicode Standard would not define the meaning or glyph for any characters in this plane). Andrew From unicode at unicode.org Thu Jan 24 09:42:37 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 24 Jan 2019 15:42:37 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> Andrew West wrote, > Why should we not expect the conventional Unicode character encoding > mode to apply to emoji? Remember when William Overington used to post about encoding colours, sometimes accompanied by novel suggestions about how they could be encoded or referenced in plain-text? Here's a very polite reply from John Hudson from 2000, http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html ...and, over time, many of the replies to William Overington's colorful suggestions were less than polite.? But it was clear that colors were out-of-scope for a computer plain-text encoding standard. So I don't expect the conventional model to apply to emoji because it didn't; if it had, they'd not have been encoded.? Since they're in there, the conventional model does not apply.? Of course, the conventions have changed along with the concept of what's acceptable in plain-text. Since emoji are an open-ended evolving phenomenon, there probably has to be a provision for expansion.? Any idea about them having been a finite set overlooked the probability of open-endedness and the impracticality of having only the original subset covered in plain-text while additions would be banished to higher level protocols. Thank you for the information about current emoji additions being unrelated to vendors.? I have to confess that I haven't kept up-to-date on the emoji. Maybe I should have said that emoji are fan-driven. 
From unicode at unicode.org Thu Jan 24 04:47:36 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Thu, 24 Jan 2019 10:47:36 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: <877518274.400362.1548324826544.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <336725219.103976.1547919568662.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <877518274.400362.1548324826544.JavaMail."wjgo_10009@btinternet.com"@be62.bt.int.cpcloud.co.uk> Message-ID: <49f5f750.844.1687f78e899.Webtop.228@btinternet.com> Mark E. Shoulson wrote: > It doesn't just take someone saying "out of scope." It depends who it is. The theory is that people post in the mailing list as individuals, yet some people have very great influence. > It also has to *be* out of scope! Maybe, it depends who says what. > If someone chants the incantation, but I can persuasively argue that > no, it IS in scope, then the spell fails. Well, that may work for you, it does not work for me. Decision is by an unnamed gatekeeper and the Unicode Technical Committee does not get to discuss it, and discussing whether it is in scope or not is not allowed on the mailing list, because discussion of the topic is permanently banned. > Requesting the scope of Unicode be widened is not like other > discussions being had here, so it makes sense that it should be > treated differently, if treated at all. Well, it does not make sense to me. If benefit could be produced by widening the scope of Unicode in some way, then it seems that it should be allowed to be discussed in the mailing list. And even if rejected at some time then still be allowed to be discussed at some future time as things may have changed. > There were discussions and agreements made as to the scope of Unicode, > long ago. Yes. Yet surely decisions made long ago should not lock out all progress as new ideas come along. > And just like you can't petition to change a character name, no matter > how wrong it is, asking the Unicode consortium to redefine itself on > your say-so is -not going to be taken seriously either. Well, to me it is not like that. Yes, "a character name, no matter how wrong it is," is part of the stability guarantee and cannot be changed. Adding U+FFF7 as a base character for a tag digit sequence to uniquely and interoperably and stably define a code for a specific meaning for a localizable sentence would not, as far as I am aware, break any stability guarantees for Unicode. That might widen the scope of Unicode or it might be within the present scope, yet either way if it would be of benefit to end users then it would be reasonable to consider the idea and not block its discussion: and it is not a matter of my say-so at all, putting forward an idea for fair consideration is not at all the same as dictating that something should be done on someone's say-so. Was the scope of Unicode widened for emoji? First of all emoji were encoded for compatibility, but the Unicorn Face changed all that and now it an annual "could be useful" exercise of generating new characters based on people's ideas. For the avoidance of doubt I am not against that at all, it is fun and hopefully will continue. 
I appreciate that the particular tag sequences to follow U+FFF7 might not be encoded by Unicode Inc., they might be encoded by an ISO committee, such as ISO/TC 37. Yet encoding U+FFF7 as the base character would allow a link as interoperable plain text rather than needing to use what amounts to a markup system. Yet please remember that Unicode Inc. has defined and published base character plus tag sequences for the some flags, including the Welsh flag and the Scottish flag. Recently I was informed that they are not part of The Unicode Standard nor part of ISO/IEC 10646. It appears that a Unicode Technical Note is being prepared with recommendations of how to express teletext control characters using Unicode characters, possibly using Escape sequences. So a Unicode Inc. publication listing numbers and meanings together with a context guide for each to help translation of meanings for a localization file of code numbers and sentences into a target language seems not unreasonable. As an example, the vertical line used as a separator, as a comma might be used within the sentence itself, so not using a comma as a separator of fields. 812|Would you like to go to the day room? Not all codes would be three digits, some would be longer. Codes where the first three digits are all different from the other two digits are three digits long. Codes where the first and third digit are the same have a length of 3 plus the value of the third digit. So, for example, codes starting 313 are six digits long and are a set of localizable sentences intended primarily for seeking information through the language barrier about relatives and friends after a disaster. The third digit being zero allows for even longer code numbers. > Discussing how to change the scope so that whatever-it-is IS in scope > is a very large undertaking, ? Not necessarily. If the Unicode Technical Committee were to consider a proposal and, after consideration and discussion were to agree to proceed, it could all be done within a short discussion at a Unicode Technical Committee meeting and then the recommendation sent to the ISO committee. I am not saying that it should be or that it will be, I am just trying to say that it is not necessarily a very large undertaking. The Unicode Technical Committee discusses many things. > ? and would need a tremendous groundswell of support from all the > major stakeholders in Unicode, ? Quite possibly. And if there were discussion in the Unicode mailing list and the topic came up at a Unicode Technical Committee meeting that might happen. > ?, so you should probably start there. Well, they meet at the Unicode Technical Committee meetings, so that is where I consider that the matter should be discussed. The problem is, it is not possible for me at present to get such a suggestion before the committee because it gets blocked and it cannot be discussed in the Unicode mailing list because the topic is permanently banned. > "But so many of the people I would want to talk to about this are > right here on this list!" you say? Be that as it may, it doesn't mean > the list has to grant you a platform. That is very true. Unicode Inc. has no obligation whatsoever to allow me to post my ideas in the Unicode mailing list and no obligation whatsoever to consider my ideas for progress at the Unicode Technical Committee. 
I find it quite ironic that if this idea were implemented then demonstrations of what the system could do would be a marvellous example of what is possible in displaying the languages of the world using Unicode. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_025.pdf > Change the world on your own dime. Well, I had not met that expression before but I have had a search and I think that I understand your meaning. I am doing what I can. I am retired, at home, with a laptop computer with some budget software (yet very good software with which I can make fonts and publish PDF documents), an internet connection, and a small personal webspace hosted by a United Kingdom Public Limited Company for a small annual fee, so it is safe to access, it is not a server based on my home computer, I upload over the internet (it is a legacy webspace from a free-with-dial-up-internet-access webspace dating from 1997 after a takeover then another takeover, after the dial-up facility was closed yet I was allowed to keep the webspace with same original address.) For example, as well as producing some scientific publications, I am writing a novel, chapters 1 ..72, 75, 80, 81 all written, published on the web for free reading and legal-deposited with the British Library. http://www.users.globalnet.co.uk/~ngo/novel.htm If just browsing through, Chapters 34, 42 and 51 are good places to start browsing. > "Unicode isn't here to encode cool new ideas that would be cool and > new. It's here for writing what people already do. " That may have been true once, and maybe that is still the theory, but the continual encoding of new emoji just does not fit that! I did at one time, a few years ago, consider trying to formulate localizable sentences as emoji, each with a square glyph, but I changed from that when I realized that emoji do not have precise meanings yet a very important aspect of localizable sentences is that each one has a very precise meaning and is grammatical independent. > It's as appropriate to demand that Unicode support these things ? One of the problems I get is the Aunt Sally suggestion, not only here but in posts from others, that I am demanding anything. I am a researcher and I would like to put my ideas forward for sensible discussion. I am asking for consideration of my ideas please, I have not, and am not, demanding anything at all. When people start making out that I am making demands it is very prejudicial and, I consider, very unfair. By the way, I have been put on moderated post so please do not reply to the list unless you get a copy of this as from me via Unicode. I write this because I am not seeking to bypass the moderator's decision as if Unicode Inc. does not want any discussion of localizable sentences in its mailing list that is its right so to choose. William Overington Thursday 24 January 2019 From unicode at unicode.org Thu Jan 24 09:06:49 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Thu, 24 Jan 2019 15:06:49 +0000 (GMT) Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> Message-ID: <7918bdc9.d847.16880663a2b.Webtop.73@btinternet.com> Andrew West wrote as follows: > ? 
(note that the colored characters do not change the color of the > emoji they are attached to [before or after, depending upon whether > you are speaking French or English dialect of emoji], they are just > intended as a visual indication of what colour you wish the emoji > was). I thought that the idea was that they could possibly be used for glyph substitution with an appropriate font, so that there could be, for example, a glyph of a polar bear. I produced a proposal for some characters specifically intended each as a colour modifier character. http://www.unicode.org/L2/L2018/18198-colour-mod-chars.pdf I know that the document was once on the agenda for a UTC meeting but was not mentioned in the minutes, so I do not know whether consideration of the best plain text way to express a request for a particular colour for an emoji is still taking place and my document is just one of several possibilities being considered. William Overington Thursday 24 January 2019 ------ Original Message ------ From: "Andrew West via Unicode" To: "Mark E. Shoulson" Cc: "Unicode Discussion" Sent: Thursday, 2019 Jan 24 At 11:50 Subject: Re: Encoding italic (was: A last missing link) On Thu, 24 Jan 2019 at 02:10, Mark E. Shoulson via Unicode wrote: > > Unicode isn't here to encode cool new ideas that would be cool and > new. It's here for writing what people already do. http://www.unicode.org/L2/L2018/18141r2-emoji-colors.pdf "Add 14 colored emoji characters for decorative and/or descriptive uses. These may be used to indicate that an emoji has a different color." No evidence has been provided that anybody is currently using colored blobs for this purpose (in fact emoji users have explicitly rejected this method for indicating emoji color: http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an assertion that it would be a good idea if emoji users could add a colored swatch to an existing emoji to indicate what color they want it to represent (note that the colored characters do not change the color of the emoji they are attached to [before or after, depending upon whether you are speaking French or English dialect of emoji], they are just intended as a visual indication of what colour you wish the emoji was). This proposal to add 14 additional colored circles, squares and hearts is a perfect example of a cool new idea for something that the authors think would be really useful, but for which there is no evidence of existing use. The UTC should have rejected it as out of scope, but we all know that rules and procedures do not apply to the Emoji Subcommittee, so in fact this cool new idea will be included in Unicode 12 in March. Andrew From unicode at unicode.org Thu Jan 24 09:54:29 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Thu, 24 Jan 2019 15:54:29 +0000 Subject: Encoding italic (was: A last missing link) In-Reply-To: <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> Message-ID: On Thu, 24 Jan 2019 at 15:42, James Kass wrote: > > Here's a very polite reply from John Hudson from 2000, > http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html > ...and, over time, many of the replies to William Overington's colorful > suggestions were less than polite. 
But it was clear that colors were > out-of-scope for a computer plain-text encoding standard. Going off topic a little, I saw this tweet from Marijn van Putten today which shows examples of Arabic script from early Quranic manuscripts with phonetic information indicated by the use of red and green dots: https://twitter.com/PhDniX/status/1088171783461703682 I would be interested to know how those should be represented in Unicode. Andrew From unicode at unicode.org Thu Jan 24 10:24:07 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Thu, 24 Jan 2019 18:24:07 +0200 Subject: Encoding italic (was: A last missing link) In-Reply-To: References: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> Message-ID: <20190124162407.GA2703@macbook.localdomain> On Thu, Jan 24, 2019 at 03:54:29PM +0000, Andrew West via Unicode wrote: > On Thu, 24 Jan 2019 at 15:42, James Kass wrote: > > > > Here's a very polite reply from John Hudson from 2000, > > http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html > > ...and, over time, many of the replies to William Overington's colorful > > suggestions were less than polite. But it was clear that colors were > > out-of-scope for a computer plain-text encoding standard. > > Going off topic a little, I saw this tweet from Marijn van Putten > today which shows examples of Arabic script from early Quranic > manuscripts with phonetic information indicated by the use of red and > green dots: > > https://twitter.com/PhDniX/status/1088171783461703682 > > I would be interested to know how those should be represented in Unicode. It is possible to represent this by use of color fonts. The green (sometimes golden) dots are the hamza, the red ones are various vowel marks. A color font would use colored glyphs for these instead of the modern shapes. I did a color fonts that does a similar thing (but still use the modern forms) and it is on my to do list to do a font using archaic Kufi forms. Regards, Khaled From unicode at unicode.org Thu Jan 24 10:33:39 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 24 Jan 2019 16:33:39 +0000 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> Message-ID: <59ecdeef-5b8f-19be-2649-fe9e38d1a020@gmail.com> > Maybe I should have said emoji are fan-driven. That works.? Here's the previous assertion rephrased: ? We should no more expect the conventional Unicode character encoding ? model to apply to emoji than we should expect the old-fashioned text ? ranges to become fan-driven. And if we don't want the text ranges to become fan driven, as pointed out by Martin D?rst and others, we take a cautious and conservative approach to moving forward with the standard. Veering back on-topic, the anti fan driven aversion doesn't apply to encoding italics, although /fans/ would benefit.? There's pre-existing conventions for italics, and a scholar with the credentials of Victor Gaultney should be able to make a credible proposal for encoding them.? I hope we haven't overwhelmed him with a surplus of rhetoric. 
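Andrew West's tag-character suggestion, quoted by James Kass and Mark Shoulson earlier in this thread, is also easy to prototype. The sketch below (Python, with invented helper names) maps ASCII markup such as "<i>" onto the corresponding tag characters at U+E0020..U+E007E, giving exactly the E003C E0069 E003E and E003C E002F E0069 E003E sequences from his message, and shows what a renderer that simply ignores unrecognised tag runs would display. The last lines build the flag-of-Wales tag sequence that William Overington mentions as an existing use of the same mechanism.

TAG_BASE = 0xE0000   # tag characters mirror ASCII at U+E0000 plus the code point

def to_tags(markup):
    # '<i>' -> U+E003C U+E0069 U+E003E, as in Andrew West's example.
    return "".join(chr(TAG_BASE + ord(c)) for c in markup)

def from_tags(text):
    # Map tag characters back to visible ASCII, for inspecting a string.
    return "".join(
        chr(ord(c) - TAG_BASE) if 0xE0020 <= ord(c) <= 0xE007E else c
        for c in text
    )

def display(text):
    # What a renderer that ignores unsupported tag sequences would show.
    return "".join(c for c in text if not (0xE0000 <= ord(c) <= 0xE007F))

ITALIC_ON, ITALIC_OFF = to_tags("<i>"), to_tags("</i>")
sample = "an " + ITALIC_ON + "italic" + ITALIC_OFF + " word"

print(display(sample))     # an italic word   (the markup is invisible)
print(from_tags(sample))   # an <i>italic</i> word

# The flag of Wales already works this way: U+1F3F4 BLACK FLAG followed by
# the tags for 'gbwls' and U+E007F CANCEL TAG.
wales = "\U0001F3F4" + to_tags("gbwls") + "\U000E007F"

Whether tag characters ought to carry styling at all is exactly what the thread disputes; the point of the sketch is only that no new characters would be needed to try it.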
From unicode at unicode.org Thu Jan 24 16:42:59 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 24 Jan 2019 22:42:59 +0000 Subject: Encoding italic In-Reply-To: <20190124162407.GA2703@macbook.localdomain> References: <00e946cc-1d3a-6e16-6c25-dccf755f9f73@kli.org> <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> <20190124162407.GA2703@macbook.localdomain> Message-ID: <20190124224259.54ec3e28@JRWUBU2> On Thu, 24 Jan 2019 18:24:07 +0200 Khaled Hosny via Unicode wrote: > On Thu, Jan 24, 2019 at 03:54:29PM +0000, Andrew West via Unicode > wrote: >> On Thu, 24 Jan 2019 at 15:42, James Kass >> wrote: >>> Going off topic a little, I saw this tweet from Marijn van Putten >>> today which shows examples of Arabic script from early Quranic >>> manuscripts with phonetic information indicated by the use of red >>> and green dots: >>> >>> https://twitter.com/PhDniX/status/1088171783461703682 >> I would be interested to know how those should be represented in >> Unicode. > It is possible to represent this by use of color fonts. The limitations of rendering technology should not be an argument against an encoding. We have characters that differ only in their properties, such as word-breaking and line-breaking. In this case, it may be argued that their colours apply only to their 'plain' colouring. Who determines what their colour should be in blue text? (Font technology seems to dictate that their colour is unaffected by the choice of foreground colour.) Richard. From unicode at unicode.org Thu Jan 24 17:00:10 2019 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Fri, 25 Jan 2019 01:00:10 +0200 Subject: Encoding italic In-Reply-To: <20190124224259.54ec3e28@JRWUBU2> References: <555b0341-137c-bbb4-d77b-a5bb85ba9d1b@ix.netcom.com> <77c75982-d16e-fba8-fc96-9146523233ab@kli.org> <35032127-df39-3e54-d106-573932dbfc4b@gmail.com> <20190124162407.GA2703@macbook.localdomain> <20190124224259.54ec3e28@JRWUBU2> Message-ID: <20190124230010.GB2703@macbook.localdomain> On Thu, Jan 24, 2019 at 10:42:59PM +0000, Richard Wordingham via Unicode wrote: > On Thu, 24 Jan 2019 18:24:07 +0200 > Khaled Hosny via Unicode wrote: > > > On Thu, Jan 24, 2019 at 03:54:29PM +0000, Andrew West via Unicode > > wrote: > >> On Thu, 24 Jan 2019 at 15:42, James Kass > >> wrote: > > >>> Going off topic a little, I saw this tweet from Marijn van Putten > >>> today which shows examples of Arabic script from early Quranic > >>> manuscripts with phonetic information indicated by the use of red > >>> and green dots: > >>> > >>> https://twitter.com/PhDniX/status/1088171783461703682 > > >> I would be interested to know how those should be represented in > >> Unicode. > > > It is possible to represent this by use of color fonts. > > The limitations of rendering technology should not be an argument > against an encoding. We have characters that differ only in their > properties, such as word-breaking and line-breaking. They are already encoded, in their modern uncolored form. Some of the modern forms like U+06E5 ARABIC SMALL WAW, U+06E5 ARABIC SMALL WAW, etc. were even specifically ?invented? in the previous century to overcome the impracticality of printing in multiple colors, so the colored and uncolored forms are different representations of the same underlying characters. > In this case, it may be argued that their colours apply only to their > 'plain' colouring. 
Who determines what their colour should be in blue > text? (Font technology seems to dictate that their colour is > unaffected by the choice of foreground colour.) The colors don?t change, the vowel marks are always red, the hamza is always green/yellow. From unicode at unicode.org Thu Jan 24 17:46:35 2019 From: unicode at unicode.org (Kent Karlsson via Unicode) Date: Fri, 25 Jan 2019 00:46:35 +0100 Subject: Encoding italic (was: A last missing link) In-Reply-To: <42afafc1-a0ab-e1f7-5954-371f174603d1@kli.org> Message-ID: Den 2019-01-24 03:21, skrev "Mark E. Shoulson via Unicode" : > On 1/22/19 6:26 PM, Kent Karlsson via Unicode wrote: >> Ok. One thing to note is that escape sequences (including control sequences, >> for those who care to distinguish those) probably should be "default >> ignorable" for display. Requiring, or even recommending, them to be default >> ignorable for other processing (like sorting, searching, and other things) >> may be a tall order. So, for display, (maximal) substrings that match: >> >> \u001B[\u0020-\002F]*[\u0030-\007E]| >> (\u001B'['|\009B)[\u0030-\003F]*[\u0020-\002F]*[\u0040-\007E] >> >> should be default ignorable (i.e. invisible, but a "show invisibles" mode >> would show them; not interpreted ones should be kept, even if interpreted >> ones need not, just (re)generated on save). That is as far as Unicode >> should go. > > So it isn't just "these characters should be default ignorable", but > "this regular expression is default ignorable."? This gets back to > "things that span more than a character" again, only this time the > "span" isn't the text being styled, it's the annotation to style it.? True. That is how ECMA/ISO/ANSI escape/control-sequences are designed. Had they not already been designed, and implemented, but we were to do a design today, it would surely be done differently; e.g. having "controls" that consisted only of (individually) "default-ignorable" characters. But, and this is the important thing here: a) The current esc/control-sequences is an accepted standard, since long. b) This standard is still in very much active use, albeit mostly by terminal emulators. But the styling stuff need not at all be limited to terminal emulators. Since it is an actively and widely used standard, I don't see the point of trying to design another way of specifying "default ignorable"-controls for text styling. (HTML, for instance, does not have "default ignorable" controls, since ALL characters in the "controls" are printable characters, so one needs a "second level" for parsing the controls.) True, ignoring or interpreting an esc/control-sequence requires some processing of substrings, since some (all but the first) are printable characters. But not that hard. It has been implemented over and over... Had this standard been defunct, then there would be an opportunity to design something different. > The "bash" shell has special escape-sequences (\[ and \]) to use in > defining its prompt that tell the system that the text enclosed by them > is not rendered and should not be counted when it comes to doing Never heard of. Cannot find any reference mentioning them. Reference? > cursor-control and line-editing stuff (so you put them around, yep, the > escape sequences for coloring or boldfacing or whatever that you want in > your prompt). Line editing stuff in bash is done on an internal buffer (there is a library for doing this, and that library can be used by various other command line programs; bash does not use the system input line editing). 
Then that library tries to show what is in the buffer on the terminal. So, I'm not sure what you are talking about; bash does NOT (somehow) scrape the screen (terminal emulator window). Furthermore, colouring and bold/underline is quite common not only in prompts, but also in output directed at a terminal from various programs. (And it works just fine.) Unfortunately cut-and-paste tends to loose much (or all) of that. (Would be nicer if it got converted to HTML, RTF, .doc, or whatever is the target format; or just nicely kept if "plain text" is the target.) > That would seem to be at least simpler than a big ol' > regexp, but really not that much of an improvement.? It also goes to > show how things like this require all kinds of special handling, > even/especially in a "simple" shell prompt (which could make a strong > case for being "plain text", though, yes, terminal escape codes are a > thing.) They are NOT "terminal escape codes". It is just that, for now, it is just about only terminal emulator that implement esc/control-sequences. >From https://www.ecma-international.org/publications/standards/Ecma-048.htm: "The control functions are intended to be used embedded in character-coded data for interchange, in particular with character-imaging devices." A (plain) text editor is an example of a 'character-imaging device'. (Yes, the terminology is a bit dated.) /Kent K > > ~mark From unicode at unicode.org Thu Jan 24 23:44:21 2019 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Thu, 24 Jan 2019 21:44:21 -0800 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: On Wed, Jan 23, 2019 at 1:27 AM James Kass via Unicode wrote: > > Nobody has really addressed Andrew West's suggestion about using the tag > characters. > > It seems conformant, unobtrusive, requiring no official sanction, and > could be supported by third-partiers in the absence of corporate > interest if deemed desirable. > > One argument against it might be: Whoa, that's just HTML. Why not just > use HTML? SMH > > One argument for it might be: Whoa, that's just HTML! Most everybody > already knows about HTML, so a simple subset of HTML would be recognizable. > > After revisiting the concept, it does seem elegant and workable. It > would provide support for elements of writing in plain-text for anyone > desiring it, enabling essential (or frivolous) preservation of > editorial/authorial intentions in plain-text. > > Am I missing something? (Please be kind if replying.) > There is also RFC 1896 "enriched text", which is an attempt at a lightweight HTML substitute for styling in email. But these, and the ANSI escape code suggestion, seem like they're trying to solve the wrong problem here. 
Here's how I understand the situation: * Some people using forms of text or mostly-text communication that do not provide styling features want to use styling, for emphasis or personal flair * Some of these people caught on to the existence of the "styled" mathematical alphanumerics and, not caring that this is "wrong", started using them as a workaround * The use of these symbols, which are not technically equivalent to basic Latin, make posts inaccessible to screen readers, among other problems These are suggestions for Unicode to provide a different, more "acceptable" workaround for a lack of functionality in these social media systems (this mostly seems to be an issue with Twitter; IME this shows up much less on Facebook). But the root problem isn't the kludge, it's the lack of functionality in these systems: if Twitter etc. simply implemented some styling on their own, the whole thing would be a moot point. Essentially, this is trying to add features to Twitter without waiting for their development team. Interoperability is not an issue, since in modern computers copying and pasting styled text between apps works just fine. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 00:34:10 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 24 Jan 2019 22:34:10 -0800 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> Message-ID: <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 01:14:40 2019 From: unicode at unicode.org (Tex via Unicode) Date: Thu, 24 Jan 2019 23:14:40 -0800 Subject: Encoding italic In-Reply-To: <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> Message-ID: <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> I am surprised at the length of this debate, especially since the arguments are repetitive? That said: Twitter was offered as an example, not the only example just one of the most ubiquitous. Many messaging apps and other apps would benefit from italics. The argument is not based on adding italics to twitter. Most apps today have security protections that filter or translate problematic characters. If the proposal would cause ?normalization? problems, adding the proposed characters to the filter lists or substitution lists would not be a big burden. The biggest burden would be to the apps that would benefit, to add italicizing and editing capabilities. tex From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag via Unicode Sent: Thursday, January 24, 2019 10:34 PM To: unicode at unicode.org Subject: Re: Encoding italic On 1/24/2019 9:44 PM, Garth Wallace via Unicode wrote: But the root problem isn't the kludge, it's the lack of functionality in these systems: if Twitter etc. 
simply implemented some styling on their own, the whole thing would be a moot point. Essentially, this is trying to add features to Twitter without waiting for their development team. Interoperability is not an issue, since in modern computers copying and pasting styled text between apps works just fine. Yep, that's what this is: trying to add features to some platforms that could very simply be added by the respective developers while in the process causing a normalization issue (of sorts) everywhere else. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 01:25:12 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Thu, 24 Jan 2019 23:25:12 -0800 Subject: Encoding italic In-Reply-To: <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> Message-ID: On 1/24/2019 11:14 PM, Tex wrote: > > I am surprised at the length of this debate, especially since the > arguments are repetitive? > > That said: > > Twitter was offered as an example, not the only example just one of > the most ubiquitous. Many messaging apps and other apps would benefit > from italics. The argument is not based on adding italics to twitter. > > Most apps today have security protections that filter or translate > problematic characters. If the proposal would cause ?normalization? > problems, adding the proposed characters to the filter lists or > substitution lists would not be a big burden. > > The biggest burden would be to the apps that would benefit, to add > italicizing and editing capabilities. > The "normalization" is when you import to rich text, you don't want competing formatting instructions. Getting styled character codes normalized to styling of character runs is the most difficult, that's why the abuse of math italics really is abuse in terms of interoperability. Other schemes, like a VS per code point, also suffer from being different in philosophy from "standard" rich text approaches. Best would be as standard extension to all the messaging systems (e.g. a common markdown language, supported by UI). A./ > tex > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag via Unicode > *Sent:* Thursday, January 24, 2019 10:34 PM > *To:* unicode at unicode.org > *Subject:* Re: Encoding italic > > On 1/24/2019 9:44 PM, Garth Wallace via Unicode wrote: > > But the root problem isn't the kludge, it's the lack of > functionality in these systems: if Twitter etc. simply implemented > some styling on their own, the whole thing would be a moot point. > Essentially, this is trying to add features to Twitter without > waiting for their development team. > > Interoperability is not an issue, since in modern computers > copying and pasting styled text between apps works just fine. > > Yep, that's what this is: trying to add features to some platforms > that could very simply be added by the? respective developers while in > the process causing a normalization issue (of sorts) everywhere else. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... 
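To make the normalization point above concrete: the math-alphanumeric workaround does not survive compatibility normalization. A small Python 3 sketch (the string and the helper name are made up for illustration):

    import unicodedata

    # Spell "Italic" with MATHEMATICAL ITALIC letters (capitals start at
    # U+1D434, small letters at U+1D44E).  Naive: a small "h" would need
    # U+210E PLANCK CONSTANT instead, but that gap is not hit here.
    def math_italic(word):
        return "".join(
            chr(0x1D434 + ord(c) - ord("A")) if c.isupper()
            else chr(0x1D44E + ord(c) - ord("a"))
            for c in word)

    styled = math_italic("Italic")
    plain = unicodedata.normalize("NFKC", styled)

    print(plain)             # -> "Italic": the pseudo-italic styling is gone
    print(styled == plain)   # -> False

Any process that applies NFKC (identifier checks, search, some storage layers) silently folds the "styled" letters back to ASCII, which is one face of the interoperability problem described above.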
URL: From unicode at unicode.org Fri Jan 25 05:59:09 2019 From: unicode at unicode.org (David Starner via Unicode) Date: Fri, 25 Jan 2019 03:59:09 -0800 Subject: Encoding italic In-Reply-To: <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> Message-ID: On Thu, Jan 24, 2019 at 11:16 PM Tex via Unicode wrote: > Twitter was offered as an example, not the only example just one of the most ubiquitous. Many messaging apps and other apps would benefit from italics. The argument is not based on adding italics to twitter. And again, color me skeptical. If italics are just added to Unicode and not to the relevant app or interface, they will not see much use, in the same way that most non-ASCII characters for proper English--the quotes, the dashes, the accents--are often ignored because they're too hard to enter. But if you're going to add italics, having it in Unicode doesn't make it significantly easier, particularly when they need to support systems that predate Unicode adding italics. > The biggest burden would be to the apps that would benefit, to add italicizing and editing capabilities. If they would benefit or if they'd accept the burden, they'd have already added italics, via HTML or Markdown or escape sequences or whatever. -- Kie ekzistas vivo, ekzistas espero. From unicode at unicode.org Fri Jan 25 06:07:21 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Fri, 25 Jan 2019 07:07:21 -0500 Subject: Ancient Greek apostrophe marking elision Message-ID: There seems some debate amongst digital classicists in whether to use U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking elision. (e.g. ?? for ?? preceding a word starting with a vowel). It seems to me that U+2019 is the technically correct choice per the Unicode Standard but it is not without at least one problem: default word breaking rules. I'm trying to provide guidelines for digital classicists in this regard. Is it correct to say the following: 1) U+2019 is the correct character to use for the apostrophe in Ancient Greek when marking elision. 2) U+02BC is a misuse of a modifier for this purpose 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word token 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules in UAX#29 will (incorrectly) include the apostrophe as part of a glyph cluster with the previous letter 5) The correct solution is to tailor the Word Boundary Rules in the case of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't have the same ambiguity problems with the single quotation mark as in English as it should not be used as a quotation mark in Ancient Greek) Many thanks in advance. James -------------- next part -------------- An HTML attachment was scrubbed... 
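A rough way to see the property difference behind points 3) and 4) above, using only the Python 3 standard library. This is not a UAX #29 implementation; the \w+ pattern is only a crude stand-in for a word segmenter, and the sample string is invented:

    import re
    import unicodedata

    for apostrophe in ("\u2019", "\u02BC"):
        word = "d" + apostrophe          # stand-in for an elided form
        print("U+%04X" % ord(apostrophe),
              unicodedata.category(apostrophe),   # Pf for U+2019, Lm for U+02BC
              re.findall(r"\w+", word))
    # For U+2019 (category Pf) the apostrophe is split off: ['d'].
    # For U+02BC (category Lm) the whole two-character string is one token.

A real UAX #29 word segmenter is more forgiving than \w+ (as Mark Davis's reply below notes, U+2019 between letters does not break a word), but at the end of a word, which is where elision before a vowel puts it, the Pf-versus-Lm difference is exactly what separates the two choices.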
URL: From unicode at unicode.org Fri Jan 25 03:06:35 2019 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Fri, 25 Jan 2019 09:06:35 +0000 (GMT) Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> Message-ID: <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> Asmus Freytag wrote; > Other schemes, like a VS per code point, also suffer from being > different in philosophy from "standard" rich text approaches. Best > would be as standard extension to all the messaging systems (e.g. a > common markdown language, supported by UI). A./ Yet that claim of what would be best would be stateful and statefulness is the very thing that Unicode seeks to avoid. Plain text is the basic system and a Variation Selector mechanism after each character that is to become italicized is not stateful and can be implemented using existing OpenType technology. If an organization chooses to develop and use a rich text format then that is a matter for that organization and any changing of formatting of how italics are done when converting between plain text and rich text is the responsibility of the organization that introduces its rich text format. Twitter was just an example that someone introduced along the way, it was not the original request. Also this is not only about messaging. Of primary importance is the conservation of texts in plain text format, for example, where a printed book has one word italicized in a sentence and the text is being transcribed into a computer. William Overington Friday 25 January 2019 From unicode at unicode.org Fri Jan 25 11:34:40 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 25 Jan 2019 18:34:40 +0100 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: U+2019 is normally the character used, except where the ? is considered a letter. When it is between letters it doesn't cause a word break, but because it is also a right single quote, at the end of words there is a break. Thus in a phrase like ?tryin? to go? there is a word break after the n, because one can't tell. So something like "?? ??????" (picking a phrase at random) would have a word break after the delta. Word break: ?? ?????? However, there is no *line break* between them (which is the more important operation in normal usage). Probably not worth tailoring the word break. Line break: ?? ?????? Mark On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode < unicode at unicode.org> wrote: > There seems some debate amongst digital classicists in whether to use > U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking > elision. (e.g. ?? for ?? preceding a word starting with a vowel). > > It seems to me that U+2019 is the technically correct choice per the > Unicode Standard but it is not without at least one problem: default word > breaking rules. > > I'm trying to provide guidelines for digital classicists in this regard. > > Is it correct to say the following: > > 1) U+2019 is the correct character to use for the apostrophe in Ancient > Greek when marking elision. 
> 2) U+02BC is a misuse of a modifier for this purpose > 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary > Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word > token > 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules in > UAX#29 will (incorrectly) include the apostrophe as part of a glyph cluster > with the previous letter > 5) The correct solution is to tailor the Word Boundary Rules in the case > of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't > have the same ambiguity problems with the single quotation mark as in > English as it should not be used as a quotation mark in Ancient Greek) > > Many thanks in advance. > > James > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 11:39:47 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Fri, 25 Jan 2019 12:39:47 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: Thank you, although the word break does still affect things like double-clicking to select. And people do seem to want to use U+02BC for this reason (and I'm trying to articulate why that isn't what U+02BC is meant for). James On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ?? wrote: > U+2019 is normally the character used, except where the ? is considered a > letter. When it is between letters it doesn't cause a word break, but > because it is also a right single quote, at the end of words there is a > break. Thus in a phrase like ?tryin? to go? there is a word break after the > n, because one can't tell. > > So something like "?? ??????" (picking a phrase at random) would have a > word break after the delta. > > Word break: > ?? ?????? > > However, there is no *line break* between them (which is the more > important operation in normal usage). Probably not worth tailoring the word > break. > > Line break: > ?? ?????? > > Mark > > > On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode < > unicode at unicode.org> wrote: > >> There seems some debate amongst digital classicists in whether to use >> U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking >> elision. (e.g. ?? for ?? preceding a word starting with a vowel). >> >> It seems to me that U+2019 is the technically correct choice per the >> Unicode Standard but it is not without at least one problem: default word >> breaking rules. >> >> I'm trying to provide guidelines for digital classicists in this regard. >> >> Is it correct to say the following: >> >> 1) U+2019 is the correct character to use for the apostrophe in Ancient >> Greek when marking elision. >> 2) U+02BC is a misuse of a modifier for this purpose >> 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary >> Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word >> token >> 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules >> in UAX#29 will (incorrectly) include the apostrophe as part of a glyph >> cluster with the previous letter >> 5) The correct solution is to tailor the Word Boundary Rules in the case >> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't >> have the same ambiguity problems with the single quotation mark as in >> English as it should not be used as a quotation mark in Ancient Greek) >> >> Many thanks in advance. 
>> >> James >> > -- *James Tauber* Greek Linguistics: https://jktauber.com/ Music Theory: https://modelling-music.com/ Digital Tolkien: https://digitaltolkien.com/ Twitter: @jtauber -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 12:05:40 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 25 Jan 2019 18:05:40 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: For U+2019, there's a note saying 'this is the preferred character to use for apostrophe'. Mark Davis wrote, > When it is between letters it doesn't cause a word break, ... Some applications don't seem to get that.? For instance, the spellchecker for Mozilla Thunderbird flags the string "aren" for correction in the word "aren?t", which suggests that users trying to use preferred characters may face uphill battles. From unicode at unicode.org Fri Jan 25 15:26:33 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 25 Jan 2019 21:26:33 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: <20190125212633.147193ac@JRWUBU2> On Fri, 25 Jan 2019 12:39:47 -0500 James Tauber via Unicode wrote: > Thank you, although the word break does still affect things like > double-clicking to select. > > And people do seem to want to use U+02BC for this reason (and I'm > trying to articulate why that isn't what U+02BC is meant for). It's a bit tricky when the reason is that it was too hard to get users of English to make a distinction between U+02BC and U+2019. And for Larry Niven's elephant-like aliens in _Footfall__, is _fi'_, the singular of _fithp_, better written with U+02BC or U+2019? And does the phonetically faithful spelling of Estuarine English _fi'_ for _fit_ depend on whether the glottal stop is dropped? The science-fiction ethnonym _Vl'harg_ is also tricky. Does its elegant encoding depend on whether the apostrophe is a vowel symbol (so U+02BC) or the indication of an omitted vowel (so U+2019)? Richard. From unicode at unicode.org Fri Jan 25 15:59:58 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Fri, 25 Jan 2019 13:59:58 -0800 Subject: Encoding italic In-Reply-To: <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> Message-ID: <6d3e4948-648a-2883-c2f1-aa60559677f2@ix.netcom.com> On 1/25/2019 1:06 AM, wjgo_10009 at btinternet.com wrote: > Asmus Freytag wrote; > >> Other schemes, like a VS per code point, also suffer from being >> different in philosophy from "standard" rich text approaches. Best >> would be as standard extension to all the messaging systems (e.g. a >> common markdown language, supported by UI).???? A./ > > Yet that claim of what would be best would be stateful and > statefulness is the very thing that Unicode seeks to avoid. All rich text is stateful, and rich text is very widely used and cut&paste tends to work rather well among applications that support it, as do conversions of entire documents. 
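What a stateful, run-based representation of italics looks like in its simplest form can be sketched in a few lines of Python 3. The function name and the (start, end) run representation are invented for illustration; real rich-text models carry far more than this.

    from html import escape

    def runs_to_html(text, italic_runs):
        """Serialize plain text plus italic runs to HTML <i> markup."""
        out, pos = [], 0
        for start, end in sorted(italic_runs):
            out.append(escape(text[pos:start]))
            out.append("<i>" + escape(text[start:end]) + "</i>")
            pos = end
        out.append(escape(text[pos:]))
        return "".join(out)

    print(runs_to_html("one word italicized", [(4, 8)]))
    # -> one <i>word</i> italicized

Going the other way, from per-character "italic code points" back to runs like these, is the normalization work referred to earlier in the thread; and the run model also shows why a converter cannot know by itself whether a given run should become <i>, <em>, <cite>, or something else.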
Trying to duplicate it with "yet another mechanism" is a doubtful achievement, even if it could be made "stateless". A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 16:02:25 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Fri, 25 Jan 2019 17:02:25 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190125212633.147193ac@JRWUBU2> References: <20190125212633.147193ac@JRWUBU2> Message-ID: I guess U+02BC is category Lm not Mn, but doesn't that still mean it modifies the previous character (i.e. is really part of the same grapheme cluster) and so isn't appropriate as either a vowel or an indication of an omitted vowel? On Fri, Jan 25, 2019 at 4:30 PM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Fri, 25 Jan 2019 12:39:47 -0500 > James Tauber via Unicode wrote: > > > Thank you, although the word break does still affect things like > > double-clicking to select. > > > > And people do seem to want to use U+02BC for this reason (and I'm > > trying to articulate why that isn't what U+02BC is meant for). > > It's a bit tricky when the reason is that it was too hard to get users > of English to make a distinction between U+02BC and U+2019. And for > Larry Niven's elephant-like aliens in _Footfall__, is _fi'_, the > singular of _fithp_, better written with U+02BC or U+2019? And does > the phonetically faithful spelling of Estuarine English _fi'_ for > _fit_ depend on whether the glottal stop is dropped? > > The science-fiction ethnonym _Vl'harg_ is also tricky. Does its elegant > encoding depend on whether the apostrophe is a vowel symbol (so > U+02BC) or the indication of an omitted vowel (so U+2019)? > > Richard. > -- *James Tauber* Greek Linguistics: https://jktauber.com/ Music Theory: https://modelling-music.com/ Digital Tolkien: https://digitaltolkien.com/ Twitter: @jtauber -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 16:03:52 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 25 Jan 2019 14:03:52 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 16:06:57 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 25 Jan 2019 14:06:57 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: <8d8f81ea-88d9-8382-f8b1-5b648eda9923@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 17:49:15 2019 From: unicode at unicode.org (Andrew Cunningham via Unicode) Date: Sat, 26 Jan 2019 10:49:15 +1100 Subject: Encoding italic In-Reply-To: <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> Message-ID: Assuming some mechanism for italics is added to Unicode, when converting between the new plain text and HTML there is insufficient information to correctly convert to HTML. 
many elements may have italic stying and there would be no meta information in Unicode to indicate the appropriate HTML element. On Friday, 25 January 2019, wjgo_10009 at btinternet.com via Unicode < unicode at unicode.org> wrote: > Asmus Freytag wrote; > > Other schemes, like a VS per code point, also suffer from being different >> in philosophy from "standard" rich text approaches. Best would be as >> standard extension to all the messaging systems (e.g. a common markdown >> language, supported by UI). A./ >> > > Yet that claim of what would be best would be stateful and statefulness is > the very thing that Unicode seeks to avoid. > > Plain text is the basic system and a Variation Selector mechanism after > each character that is to become italicized is not stateful and can be > implemented using existing OpenType technology. > > If an organization chooses to develop and use a rich text format then that > is a matter for that organization and any changing of formatting of how > italics are done when converting between plain text and rich text is the > responsibility of the organization that introduces its rich text format. > > Twitter was just an example that someone introduced along the way, it was > not the original request. > > Also this is not only about messaging. Of primary importance is the > conservation of texts in plain text format, for example, where a printed > book has one word italicized in a sentence and the text is being > transcribed into a computer. > > William Overington > Friday 25 January 2019 > > -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 18:18:32 2019 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Fri, 25 Jan 2019 16:18:32 -0800 Subject: Encoding italic In-Reply-To: References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> Message-ID: <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> On 1/25/2019 3:49 PM, Andrew Cunningham wrote: > Assuming some mechanism for italics is added to Unicode,? when > converting between the new plain text and HTML there is insufficient > information to correctly convert to HTML. many elements may have > italic stying and there would be no meta information in Unicode to > indicate the appropriate HTML element. > > So, we would be creating an interoperability issue. A./ > > > On Friday, 25 January 2019, wjgo_10009 at btinternet.com > via Unicode > wrote: > > Asmus Freytag wrote; > > Other schemes, like a VS per code point, also suffer from > being different in philosophy from "standard" rich text > approaches. Best would be as standard extension to all the > messaging systems (e.g. a common markdown language, supported > by UI).? ? ?A./ > > > Yet that claim of what would be best would be stateful and > statefulness is the very thing that Unicode seeks to avoid. > > Plain text is the basic system and a Variation Selector mechanism > after each character that is to become italicized is not stateful > and can be implemented using existing OpenType technology. 
> > If an organization chooses to develop and use a rich text format > then that is a matter for that organization and any changing of > formatting of how italics are done when converting between plain > text and rich text is the responsibility of the organization that > introduces its rich text format. > > Twitter was just an example that someone introduced along the way, > it was not the original request. > > Also this is not only about messaging. Of primary importance is > the conservation of texts in plain text format, for example, where > a printed book has one word italicized in a sentence and the text > is being transcribed into a computer. > > William Overington > Friday 25 January 2019 > > > > -- > Andrew Cunningham > lang.support at gmail.com > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 25 20:36:27 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 26 Jan 2019 02:36:27 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <20190125212633.147193ac@JRWUBU2> Message-ID: <20190126023627.4962951e@JRWUBU2> On Fri, 25 Jan 2019 17:02:25 -0500 James Tauber via Unicode wrote: > I guess U+02BC is category Lm not Mn, but doesn't that still mean it > modifies the previous character (i.e. is really part of the same > grapheme cluster) and so isn't appropriate as either a vowel or an > indication of an omitted vowel? To quote TUS: "A few may modify the following letter, and some may serve as a independent letters". Bear in mind that one of the uses of U+02BC is the scholarly representation of a glottal stop, especially in Arabic names. Richard. From unicode at unicode.org Sat Jan 26 00:12:25 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Sat, 26 Jan 2019 01:12:25 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190126023627.4962951e@JRWUBU2> References: <20190125212633.147193ac@JRWUBU2> <20190126023627.4962951e@JRWUBU2> Message-ID: On Fri, Jan 25, 2019 at 9:41 PM Richard Wordingham via Unicode < unicode at unicode.org> wrote: > To quote TUS: > > "A few may modify the following letter, and some may serve as a > independent letters". > > Bear in mind that one of the uses of U+02BC is the scholarly > representation of a glottal stop, especially in Arabic names. > Okay, so this legitimises the use of U+02BC (with its better word-breaking properties) for the apostrophe marking elision in Ancient Greek even though U+2019 is stated as the preferred character _in general_ for the apostrophe. On balance, this would seem to suggest U+02BC can (and perhaps should) be used for the specific purpose in Ancient Greek. (Of course, the other character that comes up is U+1FBD, but there the consensus seems strong that this is just plain wrong.) Thank you all. James -------------- next part -------------- An HTML attachment was scrubbed... 
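The Lm-versus-Mn point in the exchange above is easy to check mechanically. A small Python 3 comparison (U+0313 is used here only as a convenient example of a true combining mark):

    import unicodedata

    for cp in (0x02BC,   # MODIFIER LETTER APOSTROPHE
               0x0313):  # COMBINING COMMA ABOVE (a genuine Mn mark)
        ch = chr(cp)
        print("U+%04X" % cp,
              unicodedata.category(ch),    # Lm vs Mn
              unicodedata.combining(ch))   # canonical combining class: 0 vs 230

U+02BC is a spacing letter with combining class 0: under the default rules it does not attach to the preceding base character but stands as (part of) a word in its own right, which is the behaviour the quoted exchange converges on.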
URL: From unicode at unicode.org Sat Jan 26 00:39:42 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 26 Jan 2019 06:39:42 +0000 Subject: Encoding italic In-Reply-To: <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <4d481bb6-7a56-0a92-d4d7-1f620ecb0fc6@gmail.com> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> Message-ID: <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> On 2019-01-26 12:18 AM, Asmus Freytag (c) responded: > On 1/25/2019 3:49 PM, Andrew Cunningham wrote: >> Assuming some mechanism for italics is added to Unicode,? when >> converting between the new plain text and HTML there is insufficient >> information to correctly convert to HTML. many elements may have >> italic stying and there would be no meta information in Unicode to >> indicate the appropriate HTML element. >> >> > So, we would be creating an interoperability issue. > > What happens now when we convert plain-text to HTML? From unicode at unicode.org Sat Jan 26 00:42:36 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 26 Jan 2019 06:42:36 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <8d8f81ea-88d9-8382-f8b1-5b648eda9923@ix.netcom.com> References: <8d8f81ea-88d9-8382-f8b1-5b648eda9923@ix.netcom.com> Message-ID: <83391552-c908-ce3e-d9b1-9126a427dfe8@gmail.com> On 2019-01-25 10:06 PM, Asmus Freytag via Unicode wrote: > James, by now it's unclear whether your ' is 2019 or 02BC. The example word "aren't" in previous message used U+2019.? Sorry if I was unclear. From unicode at unicode.org Sat Jan 26 05:02:58 2019 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sat, 26 Jan 2019 12:02:58 +0100 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: > breaking selection for "d'Artagnan" or "can't" into two is overly fussy. True, and that is not what U+2019 does; it does not break medially. Mark On Fri, Jan 25, 2019 at 11:07 PM Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 1/25/2019 9:39 AM, James Tauber via Unicode wrote: > > Thank you, although the word break does still affect things like > double-clicking to select. > > And people do seem to want to use U+02BC for this reason (and I'm trying > to articulate why that isn't what U+02BC is meant for). > > For normal edition operations, breaking selection for "d'Artagnan" or > "can't" into two is overly fussy. > > No wonder people get frustrated. > > A./ > > James > > On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ?? wrote: > >> U+2019 is normally the character used, except where the ? is considered a >> letter. When it is between letters it doesn't cause a word break, but >> because it is also a right single quote, at the end of words there is a >> break. Thus in a phrase like ?tryin? to go? there is a word break after the >> n, because one can't tell. >> >> So something like "?? ??????" (picking a phrase at random) would have a >> word break after the delta. >> >> Word break: >> ?? ?????? >> >> However, there is no *line break* between them (which is the more >> important operation in normal usage). 
Probably not worth tailoring the word >> break. >> >> Line break: >> ?? ?????? >> >> Mark >> >> >> On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode < >> unicode at unicode.org> wrote: >> >>> There seems some debate amongst digital classicists in whether to use >>> U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking >>> elision. (e.g. ?? for ?? preceding a word starting with a vowel). >>> >>> It seems to me that U+2019 is the technically correct choice per the >>> Unicode Standard but it is not without at least one problem: default word >>> breaking rules. >>> >>> I'm trying to provide guidelines for digital classicists in this regard. >>> >>> Is it correct to say the following: >>> >>> 1) U+2019 is the correct character to use for the apostrophe in Ancient >>> Greek when marking elision. >>> 2) U+02BC is a misuse of a modifier for this purpose >>> 3) However, use of U+2019 (unlike U+02BC) means the default Word >>> Boundary Rules in UAX#29 will (incorrectly) exclude the apostrophe from the >>> word token >>> 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules >>> in UAX#29 will (incorrectly) include the apostrophe as part of a glyph >>> cluster with the previous letter >>> 5) The correct solution is to tailor the Word Boundary Rules in the case >>> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't >>> have the same ambiguity problems with the single quotation mark as in >>> English as it should not be used as a quotation mark in Ancient Greek) >>> >>> Many thanks in advance. >>> >>> James >>> >> > > -- > *James Tauber* > Greek Linguistics: https://jktauber.com/ > Music Theory: https://modelling-music.com/ > Digital Tolkien: https://digitaltolkien.com/ > > Twitter: @jtauber > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 26 05:45:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 26 Jan 2019 11:45:19 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: Mark Davis responded to Asmus Freytag, >> breaking selection for "d'Artagnan" or "can't" into two is overly fussy. > > True, and that is not what U+2019 does; it does not break medially. Mark Davis earlier posted this example, > So something like "?? ??????" (picking a phrase at random) would have > a word break after the delta. If the user wanted to use the preferred character, U+2019, would using the no break space (U+00A0) after it resolve the word or line break issues?? Or possibly NNBSP (U+202F)? It's a shame if users choose suboptimal characters over preferred characters because of what are essentially rendering/text selection issues.? IMO, it's better to use preferred characters in the long run. (Users should file bug reports on applications which improperly medially break strings which include U+2019.) From unicode at unicode.org Sat Jan 26 09:45:54 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 26 Jan 2019 15:45:54 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Perhaps I'm not understanding, but if the desired behavior is to prohibit both line and word breaks in the example string, then... In Notepad, replacing U+0020 with U+00A0 removes the line-break. U+0020 ( ?? ?????? ) U+00A0 ( ????????? ) U+202F ( ????????? 
) It also changes the advancement of the text cursor (Ctrl + arrows), suggesting that word/string selection would be as desired.? (U+202F also does this and may offer a more pleasing appearance to classisists by default.) Wouldn't it be best to handle substitution of U+00A0 for U+0020 at the input method / keyboard driver level where appropriate, so that preferred apostrophe U+2019 can be used? From unicode at unicode.org Sat Jan 26 17:52:28 2019 From: unicode at unicode.org (James Tauber via Unicode) Date: Sat, 26 Jan 2019 18:52:28 -0500 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Message-ID: Well, *my* desire it to simple know whether to tell people doing digital editions of Ancient Greek texts whether to use U+2019 or U+02BC for the apostrophe marking elision (or at least accurately describe the trade-offs of each). On Sat, Jan 26, 2019 at 10:50 AM James Kass via Unicode wrote: > > Perhaps I'm not understanding, but if the desired behavior is to > prohibit both line and word breaks in the example string, then... > > In Notepad, replacing U+0020 with U+00A0 removes the line-break. > U+0020 ( ?? ?????? ) > U+00A0 ( ?? ?????? ) > U+202F ( ????????? ) > It also changes the advancement of the text cursor (Ctrl + arrows), > suggesting that word/string selection would be as desired. (U+202F also > does this and may offer a more pleasing appearance to classisists by > default.) > > Wouldn't it be best to handle substitution of U+00A0 for U+0020 at the > input method / keyboard driver level where appropriate, so that > preferred apostrophe U+2019 can be used? > > -- *James Tauber* Eldarion | jktauber.com (Greek Linguistics) | Modelling Music | Digital Tolkien -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 26 18:32:43 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 27 Jan 2019 00:32:43 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Message-ID: I?ll be publishing a translation of Alice into Ancient Greek in due course. I will absolutely only use U+2019 for the apostrophe. It would be wrong for lots of reasons to use U+02BC for this. Moreover, implementations of U+02BC need to be revised. In the context of Polynesian languages, it is impossible to use U+02BC if it is _identical_ to U+2019. Readers cannot work out what is what. I will prepare documentation on this in due course. > On 26 Jan 2019, at 23:52, James Tauber via Unicode wrote: > > Well, my desire it to simple know whether to tell people doing digital editions of Ancient Greek texts whether to use U+2019 or U+02BC for the apostrophe marking elision (or at least accurately describe the trade-offs of each). From unicode at unicode.org Sat Jan 26 19:11:49 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 17:11:49 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: Message-ID: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Jan 26 19:15:18 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 01:15:18 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Message-ID: <20190127011518.0b7e2ace@JRWUBU2> On Sat, 26 Jan 2019 15:45:54 +0000 James Kass via Unicode wrote: > Perhaps I'm not understanding, but if the desired behavior is to > prohibit both line and word breaks in the example string, then... > > In Notepad, replacing U+0020 with U+00A0 removes the line-break. I believe the problem is that "?? ??????" should have non-blank *words*. With U+2019, one gets 3. Line-break suppressing spaces don't help with word-breaking, because they are not treated as letters. A clunky solution would be to have a sequence . However, there is no such thing as a 'control-joining-words' if one complies with the TUS injunction in Section 23.3, "The word joiner should be ignored in contexts other than line breaking". A robust, trainable spell-checker will treat this institutionally racist injunction with the contempt it deserves. It's interesting that the spellings "'bus" and "'phone" have died. They would once have hit the word-boundary problems when "bus" and "phone" were rejected. Richard. From unicode at unicode.org Sat Jan 26 19:37:39 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 01:37:39 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> Message-ID: <20190127013739.3eb50597@JRWUBU2> On Sun, 27 Jan 2019 00:32:43 +0000 Michael Everson via Unicode wrote: > I?ll be publishing a translation of Alice into Ancient Greek in due > course. I will absolutely only use U+2019 for the apostrophe. It > would be wrong for lots of reasons to use U+02BC for this. Please list them. Will your coding decision be machine readable for the readership? > Moreover, implementations of U+02BC need to be revised. In the > context of Polynesian languages, it is impossible to use U+02BC if it > is _identical_ to U+2019. Readers cannot work out what is what. I > will prepare documentation on this in due course. It looks as though you've found a new character - or a revived distinction. Richard. From unicode at unicode.org Sat Jan 26 19:43:57 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 01:43:57 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> Message-ID: <20190127014357.78efc612@JRWUBU2> On Sat, 26 Jan 2019 17:11:49 -0800 Asmus Freytag via Unicode wrote: > To make matters worse, users for languages that "should" use U+02BC > aren't actually consistent; much data uses U+2019 or U+0027. Ordinary > users can't tell the difference (and spell checkers seem not > successful in enforcing the practice). That appears to contradict Michael Everson's remark about a Polynesian need to distinguish the two visually. Richard. 
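The distinction drawn above, that line-break-suppressing spaces are not letters, can be read straight off the general categories; a quick Python 3 check:

    import unicodedata

    for cp in (0x0020, 0x00A0, 0x202F, 0x2060):
        ch = chr(cp)
        print("U+%04X" % cp, unicodedata.category(ch), unicodedata.name(ch))

    # U+0020 Zs SPACE
    # U+00A0 Zs NO-BREAK SPACE
    # U+202F Zs NARROW NO-BREAK SPACE
    # U+2060 Cf WORD JOINER

None of these is a letter: a no-break space or word joiner changes line-breaking behaviour, not what counts as a word, which is the objection raised above.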
From unicode at unicode.org Sat Jan 26 19:55:29 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 27 Jan 2019 01:55:29 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127014357.78efc612@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> Message-ID: <9280d0ca-88d7-728e-03aa-70d55486274e@gmail.com> Richard Wordingham replied to Asmus Freytag, >> To make matters worse, users for languages that "should" use U+02BC >> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary >> users can't tell the difference (and spell checkers seem not >> successful in enforcing the practice). > > That appears to contradict Michael Everson's remark about a Polynesian > need to distinguish the two visually. Does it? U+02BC /should/ be used but ordinary users can't tell the difference because the glyphs in their displays are identical, resulting in much data which uses U+2019 or U+0027.? I don't see any contradiction. From unicode at unicode.org Sat Jan 26 19:59:27 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 27 Jan 2019 01:59:27 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127013739.3eb50597@JRWUBU2> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> Message-ID: <3ccba01a-cabf-8e54-1983-fb12b6bb9ef4@gmail.com> Richard Wordingham responded to Michael Everson, >> I?ll be publishing a translation of Alice into Ancient Greek in due >> course. I will absolutely only use U+2019 for the apostrophe. It >> would be wrong for lots of reasons to use U+02BC for this. > > Please list them. Let's see the list of reasons why U+02BC should be used first. From unicode at unicode.org Sat Jan 26 20:06:57 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 27 Jan 2019 02:06:57 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127014357.78efc612@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> Message-ID: <144A0313-1566-4C3E-862F-C7C313881B65@evertype.com> Polynesians are using 0027 as a fallback, and this has to do with education, keyboarding, and training. The typography of the fallback is of no consequence. It?s a fallback. > On 27 Jan 2019, at 01:43, Richard Wordingham via Unicode wrote: > > On Sat, 26 Jan 2019 17:11:49 -0800 > Asmus Freytag via Unicode wrote: > >> To make matters worse, users for languages that "should" use U+02BC >> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary >> users can't tell the difference (and spell checkers seem not >> successful in enforcing the practice). > > That appears to contradict Michael Everson's remark about a Polynesian > need to distinguish the two visually. > > Richard. From unicode at unicode.org Sat Jan 26 20:25:41 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 27 Jan 2019 02:25:41 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127013739.3eb50597@JRWUBU2> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> Message-ID: On 27 Jan 2019, at 01:37, Richard Wordingham via Unicode wrote: > >> I?ll be publishing a translation of Alice into Ancient Greek in due >> course. I will absolutely only use U+2019 for the apostrophe. It >> would be wrong for lots of reasons to use U+02BC for this. > > Please list them. The Greek use is of an apostrophe. 
Often a mark elision (as here), that?s what 2019 is for. 02BC is a letter. Usually a glottal stop. I didn?t follow the beginning of this. Evidently it has something to do with word selection of d? + a space + what follows. If that?s so, then there?s no argument at all for 02BC. It?s a question of the space, and that?s got nothing to do with the identity of the apostrophe. > Will your coding decision be machine readable for the readership? I don?t know what you mean by ?readable?. >> Moreover, implementations of U+02BC need to be revised. In the >> context of Polynesian languages, it is impossible to use U+02BC if it >> is _identical_ to U+2019. Readers cannot work out what is what. I >> will prepare documentation on this in due course. > > It looks as though you've found a new character - or a revived > distinction. It may not be ?revived?. In origin, linguists took the lead-type 2019 and used it as a consonant letter. Now, in the 21st century, where Harry Potter is translated into Hawaiian, and where Harry Potter has glottals alongside both single and double quotation marks, the 02BC?s need to be bigger or the text can?t be read easily. In our work we found that a vertical height of 140% bigger than the quotation mark improved legibility hugely. Fine typography asks for some other alterations to the glyph, but those are cosmetic. If the recommended glyph for 02BC were to be changed, it would in no case impact adversely on scientific linguistics texts. It would just make the mark a bit bigger. But for practical use in Polynesian languages where the character has to be found alongside the quotation marks, a glyph distinction must be made between this and punctuation. Michael Everson From unicode at unicode.org Sat Jan 26 20:26:23 2019 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 27 Jan 2019 02:26:23 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <3ccba01a-cabf-8e54-1983-fb12b6bb9ef4@gmail.com> References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <3ccba01a-cabf-8e54-1983-fb12b6bb9ef4@gmail.com> Message-ID: <6BD6F978-B123-4F97-9E50-1A8189A36787@evertype.com> Fair enough, but I didn?t wait. > On 27 Jan 2019, at 01:59, James Kass via Unicode wrote: > > > Richard Wordingham responded to Michael Everson, > > >> I?ll be publishing a translation of Alice into Ancient Greek in due > >> course. I will absolutely only use U+2019 for the apostrophe. It > >> would be wrong for lots of reasons to use U+02BC for this. > > > > Please list them. > > Let's see the list of reasons why U+02BC should be used first. > From unicode at unicode.org Sat Jan 26 21:53:06 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 03:53:06 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <9280d0ca-88d7-728e-03aa-70d55486274e@gmail.com> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <9280d0ca-88d7-728e-03aa-70d55486274e@gmail.com> Message-ID: <20190127035306.27d7a124@JRWUBU2> On Sun, 27 Jan 2019 01:55:29 +0000 James Kass via Unicode wrote: > Richard Wordingham replied to Asmus Freytag, > > >> To make matters worse, users for languages that "should" use > >> U+02BC aren't actually consistent; much data uses U+2019 or > >> U+0027. Ordinary users can't tell the difference (and spell > >> checkers seem not successful in enforcing the practice). 
> > > > That appears to contradict Michael Everson's remark about a > > Polynesian need to distinguish the two visually. > > Does it? > > U+02BC /should/ be used but ordinary users can't tell the difference > because the glyphs in their displays are identical, resulting in much > data which uses U+2019 or U+0027.? I don't see any contradiction. I had assumed that Polynesians would be writing with paper and ink. It depends on what 'tell the difference' means. In normal parlance it means that they are unaware of the difference in the symbols; you are assuming that it means that printed material doesn't show the difference. In general, handwritten differences can show up in various ways. For example, one can find a slight, unreliable difference in the relative positioning of characters that reflects the difference in the usage of characters. Of course, Asmus's facts have to be unreliable. It's like someone typing U+1142A NEWA LETTER MHA for Sanskrit , which we've been assured would never happen. There must be something wrong with reality. Richard. From unicode at unicode.org Sat Jan 26 23:11:36 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 21:11:36 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127014357.78efc612@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 26 23:23:04 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 21:23:04 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 26 23:28:50 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 21:28:50 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127035306.27d7a124@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <9280d0ca-88d7-728e-03aa-70d55486274e@gmail.com> <20190127035306.27d7a124@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 27 00:08:31 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 27 Jan 2019 06:08:31 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> Message-ID: <20190127060831.2e96572d@JRWUBU2> On Sat, 26 Jan 2019 21:11:36 -0800 Asmus Freytag via Unicode wrote: > On 1/26/2019 5:43 PM, Richard Wordingham via Unicode wrote: >> That appears to contradict Michael Everson's remark about a >> Polynesian >> need to distinguish the two visually. > Why do you need to distinguish them? To code text correctly (so the > invisible properties are what the software expects) or because a > human reader needs the disambiguation in order to follow the text? > The latter phenomenon is so common throughout many writing systems, > that I have difficulties buying it. It may be a matter of literacy in Hawaiian. If the test readership doesn't use ?okina, it could be confusing to have to resolve the difference between a sentence(?) starting with one from a sentence in single quotes. Otherwise, one does wonder why the issue should only arise now. 
One other possibility is that single quote punctuation is being used on a readership used to double quote punctuation. Double quotes would avoid the confusion. > PS: I wasn't talking about what the Polynesians do; different part of > the world. Why should the Polynesians be different? Richard. From unicode at unicode.org Sun Jan 27 00:19:50 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 26 Jan 2019 22:19:50 -0800 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127060831.2e96572d@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> Message-ID: <973744ae-797d-18da-c5b3-1fd3ccd229cf@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 27 02:02:07 2019 From: unicode at unicode.org (Andrew Cunningham via Unicode) Date: Sun, 27 Jan 2019 19:02:07 +1100 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <973744ae-797d-18da-c5b3-1fd3ccd229cf@ix.netcom.com> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> <973744ae-797d-18da-c5b3-1fd3ccd229cf@ix.netcom.com> Message-ID: On Sunday, 27 January 2019, Asmus Freytag via Unicode wrote: > > Choice of quotation marks is language-based and for novels, many times > there are > additional conventions that may differ by publisher. > > Wonder why the publisher is forcing single quotes on them > In theory quotation marks are language based but many languages have had the puntuation and typographic conventions of colonial languages imposed, even when it isn't the best choice. And publishers are following established patterns. The publishers that care about the language do try to distinguish or refine these characters typographically. Andrew -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 27 09:08:19 2019 From: unicode at unicode.org (Tom Gewecke via Unicode) Date: Sun, 27 Jan 2019 08:08:19 -0700 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <20190127060831.2e96572d@JRWUBU2> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> Message-ID: <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org> > On Jan 26, 2019, at 11:08 PM, Richard Wordingham via Unicode wrote: > > It may be a matter of literacy in Hawaiian. If the test readership > doesn't use ?okina, I think the Unicode Hawaiian ?okina is supposed to be U+02BB (instead of U+02BC). From unicode at unicode.org Sun Jan 27 09:37:37 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 27 Jan 2019 15:37:37 +0000 Subject: Ancient Greek apostrophe marking elision In-Reply-To: <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org> References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org> Message-ID: <745b8264-7e20-6a73-199a-6de4e9cab66a@gmail.com> On 2019-01-27 3:08 PM, Tom Gewecke via Unicode wrote: > I think the Unicode Hawaiian ?okina is supposed to be U+02BB (instead > of U+02BC). 
notes for U+02BB
 * typographical alternate for 02BD or 02BF
 * used in Hawai'ian orthography as 'okina (glottal stop)

From unicode at unicode.org Sun Jan 27 10:08:20 2019
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Sun, 27 Jan 2019 16:08:20 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <20190127052149.1baaf1b2@JRWUBU2>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2>
Message-ID:

On 27 Jan 2019, at 05:21, Richard Wordingham wrote:

>>>> I'll be publishing a translation of Alice into Ancient Greek in due course. I will absolutely only use U+2019 for the apostrophe. It would be wrong for lots of reasons to use U+02BC for this.
>>>
>>> Please list them.
>>
>> The Greek use is of an apostrophe. Often a mark of elision (as here), that's what 2019 is for.
>>
>> 02BC is a letter. Usually a glottal stop.
>
> So it would seem that the 'lots of reasons' is just that it goes against the *recommendation* of TUS.

I have no idea what TUS says about this. I did not look it up. I know a lot about characters, though.

> Incidentally, I believe the principal use of U+2019 RIGHT SINGLE QUOTATION MARK is as a quotation mark.

You can believe what you like, but that isn't likely true. In books which prefer "this kind" of quotation marks for primary quotations and 'this kind' for nested quotations, 2019 is primarily used for the apostrophe in words like I'm, can't, isn't, don't etc. In books which prefer 'this kind' for primary quotations, the statistics for 2019 will be different. But 2019 is still the correct character for both.

> As you have noted in the text left in below, U+02BC started out as the apostrophe.

Lead-type typesetters used that sort, yes. And that sort was used for both apostrophe and single quotation marks.

> The closing single inverted comma has a different origin to the apostrophe.

No, it doesn't, but you are welcome to try to prove your assertion.

> My argument for U+02BC is that this apostrophe is an integral part of the word.

It is a letter. In 'can't' the apostrophe isn't a letter. It's a mark of elision. I can double-click on the three words in this paragraph which have the apostrophe in them, and they are all whole-word selected.

> The main constituents of a prototypical word are letters and their attendant marks. Now, the word-breaking algorithm in TR29 allows for various generally overloaded elements to join elements of a word. However, this apostrophe does not mark the boundary of constituents. Accordingly it makes sense to treat it as a letter.

The behaviour of 2019 is not broken. I use it every day. I've typeset many, many books in English and Cornish and Irish, all of which use single quotation marks and double quotation marks and lots and lots of apostrophes, and I have no trouble with them. 2019 has for decades been treated correctly in software that I use.

> Treating the Greek apostrophe as a letter (U+02BC) gives better word-breaking.

Why do you claim this? I did not read the beginning of this thread and I am not going to try to find it. What is the problem you claim to have? In what software? On what platform?

> I don't see any downside in treating it like a Polynesian glottal stop.

I do. And to try to replace the apostrophe in English can't and don't and all is doomed to fail. Doomed. Moreover there are good practical reasons to change the glyph for the Polynesian letter.

When I typeset Greek, I will use 2019 for the apostrophe.
> Is someone going to tell me there is an advantage in treating "men's" as one word but "dogs'" as two? As I've said, the argument for encoding English apostrophes as U+2019 is that even with adequate keyboards, users cannot be relied upon to distinguish U+02BC and U+2019 - especially with no feedback. A writing system should choose one and stick with it. User unreliability forces a compromise.

Polynesian users need 02BC to be visually distinguished from 2019. European users don't need the apostrophe to be visually distinguished from 2019. The edge case of "dogs'" doesn't convince me. In all my years of typesetting I have never once noticed this, much less considered it a problem that needed fixing.

> Now, if text processors were to enable a difference, then the arguments would change. I for one find it helpful that Microsoft Word is willing to display visible symbols for spaces and tab characters so that I know what white space is composed of.

Most word-processing and typesetting programs will do this. Quark and InDesign do. Word and LibreOffice and Apple Pages do.

>> I didn't follow the beginning of this. Evidently it has something to do with word selection of d' + a space + what follows. If that's so, then there's no argument at all for 02BC. It's a question of the space, and that's got nothing to do with the identity of the apostrophe.
>
> The word selection issue is that except before a letter, the standard word-breaking algorithm says that there is a word boundary between the delta and apostrophe.

Well, that's the expected behaviour for a character which is polyvalent. If you have problems double-clicking "d' Artagnan" you should probably just write "d'Artagnan".

>>> Will your coding decision be machine readable for the readership?
>>
>> I don't know what you mean by 'readable'.
>
> Will the difference between U+02BC and U+2019 be discernible by the readers?

They should be, in Polynesian languages. Otherwise the text isn't easily legible.

> If one could copy a phrase to a general application and select a word by double-clicking, then the difference would be visible.

If you know what the behaviour is then you can take it into account when you are copying a word. You can't fix this by character encoding. Certainly not by screwing with 02BC.

> If the result of the publishing is simply a printed book, then your choice of U+2019 or U+02BC will depend only on font differences.

That non-argument can be applied to everything.

> Not that it makes much difference to the issue, but isn't the correct encoding for the ʻokina U+02BB MODIFIER LETTER TURNED COMMA?

Yes, but both 02BB and 02BC are used in linguistic transcriptions and in Polynesian languages, and the graphic identity with 2018 and 2019 is problematic and unnecessary. Using 02BC for the apostrophe is a mistake, in my view.

Michael Everson

From unicode at unicode.org Sun Jan 27 10:11:12 2019
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Sun, 27 Jan 2019 16:11:12 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org>
References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org>
Message-ID: <95DB9A94-B1E8-4CEA-929D-9AB3018CCF93@evertype.com>

Yes, yes. It doesn't matter. The discussion applies to both the two quotation marks and the two modifier letters.
> On 27 Jan 2019, at 15:08, Tom Gewecke via Unicode wrote:
>
>> On Jan 26, 2019, at 11:08 PM, Richard Wordingham via Unicode wrote:
>>
>> It may be a matter of literacy in Hawaiian. If the test readership doesn't use ʻokina,
>
> I think the Unicode Hawaiian ʻokina is supposed to be U+02BB (instead of U+02BC).

From unicode at unicode.org Sun Jan 27 11:32:42 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Sun, 27 Jan 2019 12:32:42 -0500
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2>
Message-ID: <2214f4d5-e440-69f0-18c2-be06aa064171@kli.org>

Well, sure; some languages work better with some fonts. There's nothing wrong with saying that 02BC might look the same as 2019... but it's nice, when writing Hawaiian (or Klingon for that matter) to use a bigger glyph. That's why they pay typesetters the big bucks (you wish): to make things look good on the page.

I recall in early Volapük, ʼ was a letter (presumably 02BC), with value /h/. And the "capital" ʼ was the same, except bolder: see https://archive.org/details/cu31924027111453/page/n11 (entry 4, on the left-hand page).

~mark

On 1/27/19 12:23 AM, Asmus Freytag via Unicode wrote:
> On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
>> the 02BC's need to be bigger or the text can't be read easily. In our work we found that a vertical height of 140% bigger than the quotation mark improved legibility hugely. Fine typography asks for some other alterations to the glyph, but those are cosmetic.
>> If the recommended glyph for 02BC were to be changed, it would in no case impact adversely on scientific linguistics texts. It would just make the mark a bit bigger. But for practical use in Polynesian languages where the character has to be found alongside the quotation marks, a glyph distinction must be made between this and punctuation.
>
> It somehow seems to me that an evolution of the glyph shape of 02BC in a direction of increased distinction from U+2019 is something that Unicode has indeed made possible by a separate encoding. However, that evolution is a matter of ALL the language communities that use U+02BC as part of their orthography, and definitely NOT something where Unicode can be permitted to take a lead. Unicode does not *recommend* glyphs for letters.
>
> However, as a publisher, you are of course free to experiment and to see whether your style becomes popular.
>
> There is a concern though, that your choice may appeal only to some languages that use this code point and not become universally accepted.
>
> A./

From unicode at unicode.org Sun Jan 27 11:38:39 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Sun, 27 Jan 2019 12:38:39 -0500
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2>
Message-ID: <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org>

On 1/27/19 11:08 AM, Michael Everson via Unicode wrote:
> It is a letter. In 'can't' the apostrophe isn't a letter. It's a mark of elision. I can double-click on the three words in this paragraph which have the apostrophe in them, and they are all whole-word selected.

That doesn't work when I try it: I double-click on the "a" in "can't" and get only the "can" selected.

This does not necessarily prove anything; my software (Thunderbird) is arguably doing it wrong.
~mark

From unicode at unicode.org Sun Jan 27 12:19:28 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Jan 2019 18:19:28 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org>
Message-ID: <20190127181928.2d5225a4@JRWUBU2>

On Sun, 27 Jan 2019 12:38:39 -0500 "Mark E. Shoulson via Unicode" wrote:

> On 1/27/19 11:08 AM, Michael Everson via Unicode wrote:
>> It is a letter. In 'can't' the apostrophe isn't a letter. It's a mark of elision. I can double-click on the three words in this paragraph which have the apostrophe in them, and they are all whole-word selected.
>
> That doesn't work when I try it: I double-click on the "a" in "can't" and get only the "can" selected.
>
> This does not necessarily prove anything; my software (Thunderbird) is arguably doing it wrong.

Except that Unicode-compliant processes aren't required to follow the scheme of TR29, Unicode Text Segmentation. However, it is only required to select the whole word because the U+2019 is followed by a letter.

TR29 prescribes different behaviour for "dogs'" with U+2019 (interpret as two 'words') and U+02BC (interpret as one word). The GTK-based email client I'm using has that difference, but also fails with "don't" unless one uses U+02BC.

However LibreOffice treats "don't" as a single word for U+0027, U+02BC and U+2019, but "dogs'" as a single word only for U+02BC. This complies with TR29. I'm not surprised, as LibreOffice does use or has used ICU.

Richard.

From unicode at unicode.org Sun Jan 27 12:19:52 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Jan 2019 18:19:52 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <95DB9A94-B1E8-4CEA-929D-9AB3018CCF93@evertype.com>
References: <6feb077e-016f-f3bf-4fc7-55d8585ac935@ix.netcom.com> <20190127014357.78efc612@JRWUBU2> <20190127060831.2e96572d@JRWUBU2> <69C2A193-6EC2-4BC2-8B25-85CB83524347@bluesky.org> <95DB9A94-B1E8-4CEA-929D-9AB3018CCF93@evertype.com>
Message-ID: <20190127181952.6cd1ba46@JRWUBU2>

On Sun, 27 Jan 2019 16:11:12 +0000 Michael Everson via Unicode wrote:

> Yes, yes. It doesn't matter. The discussion applies to both the two quotation marks and the two modifier letters.

Actually, there is a difference. As the ʻokina doesn't occur at the end of a word in Hawaiian, one only strictly needs a contrast at the beginning of a word - unless Hawaiian makes significant use of the apostrophe for abbreviation. Unfortunately, U+02BB is worse than U+02BC from this perspective.

Richard.

From unicode at unicode.org Sun Jan 27 13:09:31 2019
From: unicode at unicode.org (James Tauber via Unicode)
Date: Sun, 27 Jan 2019 14:09:31 -0500
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <20190127181928.2d5225a4@JRWUBU2>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID:

On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> Except that Unicode-compliant processes aren't required to follow the scheme of TR29, Unicode Text Segmentation. However, it is only required to select the whole word because the U+2019 is followed by a letter.
> TR29 prescribes different behaviour for "dogs'" with U+2019 (interpret as two 'words') and U+02BC (interpret as one word). The GTK-based email client I'm using has that difference, but also fails with "don't" unless one uses U+02BC.
>
> However LibreOffice treats "don't" as a single word for U+0027, U+02BC and U+2019, but "dogs'" as a single word only for U+02BC. This complies with TR29. I'm not surprised, as LibreOffice does use or has used ICU.

This comes back to my original question that started this thread. Many people creating Ancient Greek digital resources use U+02BC seemingly because of incorrect word-breaking with *word-final* U+2019 (which is the only time it occurs in Ancient Greek and always marking elision, never as the end of a quotation).

I am trying to write guidelines as to why they should use U+2019. I'm convinced it's technically the right code point to use but am wanting to get my facts straight about how to address the word-breaking issue (specifically for word-final U+2019 in Ancient Greek, to be clear).

In my original post, I asked if a language-specific tailoring of the text segmentation algorithm was the solution but no one here has agreed so far.

Here's a concrete example from Smyth's Grammar:

??????? ??

Double-clicking on the first word should select the U+2019 as well. Interestingly on macOS Mojave it does in Pages[1] but not in Notes, the Terminal or here in Gmail on Chrome.

To be clear: when I say "should" I mean that that is the expectation classicists have and the failure to meet it is why some of them insist on using U+02BC.

I'm happy if the answer is "use U+2019 and go get your text segmentation implementations fixed"[2] but am looking for confirmation of that.

James

[1] To be honest, I was impressed Pages got it right.
[2] In the same spirit as "if certain combining character combinations don't work, the solution is not to add precomposed characters, it's to improve the fonts" or "tonos and oxia are the same and if they look different, it's the fault of your font".

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Jan 27 13:57:37 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 27 Jan 2019 19:57:37 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID:

On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
> In my original post, I asked if a language-specific tailoring of the text segmentation algorithm was the solution but no one here has agreed so far.

If there are likely to be many languages requiring exceptions to the segmentation algorithm wrt U+2019, then perhaps it would be better to establish conventions using ZWJ/ZWNJ and adjust the algorithm accordingly so that it would be cross-language. (Rather than requiring additional and open-ended language-specific tailorings.) (I inserted several combinations of ZWJ/ZWNJ into James Tauber's example, but couldn't improve the segmentation in LibreOffice, although it was possible to make it worse.)
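A minimal sketch of the behaviour under discussion, assuming Python with the PyICU bindings installed (none of the posters supplied this code): it prints the segments produced by ICU's default UAX #29 word-break rules, so the treatment of a word-final U+2019 versus U+02BC can be compared directly. The Greek sample is only a stand-in for any elided form and is not the garbled example from Smyth quoted above.

    # Illustrative sketch only; assumes the PyICU bindings (pip install PyICU).
    import icu

    def word_segments(text, locale="el"):
        # Default UAX #29 word boundaries as implemented by ICU.
        bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
        bi.setText(text)
        edges = [0] + list(bi)  # iterating the break iterator yields boundary offsets
        return [text[a:b] for a, b in zip(edges, edges[1:])]

    samples = [
        "\u1f00\u03bb\u03bb\u2019 \u1f10\u03b3\u03ce",  # placeholder Greek elision, final U+2019
        "\u1f00\u03bb\u03bb\u02bc \u1f10\u03b3\u03ce",  # same word with U+02BC
        "dogs\u2019",                                   # English possessive plural, U+2019
        "dogs\u02bc",                                   # same with U+02BC
    ]
    for s in samples:
        print(s, "->", word_segments(s))

Whether the trailing U+2019 ends up in its own segment is exactly the difference being argued over; keeping U+2019 and merging a word-final apostrophe back onto the preceding word would be the job of a language-specific tailoring or a post-processing step, not of the default rules.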
From unicode at unicode.org Sun Jan 27 14:03:00 2019
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 27 Jan 2019 20:03:00 +0000
Subject: Encoding italic
In-Reply-To: <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com>
Message-ID: <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>

A new beta of BabelPad has been released which enables the input, storage, and display of italics, bold, strikethrough, and underline in plain text using the tag characters method described earlier in this thread. This enhancement is described in the release notes linked on this download page:

http://www.babelstone.co.uk/Software/index.html

From unicode at unicode.org Sun Jan 27 14:17:45 2019
From: unicode at unicode.org (Tom Gewecke via Unicode)
Date: Sun, 27 Jan 2019 13:17:45 -0700
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID: <42122A2E-0EA8-40CB-98D1-582CB4D54CFA@bluesky.org>

> On Jan 27, 2019, at 12:09 PM, James Tauber via Unicode wrote:
>
> ??????? ??
>
> Double-clicking on the first word should select the U+2019 as well. Interestingly on macOS Mojave it does in Pages[1] but not in Notes

On my iPad/iPhone, Word does it correctly but Pages and Notes do not.

From unicode at unicode.org Sun Jan 27 15:00:40 2019
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Sun, 27 Jan 2019 21:00:40 +0000 (GMT)
Subject: Ancient Greek apostrophe marking elision
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2>
Message-ID:

On 2019-01-27, Michael Everson via Unicode wrote:
> On 27 Jan 2019, at 05:21, Richard Wordingham wrote:
>> The closing single inverted comma has a different origin to the apostrophe.
> No, it doesn't, but you are welcome to try to prove your assertion.

As far as I can tell from the easily accessible literature, the apostrophe derives from an in-line manuscript mark that is a point with a tail, while the quotation marks derive from a marginal mark shaped like an arrowhead (like modern guillemets). What is your story about them?

>> Is someone going to tell me there is an advantage in treating "men's" as one word but "dogs'" as two? As I've said, the argument for encoding English apostrophes as U+2019 is that even with adequate keyboards, users cannot be relied upon to distinguish U+02BC and U+2019 - especially with no feedback. A writing system should choose one and stick with it. User unreliability forces a compromise.
>
> Polynesian users need 02BC to be visually distinguished from 2019. European users don't need the apostrophe to be visually distinguished from 2019. The edge case of "dogs'" doesn't convince me. In all my years of typesetting I have never once noticed this, much less considered it a problem that needed fixing.
You have a very low opinion of Polynesian users. People (as opposed to computers) use context to remove ambiguity. Before we had to interact with pedantic computers, we were rarely confused by the typewriter-induced confusion of 1 and l and 0 and O (or, indeed, the use of symmetrical quotation marks).

Now a sensible orthographic choice for a language using comma-like letters would be to use guillemets for quotation, and while I don't know (there being precious few modern Polynesian materials online), I would guess that the languages of French Polynesia do that. If, like Hawaiian, you're stuck with English-style quotation marks for historical reasons, an obvious typographic solution is to thin-space them, French-style. (See previous thread!) That seems visually preferable to relying on a small difference in size of what is already a small letter compared to everything else on the page.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

From unicode at unicode.org Sun Jan 27 15:30:55 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 27 Jan 2019 22:30:55 +0100
Subject: Ancient Greek apostrophe marking elision
In-Reply-To: <2214f4d5-e440-69f0-18c2-be06aa064171@kli.org>
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <2214f4d5-e440-69f0-18c2-be06aa064171@kli.org>
Message-ID:

For Volapük, it looks much more like U+02BE (right half ring modifier letter) than like U+02BC (apostrophe "modifier" letter), according to the PDF on https://archive.org/details/cu31924027111453/page/n12

The half ring makes a clear distinction with the regular apostrophe (for elisions) or quotation marks. It is really used in this context as a modifier after other consonants for borrowing words *phonetically* from other languages, notably after 'c' and 'l'. Then U+02BD (left half ring "modifier" letter) is a regular letter (for transliterating the aspirated 'h' from English). But I'm curious about the diacritic used above 'h' on item (5) ("ta") of that page to transliterate the English soft "th".

But this was describing the "Labas" orthography. In the next chapter ("Noms Tonabas"), another convention is used for the apostrophe-like letters, and U+02BE (right half ring modifier letter) is used instead of U+02BD for the aspirated 'h' (see paragraph 18), but it is said to use the "Greek mark" (not sure if the author meant the coronis, U+1FBD, or the smooth breathing, U+1FBF).

So it looks like these were various early adaptations of the basic Volapük orthography to borrow foreign names (notably proper names for people, trademarks, toponyms and other place names), and these were part of several competing proposals. I'm curious to know if there was finally a wide enough consensus to standardize these.

So it seems that for Volapük the apostrophe-like letters are not formally assigned; authors will use whichever one they want when they transliterate foreign words, or will simply avoid transliterating them completely if they already exist natively in a Latin form. (I bet English is not transliterated at all, and French or German accents are preserved as-is if they are already part of the basic alphabet; the only standard diacritic is then the "diaeresis", as used in the German umlaut. Volapük does not need any true diaeresis to avoid the formation of diphthongs and digraphs; all its orthography uses a single base letter as a foundation principle.)
If so, the first convention, using the apostrophe-like modifier to create digraphs, is probably not favored, and the Tonabas convention is probably more convenient and more compliant with those principles. I don't think they will ever use the Greek signs or letters directly (like the one used for transliterating the English 'ng'), and they would now prefer using the Latin Eng letter.

The right half ring, being rarely supported, is now most probably represented using U+02BC (for both letter cases, ignoring the bolder style of the capital variant), which uses a curved comma shape (with a filled bowl at the top). If there is a case distinction, the same glyph would be used but at a different height instead of a bold distinction, or the distinction would be made using the alternate forms of the comma (probably the wedge for lowercase, and the bowl with curl for capitals).

Note: Are the different shapes of the comma (and similar apostrophe-like letters, or even the semicolon) distinguished with encoded variant selectors?

On Sun, 27 Jan 2019 at 18:42, Mark E. Shoulson via Unicode <unicode at unicode.org> wrote:

> Well, sure; some languages work better with some fonts. There's nothing wrong with saying that 02BC might look the same as 2019... but it's nice, when writing Hawaiian (or Klingon for that matter) to use a bigger glyph. That's why they pay typesetters the big bucks (you wish): to make things look good on the page.
>
> I recall in early Volapük, ʼ was a letter (presumably 02BC), with value /h/. And the "capital" ʼ was the same, except bolder: see https://archive.org/details/cu31924027111453/page/n11 (entry 4, on the left-hand page).
>
> ~mark
>
> On 1/27/19 12:23 AM, Asmus Freytag via Unicode wrote:
>> On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
>>> the 02BC's need to be bigger or the text can't be read easily. In our work we found that a vertical height of 140% bigger than the quotation mark improved legibility hugely. Fine typography asks for some other alterations to the glyph, but those are cosmetic.
>>> If the recommended glyph for 02BC were to be changed, it would in no case impact adversely on scientific linguistics texts. It would just make the mark a bit bigger. But for practical use in Polynesian languages where the character has to be found alongside the quotation marks, a glyph distinction must be made between this and punctuation.
>>
>> It somehow seems to me that an evolution of the glyph shape of 02BC in a direction of increased distinction from U+2019 is something that Unicode has indeed made possible by a separate encoding. However, that evolution is a matter of ALL the language communities that use U+02BC as part of their orthography, and definitely NOT something where Unicode can be permitted to take a lead. Unicode does not *recommend* glyphs for letters.
>>
>> However, as a publisher, you are of course free to experiment and to see whether your style becomes popular.
>>
>> There is a concern though, that your choice may appeal only to some languages that use this code point and not become universally accepted.
>>
>> A./

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Jan 27 17:21:31 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Jan 2019 23:21:31 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID: <20190127232131.077a4448@JRWUBU2>

On Sun, 27 Jan 2019 14:09:31 -0500 James Tauber via Unicode wrote:

> On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>> However LibreOffice treats "don't" as a single word for U+0027, U+02BC and U+2019, but "dogs'" as a single word only for U+02BC. This complies with TR29. I'm not surprised, as LibreOffice does use or has used ICU.
>
> This comes back to my original question that started this thread.

Yes. I'm driving home the problem for those who somehow fail to understand your opening post.

> Here's a concrete example from Smyth's Grammar:
>
> ??????? ??
>
> Double-clicking on the first word should select the U+2019 as well. Interestingly on macOS Mojave it does in Pages[1] but not in Notes, the Terminal or here in Gmail on Chrome.
>
> To be clear: when I say "should" I mean that that is the expectation classicists have and the failure to meet it is why some of them insist on using U+02BC.
>
> I'm happy if the answer is "use U+2019 and go get your text segmentation implementations fixed"[2] but am looking for confirmation of that.

The problem with that approach is that it assumes one can have a language-sensitive implementation, and that that will suffice.

Smyth's grammar gives the concrete example, ???????? ???. It contains the word ????. Should double-clicking the first Greek word in the paragraph above select it? That's not going to work if the paragraph above is considered to be in English. And what about double-clicking the third Greek word? What should that select? Or is that paragraph ungrammatical?

To fix the problem with the possessive plural "dogs'" with U+2019, one has to parse enough of the paragraph to distinguish an apostrophe from a closing single inverted comma. Moreover, it assumes that end-of-word apostrophes will not be included in a span bounded by single inverted commas. I may observe such a rule, but I don't remember being taught it.

In Unicode 2.0 the apostrophe was U+02BC; it was changed to U+2019 in Unicode 2.1. The justification I could find for the change is in the Unicore thread (members only) starting at https://www.unicode.org/mail-arch/unicore-ml/y1997-A/0185.html . The justification recorded there was merely that:

1) Windows and Mac Latin character sets had equivalents of U+0027, to which the 'letter apostrophe' was mapped, and U+2019, which was used for single quotes.
2) The 'punctuation apostrophe' was being mapped to U+2019 by the 'smart quote' apparatus.
3) For consistency, the 'punctuation apostrophe' should therefore be encoded by U+2019 instead of U+02BC.

This argument didn't persuade everyone even then, and it feels even weaker now. Perhaps I just have the problem that I don't see a sharp difference between the letter apostrophe and the punctuation apostrophe.
For example, when the pronunciation of English "letter" with a glottal stop as the intervocalic consonant is represented in writing as something like "le'er", is it a letter apostrophe because it's a glottal stop, or a punctuation apostrophe because the 'tt' is dropped?

The issue arises in the orthography of Finnish. The genitive singular of _keko_ 'a pile' is _keon_ - the 'k' is 'dropped' because of consonant gradation. However, regularly, the genitive singular of _raaka_ 'raw' is _raa'an_, where the U+0027 I wrote represents an apostrophe and is pronounced as a glottal stop. Is this a letter apostrophe or a punctuation apostrophe? The 'k' has been dropped by the same rule, but because of the vowel pattern it is replaced by a glottal stop and written with an apostrophe. English Wiktionary chooses U+2019: the Finnish Wiktionary ducks the issue and uses U+0027.

Richard.

From unicode at unicode.org Sun Jan 27 17:38:40 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Jan 2019 23:38:40 +0000
Subject: Ancient Greek apostrophe marking elision
In-Reply-To:
References: <0c990a9e-9954-1e7f-3131-729ba8fdf4e7@gmail.com> <20190127013739.3eb50597@JRWUBU2> <20190127052149.1baaf1b2@JRWUBU2> <640d6662-45d4-3104-0628-dfc9f61d94ab@kli.org> <20190127181928.2d5225a4@JRWUBU2>
Message-ID: <20190127233840.72bd25cb@JRWUBU2>

On Sun, 27 Jan 2019 19:57:37 +0000 James Kass via Unicode wrote:

> On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
>> In my original post, I asked if a language-specific tailoring of the text segmentation algorithm was the solution but no one here has agreed so far.
>
> If there are likely to be many languages requiring exceptions to the segmentation algorithm wrt U+2019, then perhaps it would be better to establish conventions using ZWJ/ZWNJ and adjust the algorithm accordingly so that it would be cross-language. (Rather than requiring additional and open-ended language-specific tailorings.) (I inserted several combinations of ZWJ/ZWNJ into James Tauber's example, but couldn't improve the segmentation in LibreOffice, although it was possible to make it worse.)

If you look at TR29, you will see that ZWJ should only affect word boundaries for emoji. ZWNJ shall have no effect. What you want is a control that joins words, but we don't have that.

Richard.

From unicode at unicode.org Sun Jan 27 17:44:18 2019
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 28 Jan 2019 00:44:18 +0100
Subject: Encoding italic
In-Reply-To: <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>
References: <197c84bf-266e-d802-c443-9717de5b2899@gaultney.org> <36805103-528d-1686-4ed4-060f3666cbdf@orange.fr> <93fc8728-20ba-db60-aec6-e9693696c48a@gaultney.org> <5b06e0fc-9fc6-e714-1b9e-8309a0c5c6c3@gmail.com> <74abf741-3382-5753-7476-a5463b2e2fb7@gmail.com> <277dce38-00d2-750a-f553-3354e06f4076@ix.netcom.com> <003001d4b47d$a3d628a0$eb8279e0$@xencraft.com> <2a993124.1d13.1688442c8e7.Webtop.71@btinternet.com> <26a4dbe1-7eb9-7d1d-e3ed-1cfe2793711e@ix.netcom.com> <6ef58528-66ca-1be4-aa01-90ebbd5229bd@gmail.com> <645cf608-0781-0147-00cc-49aa3866f9a9@gmail.com>
Message-ID:

You're not very explicit about the Tag encoding you use for these styles. Of course it must not be a language tag, so the introducer is not U+E0001, nor a cancel-all tag, so it is not prefixed by U+E007F. It also cannot use letter-like, digit-like, and hyphen-like tag characters for its introduction.
So probably you use some prefix in U+E0002..U+E001F and some additional tag (tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S" for strikethrough?) and the cancel tag to return to normal text (terminate the tagged sequence).

Or maybe you just use standard HTML encoding by adding U+E0000 to each character of the HTML tag syntax (including attributes and close tags, allowing embedding?). So you use the "<" and ">" tag characters (possibly also the space tag U+E0020, or TAB tag U+E0009, for separating attributes, and the quotation tags for attribute values)? Is your proposal also allowing the embedding of other HTML objects (such as SVG)?

In that case what you do is only to remap the HTML syntax outside the standard text. If an attribute value contains standard text (such as ...) do you also remap the attribute value, i.e. "Some text"? Do you remap the technical name of the HTML tag itself, i.e. "span" in the last example?

And what then is the advantage compared to standard HTML (it is not more compact, and just adds another layer on top of it), except allowing it to be embedded in places where plain HTML would be restricted by form inputs, or would be reconverted using character entities, hiding the effect of "<", ">" and "&" in HTML so they are not reinterpreted as HTML but as plain-text characters?

Now let's suppose that your convention starts being decoded and used in some applications; this could be used to transport sensitive active scripts (e.g. Javascript event handlers or plain