Another take on the English Apostrophe in Unicode

Tue Jun 16 12:02:26 CDT 2015

On Mon, Jun 15, 2015, 17:12, Doug Ewell  wrote:

> Marcel Schneider wrote:
[...]
>> Microsoft’s choice of mashing up apostrophe and close-quote to end up
>> with an unprocessable hybrid was wrong. Very wrong.

> Windows-1252 and the other Windows code pages were developed during the
> 1980s, before Unicode, when almost all non-Asian character sets were
> limited to 256 code points. The distinctions between apostrophe and
> right-single-quote, weighed against the confusion caused by encoding two
> identical-looking characters, would never have been sufficient back then
> to justify separate encoding in this limited space.

I replied:

> The problem is not about code pages [...]

Thank you for your answers; I'll come back to some of them below. There is a new fact to bring in first.

I concede that my last reply, yesterday evening, was incorrect.

In addition to Microsoftʼs action in the late nineties, urging Unicode to give up its useful apostrophe recommendation (U+02BC), the design of code page Windows-1252 is indeed within my scope.

Since I learned that there are very good and overriding reasons to use U+02BC in English, and that Unicodeʼs recommendation to that effect was withdrawn in deference to a widespread practice founded on CP Windows-1252, I soon suspected that there would have been a way to get the apostrophe into this code page. Here I need to recall that I always liked Windows-1252 for completing the ISO 8859-1 charset (which was so useless* that it had to be replaced with ISO 8859-15).
* Please read this paper (in French):
http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf

Having closely examined CP1252ʼs layout, I found five empty code points, five code points left out, in the C1 ranges that Microsoft allocated to complete ISO 8859-1. Further, I found there two modifier letters, CIRCUMFLEX ACCENT (136, 0x88, later U+02C6) and SMALL TILDE (152, 0x98, U+02DC). Obviously these two were added to disambiguate between the extensively used spacing characters ^ (94, 0x5E) and ~ (126, 0x7E) on the one hand, and the diacritics on the other. It should be said that when Windows was first released, the left and right single quotes were the only printable characters in these two ranges. All other characters, plus × and ÷, came later. However, CP1252 has remained stable since Windows 98, for which € and the žŽ pair were added. And five places were left empty.

From this I became convinced that it would have been very easy to place the letter apostrophe at, for example, code point 144 (0x90), next to the single turned comma quotation mark 0x91 and the single comma quotation mark (right-single-quote) 0x92, which Microsoft recommended for use as apostrophe.
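
The layout described above can be checked with a short Python sketch. It assumes that Python's built-in "cp1252" codec follows the published Windows-1252 mapping; the helper name cp1252_c1_layout is only illustrative. It collects the bytes in 0x80..0x9F that have no character assigned and prints what the code points mentioned above decode to.

```python
# Sketch: inspect the 0x80..0x9F range of Windows-1252.
# Assumption: Python's built-in "cp1252" codec mirrors the published mapping.
import unicodedata


def cp1252_c1_layout():
    """Return (assigned, unassigned) for the bytes 0x80..0x9F.

    assigned maps each byte value to the character it decodes to;
    unassigned lists the byte values the codec refuses to decode.
    """
    assigned = {}
    unassigned = []
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            unassigned.append(byte)
        else:
            assigned[byte] = char
    return assigned, unassigned


assigned, unassigned = cp1252_c1_layout()
print("unassigned:", [f"0x{b:02X}" for b in unassigned])

# The two modifier letters and the two single quotation marks discussed above.
for byte in (0x88, 0x91, 0x92, 0x98):
    char = assigned[byte]
    print(f"0x{byte:02X} -> U+{ord(char):04X} {unicodedata.name(char)}")
```

With the codec as shipped in current Python versions, the unassigned bytes reported are 0x81, 0x8D, 0x8F, 0x90 and 0x9D, so the 0x90 slot next to the two single quotation marks is indeed one of the five empty places.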

As for the “confusion” everybody refers to, it must be said that the only way to get people confused is to do things without explaining anything to anybody.

The core problem was probably that code pages were designed with glyph-based *character* encoding in mind, not semantics-based *text* encoding.

I repeat that others had done even worse. Others, that is, some of the so-called expert members of the ISO WG designing 8859-1, two of whom did not even aim at encoding all needed characters, deliberately refusing to encode the lowercase and uppercase Œ digraph, and even the uppercase Ÿ. Microsoftʼs great merit was to provide a ready remedy for this bungling, which, as far as the OE digraph is concerned, was meant to match defective peripherals.

Unfortunately, Microsoft visibly didnʼt finish this job: it aimed at encoding characters only, and thus allocated no more than one code point to that squiggle, although several places were left.

Well, all of those are errors of the past. If I donʼt see a need, I wonʼt meet it. By leaving œ and Œ out of the charset, they got × and ÷ in, at least. Where things went really wrong was when Unicode came in and the Procrustean beds of code pages were out. At least, they should have been. Why, then, does the CP1252-based confusion survive?

Briefly, todayʼs text processing is suffering from the apostrophe-close-quote confusion. This confusion is firstly out of date, and secondly it was unnecessary from the very beginning. Avoiding it at a trivial level (by sparing users the trouble of having to use two similar squiggles) shifts it to the processing level, where the damage it causes is far greater. Trust me, users who find themselves unable to tell the apostrophes apart when theyʼre about to replace single quotes wonʼt bless Microsoft for the input simplicity! Ted Clancyʼs blog post is here to prove it.
https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/
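
To make the process-level damage concrete, here is a small Python sketch. The sample sentence and the helper single_to_double are hypothetical, made up for illustration; the only facts relied on are the Unicode general categories of U+2019 (Pf, final punctuation) and U+02BC (Lm, modifier letter). It shows a naive replacement of single quotation marks by double ones, and a plain word match, on text that uses U+2019 for the apostrophe versus text that uses U+02BC.

```python
# Sketch: what a naive text process sees with U+2019 versus U+02BC.
# The sample text and single_to_double() are illustrative assumptions.
import re
import unicodedata

LEFT_SINGLE_QUOTE = "\u2018"   # LEFT SINGLE QUOTATION MARK
RIGHT_SINGLE_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK
LETTER_APOSTROPHE = "\u02BC"   # MODIFIER LETTER APOSTROPHE


def single_to_double(text):
    """Naively replace single quotation marks with double ones."""
    return (text.replace(LEFT_SINGLE_QUOTE, "\u201C")
                .replace(RIGHT_SINGLE_QUOTE, "\u201D"))


# Same sentence, once with U+2019 as apostrophe, once with U+02BC.
ambiguous = "\u2018I don\u2019t know\u2019"
distinct = "\u2018I don\u02BCt know\u2019"

converted_ambiguous = single_to_double(ambiguous)  # apostrophe is hit too
converted_distinct = single_to_double(distinct)    # apostrophe is untouched

# A plain word match splits at punctuation but not at a letter.
words_ambiguous = re.findall(r"\w+", "don\u2019t")
words_distinct = re.findall(r"\w+", "don\u02BCt")

print(converted_ambiguous)
print(converted_distinct)
print(words_ambiguous, words_distinct)
print(unicodedata.category(RIGHT_SINGLE_QUOTE),
      unicodedata.category(LETTER_APOSTROPHE))
```

In the first case the apostrophe inside the word is turned into a closing double quote and the word is split in two; in the second case both operations leave the word intact, without any heuristics.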

It was time to get rid of that confusion when Unicode recommended U+02BC for apostrophe. Microsoftʼs choice not to comply was wrong again. Very wrong.

Let's come back to some of your replies.

On Mon, Jun 15, 2015, 20:14, Doug Ewell  wrote:

> I'd guess there are very few users who consciously see the use of U+2019
> as both apostrophe and right-single-quote as a vestige of code pages, or
> as a conscious effort by Evil Microsoft™ to force them into anything.

Quite so. These are habits, not constraints. I do not share such views about a battle between Google and Microsoft, or about ethical prefixes to attach to companies. The problem is that when the result proves to be bad, the idea was bad, too.

The mismatch between apostrophe and close-quote is now part of our culture. We must become pragmatic again and weigh the advantages and disadvantages of each option (ambiguating, disambiguating), not say "I believe there are no disadvantages in ambiguating" or "there is no reason to disambiguate" or "people will get confused, let them alone" or the like. These are all mere assertions. We must look at real people and listen to what they say to us. Ted Clancy is one of them. When he's worried about that malfunctioning of text processing, who will keep smiling and keep saying "There's no problem, there's no reason to fix that, it's all OK like it is"?

That is despising people; that is spitting in their faces.

> Perhaps a UTC member can confirm whether this is fact or speculation.
> Markus Kuhn's comment from 1999 about "couldn't Unicode follow
> Microsoft...?" doesn't prove that Unicode was in fact strong-armed by
> Microsoft.

Yes, please let us know.

Marcel Schneider