Another take on the English apostrophe in Unicode

Marcel Schneider charupdate at orange.fr
Mon Jun 15 04:49:59 CDT 2015


On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ☕️  wrote:

> On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider  wrote:

>> When we bring the topic back down from linguistics to the core mission of Unicode, that is, character encoding and the standardisation of text processing, the ellipsis and the Swedish abbreviation colon differ from the single closing quotation mark in that they are not subject to processing.

>> Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities.

>> Actually, the choice is between perpetuating confusion in word processing and confusing people for a short while by announcing that U+2019 for apostrophe was a mistake.


> Quite nice of you to inform me of the core mission of Unicode—I must have somehow missed that.

> More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, e.g., an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIAERESIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits.

> In practice, whenever characters are essentially identical (and by that I mean that the overlap between the acceptable glyphs for each character is very high), people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes.

> So we only separated essentially identical characters in limited cases: such as letters from different scripts.

 

It was also a very good idea to disambiguate the apostrophe from the single quote, and I feel the price was not too high, because it greatly simplified the processing of quotation marks in English. By that I mean replacing each pair of one kind with a pair of another kind. When I search for quotes in a text, I don't want to be distracted by apostrophes. Don't worry about equivalence classes: they already treat a word without an apostrophe as equivalent to the same letters with an apostrophe or quote in between. It is always better for the computer to know exactly what a character is, even if it need not tell us at output, than for it to come up with a useless mix-up.
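
To illustrate the quote-replacement point, here is a minimal Python sketch. It assumes a text in which apostrophes are encoded as U+02BC and quotation marks as U+2018/U+2019; the function name and the sample sentences are made up for the illustration. With that encoding, converting single quotes to double quotes is a plain character mapping, whereas the same mapping applied to a text that uses U+2019 for both purposes also hits the apostrophe.

```python
APOSTROPHE = "\u02BC"           # MODIFIER LETTER APOSTROPHE
LSQ, RSQ = "\u2018", "\u2019"   # single quotation marks
LDQ, RDQ = "\u201C", "\u201D"   # double quotation marks


def single_to_double(text: str) -> str:
    """Replace every single quotation mark with its double counterpart."""
    return text.translate({ord(LSQ): LDQ, ord(RSQ): RDQ})


# Apostrophe encoded as U+02BC: only the real quotes are touched.
disambiguated = f"{LSQ}Don{APOSTROPHE}t go,{RSQ} she said."

# Apostrophe encoded as U+2019: the apostrophe is converted as well.
ambiguous = f"{LSQ}Don{RSQ}t go,{RSQ} she said."

print(single_to_double(disambiguated))
print(single_to_double(ambiguous))
```

In the second case the apostrophe in the contraction becomes a closing double quote, which is the kind of mix-up that then has to be corrected by hand or by heuristics.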


 

You have just brought up another good idea, too: period-terminated abbreviations are listed as exceptions in word processors. Another list could contain all words with a leading apostrophe and all words with a trailing apostrophe. This might make it possible to filter search results and to separate apostrophes from single comma quotation marks for good. And at input, the smart-quote algorithms would become even smarter. Say, really smart.
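
The list idea could be sketched roughly as follows in Python. Everything here is an assumption made for the illustration: the two tiny word lists, the function name, and the rule that a U+2019 with a letter on both sides is always an apostrophe. A real list would be much longer, and a trailing-apostrophe word directly followed by a genuine closing quote would still be misclassified by this simple rule.

```python
import re

RSQ = "\u2019"         # RIGHT SINGLE QUOTATION MARK
APOSTROPHE = "\u02BC"  # MODIFIER LETTER APOSTROPHE

# Hypothetical, deliberately tiny exception lists (stored without the mark).
LEADING = {"tis", "twas", "em"}   # words written with a leading apostrophe
TRAILING = {"dogs", "boys"}       # words written with a trailing apostrophe


def disambiguate(text: str) -> str:
    """Re-encode U+2019 as U+02BC wherever it is judged to be an apostrophe."""
    chars = list(text)
    for match in re.finditer(RSQ, text):
        i = match.start()
        # Word characters directly before and after the mark (may be empty).
        before = re.search(r"\w*$", text[:i]).group().lower()
        after = re.match(r"\w*", text[i + 1:]).group().lower()
        word_internal = bool(before) and bool(after)
        listed_leading = not before and after in LEADING
        listed_trailing = not after and before in TRAILING
        if word_internal or listed_leading or listed_trailing:
            chars[i] = APOSTROPHE
    return "".join(chars)


sample = "\u2018\u2019Tis the dogs\u2019 dinner, isn\u2019t it?\u2019 he said."
print(disambiguate(sample))
```

In the sample, the three apostrophes are re-encoded and the final closing quote is left alone, so a later search for quotation marks no longer stops at the apostrophes.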


 

I don't believe people at work would mix up the letter apostrophe and the closing quote if both were on the keyboard. And even now that they aren't, people don't mix them up, because they just hit the apostrophe key, which, without any dumb smart-quote algorithm, always leads to visually satisfying results, as shown in the Unicode documentation. For good desktop publishing, people must work hard anyway, so it would be nice to give them the means, and not to overburden them with routine tasks caused by deficient text encoding.


 

The way things work today is not satisfactory as far as the English apostrophe is concerned. I still can't believe that the Unicode Committees were wrong when recommending U+02BC. Restoring this advantage today will be to the credit of all parties involved, and we and future generations will thank you very much.

 

If there are any.


 

Best regards,


Marcel Schneider

