Global apostrophe solution? (Part of: A new take on the English apostrophe in Unicode; Keyman Developer for free?; Input methods at the age of Unicode)

Thu Jul 23 03:50:11 CDT 2015

As I donʼt know whether the apostrophe issue** has been satisfactorily resolved, Iʼd like to briefly check on it by making a few statements to agree or disagree with:

1 - We are all allowed to use U+02BC for the English apostrophe.  U+2019 is only a de facto preference, mainly with respect to end-users and WYSIWYG word processing.  Unicode is thus a user-oriented standard.  However, we must also take font-related issues into consideration: U+02BC may be missing, or may vary in shape following different expectations, as in these three sans-serif fonts (tested in LibreOffice):

2 - UAX #29 is not intended to work well for English as is, so English implementations need to be tailored. Both statements are inferred from the Notes at § 4.1.1. This tailoring is however often not carried out, as we can deduce from the behavior of word processors that apply the UAX #29 recommendation:
| A further complication is the use of the same character as an apostrophe
| and as a quotation mark. Therefore leading or trailing apostrophes
| are best excluded from the default definition of a word.
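
To make the effect of excluding leading and trailing apostrophes concrete, here is a rough sketch. It does not implement UAX #29; it only uses the `\w` class of Pythonʼs standard `re` module as a much cruder stand-in for a word definition. The function name `words` is made up for this illustration. It shows how the letter apostrophe U+02BC stays inside the word in every position, while U+2019, being punctuation, is dropped:

```python
import re

LETTER_APOSTROPHE = "\u02bc"   # MODIFIER LETTER APOSTROPHE (a letter, Lm)
QUOTE_APOSTROPHE = "\u2019"    # RIGHT SINGLE QUOTATION MARK (punctuation, Pf)

def words(text):
    """Split text into 'words' with the crude rule \\w+ (NOT UAX #29)."""
    return re.findall(r"\w+", text)

sample = "{0}tis the boys{0} room, don{0}t you think"

# With U+02BC the apostrophe is part of the word wherever it stands.
print(words(sample.format(LETTER_APOSTROPHE)))

# With U+2019 it is treated as punctuation and cut off, even inside "don't".
print(words(sample.format(QUOTE_APOSTROPHE)))
```

A real UAX #29 implementation does better than this for the in-word case, since it keeps an in-word U+2019 inside the word; the leading and trailing cases are the ones the quoted note leaves out by default.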

3 - Since in English a leading U+2019 is never a quotation mark (as opposed to Scandinavian usage), leading apostrophes should be included in the word definition, at the same level as in-word apostrophes.  Only the trailing apostrophe of the possessive would end up being left out.  This however is inconsistent, so a complete tailoring of UAX #29 for English must include algorithms that take a trailing U+2019 as a quote only if it is preceded by U+2018 within a certain number of words... but this too is incomplete.
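
The pairing rule mentioned in point 3 can be sketched as follows. This is only an illustration under my own assumptions: the function name `classify_trailing`, the whitespace tokenization, the list of trailing punctuation and the default window of 10 words are all made up here, not taken from UAX #29 or from any product.

```python
OPEN_QUOTE = "\u2018"   # LEFT SINGLE QUOTATION MARK
AMBIGUOUS = "\u2019"    # RIGHT SINGLE QUOTATION MARK, also used as apostrophe

def classify_trailing(text, window=10):
    """Label each word-final U+2019 as 'quote' or 'apostrophe'.

    A trailing U+2019 counts as a closing quote only if an unmatched
    U+2018 opened a word at most `window` words earlier.
    """
    labels = []
    pending = []  # word positions of U+2018 not yet closed
    for pos, token in enumerate(text.split()):
        if token.startswith(OPEN_QUOTE):
            pending.append(pos)
        core = token.rstrip(".,;:!?")  # ignore sentence punctuation
        if len(core) > 1 and core.endswith(AMBIGUOUS):
            if pending and pos - pending[-1] <= window:
                pending.pop()
                labels.append((core, "quote"))
            else:
                labels.append((core, "apostrophe"))
    return labels

# Works when the possessive stands outside any quotation:
print(classify_trailing("The boys\u2019 toys are in the \u2018red box\u2019."))

# Fails when a possessive stands inside a quotation: the possessive is
# taken for the closing quote, and the real closing quote for an apostrophe.
print(classify_trailing("He called it \u2018the boys\u2019 room\u2019 again."))
```

The second call shows why such a rule remains incomplete: a possessive inside a quotation cannot be told apart from the closing quote by position alone.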

4 - Conversion of British single quotes to double quotes needs special processing to identify the closing quotes: applying a number of search rules and submitting each instance to the operator for validation.  This routine task is very annoying but remains limited to professionals (editors, typesetters), while the disambiguation of the apostrophe would affect the public as a whole.  As Marc Davis wrote on Mon, Jun 15, 2015 at 10:19 AM:

> In practice, whenever characters are essentially identical—and by that I mean that the overlap between the acceptable glyphs for each character is very high—people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway.

Consequently, the introduction of U+02BC in English usage would not produce reliable data.

5 - The use of angle quotation marks for quotations in English (both British and American) would eliminate the apostrophe problem and bring a number of substantial advantages:

+ Quotations, especially when consisting of single words, are better highlighted and can no longer be confused with scare quotes.

+ This may change peopleʼs psychological relationship to quotations and quoting, which could eventually improve the handling of intellectual property.  A certain threat in this domain, due to word processing and the internet, has been identified by Roman linguist Raffaele Simone.

+ British and American English would use the same quotes convention, so no quotes conversion would be necessary any longer.  This streamlining could facilitate exchanges, overcoming locale barriers while localesʼ “flavour” (Iʼm quoting, not scaring, hereʼs my source: http://babelstone.blogspot.fr/2006/03/unicode-character-names-part-2-name-is.html) would be preserved through word orthography.

+ Scare quotes would always have the same appearance, inside as well as outside of quotations. Their meaning is independent of quotation, so it seems consistent that they not be affected by their environment.

6 - Additionally, U+0027 could be preferred for highlighting words, a usage found in technical documents like the Unicode documentation.  (However, even the in-word apostrophe is in most cases represented by U+0027.)
As a result, U+2018 is no longer needed and should be strongly discouraged, at least in languages like English and French, to prevent U+2019 from being used as a quotation mark.  This is far easier and more feasible than completing all fonts with U+02BC, urging users to deal with *two* different but identical-looking “squiggles” (quotation), and tracking incorrect use. Having an old and a new quotation mark convention visibly side by side would probably be less cumbersome than having two apostrophes that look identical in most of the complete fonts but behave differently.

7 - As an input method for angle quotation marks, we can use autocorrect until this, along with nested quotes management, is implemented in word processors.  To achieve this, six entries may be required:
<  → «
«< → ‹
‹< → <
>  → »
»> → ›
›> → >
In Microsoft Word (which supports punctuation and symbols as autocorrect triggers), this gives the double quotes with one keystroke, the single quotes (less used) with two keystrokes, and finally the less-than/greater-than signs with three keystrokes.
Depending on user preferences, the less-than/greater-than signs may instead keep the single keystroke, in which case only four entries are required:
<< → «
«< → ‹
>> → »
»> → ›
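
To check that both tables cycle as described, here is a small simulation. It is only a sketch under assumptions of mine: it applies a rule after every keystroke and prefers the longest matching trigger, which is not necessarily how Microsoft Word evaluates its autocorrect entries. The names `SIX_ENTRIES`, `FOUR_ENTRIES` and `type_keys` are made up for this illustration.

```python
# The two autocorrect tables from point 7 (trigger -> replacement).
SIX_ENTRIES = {
    "<": "\u00ab", "\u00ab<": "\u2039", "\u2039<": "<",
    ">": "\u00bb", "\u00bb>": "\u203a", "\u203a>": ">",
}
FOUR_ENTRIES = {
    "<<": "\u00ab", "\u00ab<": "\u2039",
    ">>": "\u00bb", "\u00bb>": "\u203a",
}

def type_keys(keys, rules):
    """Simulate typing `keys` one at a time with autocorrect `rules`.

    After each keystroke the longest trigger matching the end of the
    text is replaced once; the replacement itself is not re-examined.
    """
    out = ""
    for key in keys:
        out += key
        for trigger in sorted(rules, key=len, reverse=True):
            if out.endswith(trigger):
                out = out[: -len(trigger)] + rules[trigger]
                break
    return out

print(type_keys("<Hi>", SIX_ENTRIES))      # double angle quotes, one keystroke each
print(type_keys("<<Hi>>", SIX_ENTRIES))    # single angle quotes, two keystrokes each
print(type_keys("a<<<b", SIX_ENTRIES))     # plain less-than sign, three keystrokes
print(type_keys("<<Hi>>", FOUR_ENTRIES))   # double angle quotes, two keystrokes each
print(type_keys("a<b", FOUR_ENTRIES))      # less-than sign stays on one keystroke
```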

For a solution working in *all* applications, we can program extended keyboard layouts, notably using Keyman Developer, a piece of software that I see as an important part of Unicode implementation thanks to its easy-to-understand and flexible layout programming, matching expectations that were voiced soon after the first releases of the Unicode Standard.

8 - I (or even: we) still do not know why the apostrophe has not been disambiguated from one of the quotation marks, while the hyphen-minus (mentioned in the parent thread) has been (U+2010 vs U+2212).  Iʼm not sure I buy the argument that “essential identity” (this is a derived quotation, not scaring!) can be deduced from glyphic resemblance.  And indeed this hasnʼt been done many times in Unicode history, given that the purpose is “to encode characters, not glyphs.”  The following quotation from TUS does not have exactly this meaning: (§ 1.3, p. 6) “the standard defines how characters are interpreted, not how glyphs are rendered”.

In the case of “that squiggle” '’', TUS doesnʼt fully define how it is interpreted, only whether itʼs a letter (U+02BC) or a punctuation mark (U+2019), but *not* whether itʼs an apostrophe or a single closing quote, even though the two are essentially different (not in appearance, but in what philosophers called “essentia”, which is “the being”).  They “are the same in outward form but different in essence.”  To prove that to ourselves, we may look at German usage: single quotes are U+201A and U+2018, the apostrophe is U+2019.  If the same principles had been applied, U+201A should have been merged with the comma, because we canʼt tell the difference: ‚,‚,‚, (the 1st, 3rd and 5th are quotation marks).  And here at least, the semantics would have been legible even for computers: a leading comma is a quote, a trailing comma is a comma.  The actual apostrophe convention in English has illegible semantics.
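
What the character properties do and do not say can be looked up with Pythonʼs standard `unicodedata` module. This is just a small lookup sketch; the helper name `describe` is made up for this illustration. The general category separates the letter (Lm) from the punctuation marks (Pf, Ps, Po), but nothing in it says “apostrophe” as opposed to “closing quote” for U+2019:

```python
import unicodedata

def describe(codepoint):
    """Return (U+XXXX, general category, character name) for a code point."""
    ch = chr(codepoint)
    return (f"U+{codepoint:04X}", unicodedata.category(ch), unicodedata.name(ch))

# The candidates discussed above, plus the German low quote and the comma.
for cp in (0x0027, 0x02BC, 0x2019, 0x201A, 0x002C):
    print(*describe(cp))
```

Note that U+201A and the comma, which look alike, do get different categories (Ps versus Po), which is the kind of distinction the apostrophe and the closing quote lack.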

The curly apostropheʼs misfortune might have been to be encoded at the same time as the curly quote, while the (curly) comma predated its curly quote counterpart.  Ultimately, the punctuation apostrophe has *not* been encoded in Unicode.  Hence the *original* recommendation to use the letter apostrophe, which is very consistent with English usage.  Moreover, we already learned that since 1983, the apostrophe may be considered the 27th letter of the Latin alphabet: http://unicode.org/pipermail/unicode/2015-June/001914.html

9 - By not encoding the punctuation apostrophe, Unicode could rely upon typographical tradition, realizing some economies of scale and making the Standard more end-user friendly in some way.  This however reflects a tendency to prioritize appearance.  In Unicode this tendency is far from omnipresent; it is surely very marginal, and its presence is due to the influence of the software industry, where that tendency is naturally more widespread for economic reasons: part of the demand on the usersʼ side treats appearance as good enough and asks for no more than that a given item look fine, no matter whatʼs behind it...

Actually, as far as the English apostrophe is concerned, the burden is shifted from input to processing.  Users can enter text without bothering, while on the other side, other people must work hard to fix a number of recurrent problems...

Now the goal would be to know whether part of the problem has been conveniently resolved, and whether there is agreement on some of the points listed above.  Ted Clancy and all who launched and responded to the parent thread are invited to share their feelings and how they see the topic today.

Best regards,

Marcel

** Note for archive readers:  Please refer to Ted Clancyʼs blogpost and the subsequent discussion:
http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0047.html