a character for an unknown character

Marcel Schneider charupdate at orange.fr
Fri Dec 30 13:13:41 CST 2016


On Fri, 30 Dec 2016 12:37:27 +0000, Richard Wordingham wrote:
> On Fri, 30 Dec 2016 01:23:55 +0100 (CET) Marcel Schneider wrote:
> > On Wed, 28 Dec 2016 19:05:17 -0800, Asmus Freytag wrote: 
> > > On 12/28/2016 5:47 PM, Richard Wordingham wrote: 
> > U+02BC being shifted from a letter to a punctuation must have been 
> > anticipated at encoding, since the original recommendation was to use 
> > it as apostrophe throughout. Unifying the letter apostrophe and the 
> > punctuation apostrophe made IMO more sense—despite of the conflicting 
> > properties 
> What conflicts? Both prototypically mark absences. 

I meant the gc=Lm of U+02BC vs U+2019 that has gc=Pf.
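
For anyone who wants to check these two properties, here is a minimal sketch using Python's standard unicodedata module. The variable names are only illustrative, and the values reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# The two characters whose General_Category values are being compared.
letter_apostrophe = "\u02BC"   # MODIFIER LETTER APOSTROPHE
right_single_quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

gc_letter = unicodedata.category(letter_apostrophe)
gc_quote = unicodedata.category(right_single_quote)

print(unicodedata.name(letter_apostrophe), "gc=" + gc_letter)
print(unicodedata.name(right_single_quote), "gc=" + gc_quote)
```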

> The rationale seems to be that English uses both the punctuation 
> apostrophe and the U+2019 RIGHT SINGLE QUOTATION MARK. If users aren't 
> being trained to use U+2212 MINUS SIGN, and habitually disable grammar 
> and spell-checking, most won't make the right choice between U+02BC and 
> U+2019. 

I donʼt quite see why. The MINUS SIGN should be on the MINUS key, at the 
same level as the PLUS SIGN. (That makes it necessary to at least move 
the underscore to the 0x10/Option/AltGr level, where the PLUS key might have 
the PLUS-MINUS SIGN.) The inability to determine whether a punctuation mark is 
an apostrophe or a quotation mark is mostly found in computers, not humans.

> > Perhaps the letters for hexadecimal digits should have been encoded 
> > separately? 
> The idea has been rejected several times. 

Indeed that would have been useless. Where confusable, hex digits are 
prefixed or suffixed.

> > > > 5) The nightmare of spacing single and double dots. 
> > > ? spacing vs. combining? Not sure what you mean. 
> > I think Richard refers to U+2024 ONE DOT LEADER and U+2025 TWO DOT 
> > LEADER, along with U+002E FULL STOP. 
> That's not the half of it. For starters, just look at the confusables 
> for U+00B7 MIDDLE DOT: 
> U+2022 BULLET 

This is included out of an abundance of caution. Visually, • and · are distinct.
Iʼd be sad if there were no bullet; I also use it manually.

> U+2027 HYPHENATION POINT 

This is distinct in that it is centred on the lowercase letters, while the 
middle dot is centred on the uppercase letters.

> U+2219 BULLET OPERATOR 
> U+22C5 DOT OPERATOR 

According to the Wikipedia article "Interpunct", these are often silently replaced by U+00B7.
But they have gc=Sm. And U+2219 seems to be centred on digits, U+22C5 on lowercase letters.
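
The general categories of the dots in this list can be compared side by side. The following is a rough sketch with Python's standard unicodedata module; the helper name dot_categories is hypothetical, and the output reflects whichever Unicode version the interpreter ships with.

```python
import unicodedata

# Code points taken from the confusables list quoted in this message.
DOTS = [0x00B7, 0x2022, 0x2027, 0x2219, 0x22C5, 0x2E31, 0x30FB]

def dot_categories(code_points):
    """Map each code point to its (character name, general category)."""
    return {
        cp: (unicodedata.name(chr(cp)), unicodedata.category(chr(cp)))
        for cp in code_points
    }

categories = dot_categories(DOTS)
for cp, (name, gc) in categories.items():
    print(f"U+{cp:04X} gc={gc} {name}")
```

Only the two operators should come out as gc=Sm, which is why a silent replacement by U+00B7 changes the properties and not just the glyph.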

> U+2E31 WORD SEPARATOR MIDDLE DOT 
> U+30FB KATAKANA MIDDLE DOT 

These seem to me identical to U+00B7 and U+2022 respectively. Perhaps we are faced 
here with two examples of what Asmus referred to as “incorrectly encoded more 
than once” (talking of “Many other "simple" marks: lines, circles, triangles, 
hooks, and squares, or groups of them”). 
http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0115.html
I believe, however, that it was correct to make them dedicated characters for 
specific scripts, somewhat like ANO TELEIA.
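
One caveat on the comparison with ANO TELEIA: U+0387 has a canonical decomposition to U+00B7, so it does not survive normalization, whereas the two script-specific dots have no decomposition. A small sketch to illustrate this with Python's standard unicodedata module (variable names are illustrative only):

```python
import unicodedata

ano_teleia = "\u0387"      # GREEK ANO TELEIA
middle_dot = "\u00B7"      # MIDDLE DOT
word_separator = "\u2E31"  # WORD SEPARATOR MIDDLE DOT
katakana_dot = "\u30FB"    # KATAKANA MIDDLE DOT

# NFC applies canonical decomposition first, so ANO TELEIA becomes
# MIDDLE DOT; the other two characters are returned unchanged.
nfc_ano_teleia = unicodedata.normalize("NFC", ano_teleia)
nfc_word_separator = unicodedata.normalize("NFC", word_separator)
nfc_katakana_dot = unicodedata.normalize("NFC", katakana_dot)

print(nfc_ano_teleia == middle_dot)
print(nfc_word_separator == word_separator)
print(nfc_katakana_dot == katakana_dot)
```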

> There's an argument that the unification of U+00B7 and U+0387 ANO 
> TELEIA is a unification too far. A font for Greek may need to work out 
> which it is to position it correctly. 
> For double dots, there're the confusables for U+003A COLON: 
> U+05C3 HEBREW PUNCTUATION SOF PASUQ 
> U+2236 RATIO 
> There's a whole raft of visargas, some of which match and some of 
> which don't. What happened to the principle that diacritics are unified 
> by form? I suspect the answer is that encoding was established while 
> principles were still developing. 
> > > > As a result, I have no idea whether the singular of "fithp" (one 
> > > > of Larry Niven's alien species) should be spelt with U+02BC or 
> > > > U+2019, though in ASCII I can just write "fi'". 
> > 
> > Normally on an English or French keyboard layout, all three are 
> > accessed on live keys. 
> That accessibility is news to me -

It is really new to me too; it is the sort of thing a keyboard layout update brings. 
“Normally” here means the way it *should* be normal on English and French keyboards: 
on French ones because U+02BC is preferred in Breton; on English ones because U+02BC 
is preferred by one current of English spelling.

> normally I just have to fight a word processor if I want U+0027.

To improve the user experience here, one needs to move this from the autocorrect 
to the keyboard layout. In Word, one may disable the built-in bundle and add a custom 
autocorrect entry that always immediately transforms U+0027 to U+2019 or to U+02BC. 
Hitting Backspace then first brings U+0027 back. Quotation marks then require 
adding other autocorrect entries, using digraphs.

> However, I still don't know whether to spell the word «fiʼ» or «fi’».
> I've only seen it in print. 

That depends on the spelling convention. If the apostrophe and the single comma 
quote are disunified, then U+02BC is used to spell the word «fiʼ» (your first 
option). You might also wish to ask the publisher, but Iʼm unsure whether they 
will appreciate having to side publicly with one spelling current or the other.
(As for me, I normally use U+02BC in English and U+2019 in French, given the 
diverging quotation mark usages and apostrophe semantics. In French mode, I have 
the latter in the Base shift state and the former in the Shift shift state on 
the same key, but Iʼm developing another model where the letter apostrophe is 
at 0x10/Option/AltGr on a letter key in both French and Languages mode.)

Marcel


