Re-use of Modifier Letters for Superscript Abbreviations (was: Re: a character for an unknown character)

Marcel Schneider charupdate at orange.fr
Sat Dec 31 04:45:02 CST 2016


On Sat, 31 Dec 2016 09:20:30 +0000, Richard Wordingham wrote:
[…]
> It's in a different universe, restricted to one book, namely Footfall. 

Thank you for the reference.

[…]
> Did you look in the article about Klingon, namely 
> https://en.wikipedia.org/wiki/Klingon_language , or 
> in the article about Klingons, namely 
> https://en.wikipedia.org/wiki/Klingon ? The quote is from the former. 

I looked up the wrong one and didnʼt think of the language article. 
Thanks for the link.

Iʼm now coming back to another quotation of yours, to spin off yet another thread 
on a topic about which I urgently need to gather more information:

On Fri, 30 Dec 2016 22:17:12 +0000, Richard Wordingham wrote:
> 
> On Fri, 30 Dec 2016 20:13:41 +0100 (CET) Marcel Schneider wrote: 
> >
> > > U+2E31 WORD SEPARATOR MIDDLE DOT 
> > > U+30FB KATAKANA MIDDLE DOT 
> >
> > These seem to me identical to U+00B7 and U+2022 respectively. Perhaps 
> > weʼre here faced with two examples of what Asmus referred to as 
> > “incorrectly encoded more than once” (talking of “Many other "simple" 
> > marks: lines, circles, triangles, hooks, and squares, or groups of 
> > them”). 
> 
> I was talking about what "fuels the misperception that Unicode somehow 
> encodes symbols based on a single conventional usage". 

I persist in believing that particular scripts like Avestan and Samaritan Aramaic 
can require special characters like the WORD SEPARATOR MIDDLE DOT. A concern about 
fueling a misperception of Unicode character encoding could hardly have driven the 
UTC to reject it (for version 5.2). The KATAKANA MIDDLE DOT in turn has been part 
of the standard since the beginning, like the BULLET. I imagine that a generic 
bullet may not be suitable for Katakana.
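
To make the comparison concrete, here is a small sketch that prints what the 
Unicode Character Database says about the four dots, using Pythonʼs standard 
unicodedata module. The helper names describe and survives_nfkc are mine, for 
illustration only. The point is that none of the four has a decomposition, so 
normalization never folds one into another, whatever the glyphs look like.

```python
import unicodedata

# The four characters compared in this thread.
DOTS = ["\u00B7", "\u2022", "\u2E31", "\u30FB"]

def describe(ch):
    """Return (code point, name, general category, East Asian width)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
    )

def survives_nfkc(ch):
    """True if compatibility normalization leaves the character unchanged."""
    return unicodedata.normalize("NFKC", ch) == ch

if __name__ == "__main__":
    for ch in DOTS:
        print(*describe(ch), survives_nfkc(ch))
```

The East Asian width column is one place where the KATAKANA MIDDLE DOT and the 
BULLET differ in their properties, which may be part of why a generic bullet 
does not substitute well in Japanese text.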

To get an idea of how character encoding works, people wonʼt look at scripts they 
donʼt know. Given that there is a misperception, one way to avoid fueling it could 
be to encourage character re-use. At present this is rather discouraged, as in the 
example of Latin modifier letters that are (basically) preformatted superscripts. 
TUS states that there is no functional difference between those that have the word 
SUPERSCRIPT in their name, and those that donʼt:

TUS 9.0, §7.8, p. 327:
| The superscript forms of the i and n letters can be found in the
| Superscripts and Subscripts block (U+2070..U+209F). The fact that the latter 
| two letters contain the word “superscript” in their names instead of “modifier 
| letter” is an historical artifact of original sources for the characters, and 
| is not intended to convey a functional distinction in the use of these 
| characters in the Unicode Standard.
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762

Probably that is intended to discourage their use as superscripts. 
Superscript digits too are confined to phonetics, and the use of superscript two 
and three in measurement units is merely tolerated, not encouraged:

TUS 9.0, §22.4, p. 786:
| In addition, superscript digits are used to indicate tone in transliteration 
| of many languages. The use of superscript two and superscript three is common 
| legacy practice when referring to units of area and volume in general texts.
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G42931

Consequently, the notation of the acceleration unit 'ms⁻²' doesnʼt seem to be
supported by Unicode. It may, though, be considered a technical notation, which 
would be a reason to allow it.
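
The properties behind the two TUS passages quoted above can be inspected with 
Pythonʼs standard unicodedata module. This is only a sketch; the helper names 
props and fold are mine. It shows that SUPERSCRIPT LATIN SMALL LETTER I and a 
MODIFIER LETTER carry the same general category and the same kind of <super> 
compatibility decomposition, and that the superscripts in 'ms⁻²' disappear under 
compatibility normalization.

```python
import unicodedata

def props(ch):
    """Return (name, general category, decomposition mapping) of a character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )

def fold(text):
    """Apply compatibility normalization (NFKC), which drops <super> styling."""
    return unicodedata.normalize("NFKC", text)

if __name__ == "__main__":
    # Superscript i and n, a modifier letter, then superscript two and minus.
    for ch in ["\u2071", "\u207F", "\u02B0", "\u00B2", "\u207B"]:
        print(f"U+{ord(ch):04X}", *props(ch))
    # The superscript minus folds to U+2212 MINUS SIGN, the superscript two to "2".
    print([f"U+{ord(c):04X}" for c in fold("ms\u207B\u00B2")])
```

So any process that applies NFKC turns the acceleration unit into 'ms' followed 
by MINUS SIGN and DIGIT TWO, which is one practical risk of relying on these 
characters for vertical alignment in plain text.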

These examples are intended to demonstrate the ambiguity of the recommendation 
to use markup and rich text format whenever vertical alignment matters, except 
in phonetics. I suspect that political correctness with respect to non-Latin 
scripts could possibly have biased Unicodeʼs policy, whereas Western Arabic 
digits and Latin letters are probably the only characters to be used extensively 
in super- and subscript position.

As a result, the misperception of Unicode as a one-codepoint-per-usage standard 
is fueled even further, and I can now better understand why our NB intended to 
have French ordinal indicator(s) encoded in Unicode alongside the already existing 
superscript Latin small letter(s). 

But supposing that encoding new French ordinal indicators really is a good idea, 
Iʼm curious about the response of the UTC. However, given that the regular process 
will take two years, would Unicode agree that in the meantime, the modifier 
letters be put in their place on the upcoming keyboard layout?
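
To illustrate what that interim use would look like in plain text, here is a 
sketch that builds French ordinal abbreviations from existing modifier letters. 
The mapping and the function name ordinal are my own assumptions for the sake of 
the example, not the actual design of the keyboard layout.

```python
import unicodedata

# Hypothetical mapping from base letters to existing Latin modifier letters.
MODIFIER = {
    "e": "\u1D49",  # MODIFIER LETTER SMALL E
    "r": "\u02B3",  # MODIFIER LETTER SMALL R
    "d": "\u1D48",  # MODIFIER LETTER SMALL D
    "s": "\u02E2",  # MODIFIER LETTER SMALL S
}

def ordinal(number, suffix):
    """Append the suffix to the number, written with modifier letters."""
    return str(number) + "".join(MODIFIER[c] for c in suffix)

if __name__ == "__main__":
    for n, suffix in [(1, "er"), (1, "re"), (2, "e"), (2, "de")]:
        text = ordinal(n, suffix)
        # NFKC shows the plain-letter fallback that searching would rely on.
        print(text, unicodedata.normalize("NFKC", text))
```

Since each of these modifier letters has a <super> compatibility mapping to its 
base letter, text typed this way still folds to '1er', '2e' and so on under NFKC.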


Marcel


