Missing Latin superscript lowercase letters

Sat Mar 25 12:29:13 CDT 2023

> On 22 March 2023 at 08:53, Asmus Freytag via Unicode <unicode at corp.unicode.org> wrote:

>On 3/21/2023 9:36 PM, Kent Karlsson via Unicode wrote:
>>There is no law of nature (or of [c]omputing) that says that math expressions
>>must be non-plain text. Just because all of neqn/eqn, (La)TeX, MathML, OMML, and indeed
>>UnicodeMath are representations of math expressions that are *not* plain text does not
>>mean that math expressions must be expressed by a higher level protocol. I.e. it could
>>very well be a text level protocol (where the “math controls” are not expressed as
>>printable text, but as control codes).

>Using control sequences or codes for your markup does not make your content plain text.
>The fact remains that mathematical notation is fundamentally recursive when it comes to
>super/subscript: it's not individual letters, but entire expressions that are super/subscripted
>(and at least in theory, they cover the full range of mathematical expressions) and they are
>recursive: they can contain nested super/subscripted expressions. Again, in theory, this recursion
>is not limited, except that for reasons of practicality such recursion has to be realized in ways that 
>the overall expression remains legible.

True, but…

>Therefore, if your goal is mathematical notation, you want an operator that super/subscripts an 
>expression and not code points for single characters. The key takeaway is the natural scoping:
>super/subscripting is applied on the level of a whole expression. That means that your markup
>needs to be scoped and that definitely makes it rich text.

Two cases:

Bidi algorithm. Whether it is based solely on the characters’ bidi properties or also uses bidi controls, bidi handling is intrinsically scoped. I think you still call that plain text. Further, in HTML the bidi control characters aren’t used; instead, most tags take attributes that control the bidi handling. Even though markup is used, I think you still think of it as plain text… Indeed, a math expression is more plain text than a bidi text. From the appearance of the math expression you can derive the structure (ignoring “phantom” expressions, which I included only because MathML has them and they sometimes seem to be practical; and with some reservation for stretch, which should only have an effect on some symbols). For bidi-processed text, on the other hand, you cannot guarantee recovery of the given structure from the displayed text; indeed, I think that is impossible in general. Not so plain…
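To make the “intrinsically scoped” point concrete, here is a minimal sketch in Python. It is not an implementation of the bidi algorithm: it only reads each character’s bidi class from the standard `unicodedata` module and tracks how deeply the isolate controls nest. The helper name `isolate_depths` and the sample string are made up for illustration.

```python
import unicodedata

# Bidi classes of the isolate initiators and of the control that ends an isolate.
OPENERS = {"LRI", "RLI", "FSI"}
CLOSER = "PDI"

def isolate_depths(text):
    """Return the isolate nesting depth at each character (illustrative only)."""
    depth = 0
    depths = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == CLOSER and depth > 0:
            depth -= 1          # the closing control belongs to the outer level
        depths.append(depth)
        if cls in OPENERS:
            depth += 1          # everything after the opener is one level deeper
    return depths

# Latin text, then an RLI scope with Hebrew letters and a nested LRI scope.
sample = "abc \u2067\u05D0\u05D1 \u2066x\u2069\u2069 def"

for ch, d in zip(sample, isolate_depths(sample)):
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch):>3} depth={d}")
```

The controls open and close nested scopes in what is still an ordinary string of code points, which is the sense of “scoped” used above.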

Combining characters. Before Unicode, ECMA-48 defined CSI 1 SP _<text>CSI 2 SP _ for “combining” the characters in <text> into a single displayed character (for certain implementation-defined values of <text>). Unicode “replaced” that (that is probably not what happened historically, but technically it can be seen that way) by having combining characters instead. Without that invention, a hypothetical HTML would have a tag for doing such combinations (since HTML does not like C0/C1 control codes…). And… these combining characters do not work on a single character, but on the combining sequence (a “scope”) that precedes them; and they can indeed be seen as a special kind of control character. You still consider these scoped controls to be plain text.
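As a small illustration of combining marks acting on the preceding sequence rather than on a single code point, here is a sketch using Python’s standard `unicodedata` module. The example strings and the helper name `describe` are chosen only for this demonstration.

```python
import unicodedata

# "a" followed by two combining marks; both attach to the sequence before them.
below_then_above = "a\u0323\u0302"   # a + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT
above_then_below = "a\u0302\u0323"   # same marks, typed in the other order

def describe(text):
    """List (character name, canonical combining class) for each code point."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]

for name, ccc in describe(below_then_above):
    print(f"{name}: ccc={ccc}")

# Normalization treats the whole combining sequence as one unit: the two
# orders are canonically equivalent and compose to a single code point.
nfc_a = unicodedata.normalize("NFC", below_then_above)
nfc_b = unicodedata.normalize("NFC", above_then_below)
print(nfc_a == nfc_b, len(nfc_a))
```

Each mark carries a nonzero combining class, and normalization reorders and composes them over the whole sequence, so the “scope” of a combining mark is more than one character.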

So the fact that the controls have a “scope” that is more than a single character (as in both of the cases above), or are recursive (as in both of the cases above), apparently does not exclude a feature from being regarded as plain text. So I maintain that what is or is not plain text is much in the eye of the beholder (regardless of internal representation), not only w.r.t. scoping and recursiveness, but also w.r.t. the possibility of correctly deriving the structure of a text (which in general is impossible for bidi).

>The existing single characters are (almost) all encoded for use in phonetic notation, which is not 
>recursive and doesn't super/subscript entire expressions. Instead it uses super/subscripting to 
>indicate modification. Hence "modifier letters".

True. And I have said nothing against that point. Indeed, I said that those characters do *not*
belong in a math expression, and should “stay” with (mostly) phonetic notation.

>>Further, if some symbol/letter for some reason only ever occurred in superscript
>>position in math expressions, such examples would still be supporting evidence for
>>that symbol/letter. The closest practical example I can think of is the degree sign, which
>>in origin is a superscript 0.

>The degree sign is either the exception that proves the rule, or something else: a symbol
>that occurs frequently in contexts that are not full mathematical expressions, as it is typical

True, but I was arguing against Peter Constable's postulation that something that (for whatever
reason) occurs only in a superscript position in a math expression could not have its encoding 
supported by an example where it occurred in a superscript position in a math expression.
THAT postulation is false. (And the closest example I could think of was the degree sign; there
MAY be examples of yet unencoded characters that only occur in superscript position in math
expressions.)

>for unit symbols. When used with temperature, it's interesting to note that not all temperature
>scales use it consistently. You don't see it with Fahrenheit very often, for example, reflecting
>differences in traditional keyboard layouts.

Ok, let’s digress a bit… I do see that too, in news articles (in web apps) from US and British news
companies, and I also see “C” when degrees Celsius is meant. But writing farad (F) or coulomb (C)
when referring to temperature is just horrible, and only embarrassing for the journalist who wrote
that. (Another related horror is “kph”, and there you cannot even blame keyboard layouts.)

>Note that many unit symbols have one-off encodings that Unicode had to support via compatibility
>characters or even canonical duplicates (think micro and Ohm vs. their Greek letter counterparts).
>Without the need to support a transition from pre-existing character sets, these duplicates would
>not exist. But they do

Yes. (But not relevant to this discussion.)

>and so does the degree sign.

The degree sign is not a compatibility character. It “divorced” from superscript 0 looong before
computers…
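The difference in status can be checked against the Unicode Character Database. The sketch below uses Python’s standard `unicodedata` module to print the decomposition mapping of the degree sign next to superscript zero, the micro sign and the Ohm sign. The helper name `decomp` is made up for illustration, and the output reflects whatever UCD version ships with the Python build in use.

```python
import unicodedata

def decomp(ch):
    """Return the decomposition mapping of ch ('' if it has none)."""
    return unicodedata.decomposition(ch)

samples = {
    "\u00B0": "degree sign",
    "\u2070": "superscript zero",
    "\u00B5": "micro sign",
    "\u2126": "Ohm sign",
}

for ch, label in samples.items():
    mapping = decomp(ch) or "(none)"
    print(f"U+{ord(ch):04X} {label}: {mapping}")
```

The degree sign has no decomposition at all, while superscript zero has a `<super>` compatibility mapping to the digit 0, the micro sign a `<compat>` mapping to Greek mu, and the Ohm sign a canonical mapping to Greek capital omega.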

>Neither of them, however, form precedents for non-compatibility characters.

Not sure what that sentence means, since the premise is skewed.

>A./
