A sign/abbreviation for "magister"

Marcel Schneider via Unicode unicode at unicode.org
Fri Nov 2 13:52:05 CDT 2018


On 02/11/2018 17:45, Philippe Verdy via Unicode wrote:
[quoted mail]
> 
> Using variation selectors is only appropriate for these existing 
> (preencoded) superscript letters ª and º so that they display the 
> appropriate (underlined or not underlined) glyph.

And it is also appropriate for forcing the display of DIGIT ZERO with
a short stroke:
0030 FE00; short diagonal stroke form; # DIGIT ZERO
https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt

Given that, it becomes unclear why the same is not applied to the 4, 7,
z and Z mentioned in this thread, to display them open or with a short
bar.
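
To make the mechanism concrete, here is a minimal Python sketch (my own
illustration, not from the standard) that builds the standardized
variation sequence <0030, FE00> and lists its code points; a renderer
that does not support the sequence simply ignores the selector and
shows a plain zero:

    import unicodedata

    # DIGIT ZERO + VARIATION SELECTOR-1: standardized variation sequence
    # requesting the short-diagonal-stroke glyph (StandardizedVariants.txt).
    zero_short_stroke = "\u0030\uFE00"

    for ch in zero_short_stroke:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")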

> It is not a solution for creating superscripts on arbitrary letters
> and marking that they should be rendered as superscript (notably, the
> base letter to transform into a superscript may also have its own
> combining diacritics, which must be encoded explicitly, and if you
> use the variation selector, it should allow variation on the presence
> or absence of the underline, which must then be encoded explicitly as
> a combining character).

I totally agree that abbreviation-indicating superscripts should not be
encoded using variation selectors; as already stated, that is not my
preferred solution.
> 
> So finally what we get with variation selectors is: <baseline letter,
> variation selector, combining diacritic> and <baseline letter
> precombined with the diacritic, variation selector>, which are NOT
> canonically equivalent.

That seems to me like a flaw in canonical equivalence. Variations must
be canonically equivalent, and the variation selector position should
be handled or parsed accordingly. Personally, I was unaware of this
rule.
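
A quick check with Python's unicodedata (my own illustration; U+20DD
COMBINING ENCLOSING CIRCLE merely stands in for the proposed, not yet
encoded <combining abbreviation mark>) shows both behaviours:

    import unicodedata
    nfc = unicodedata.normalize

    # Variation selector *between* base letter and diacritic blocks
    # canonical composition, so the two spellings stay distinct:
    a = "e\uFE00\u0301"   # e + VS1 + COMBINING ACUTE ACCENT
    b = "\u00E9\uFE00"    # precomposed é + VS1
    print(nfc("NFC", a) == nfc("NFC", b))   # False

    # A combining mark appended *after* the whole sequence does not
    # block composition, so both spellings normalize identically:
    c = "e\u0301\u20DD"   # e + acute + stand-in mark
    d = "\u00E9\u20DD"    # precomposed é + stand-in mark
    print(nfc("NFC", c) == nfc("NFC", d))   # True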
> 
> Using a combining character avoids this caveat: <baseline letter,
> combining diacritic, combining abbreviation mark> and <baseline
> letter precombined with the diacritic, combining abbreviation mark>
> ARE canonically equivalent. And this explicitly states the semantics
> (something that is lost if we are forced to use presentational
> superscripts in a higher-level protocol like HTML/CSS for rich-text
> format and one just extracts the plain text; using collation will
> not help at all, except if collators are built with preprocessing
> that first infers the presence of a <combining abbreviation mark>
> to insert after each combining sequence of the plain text enclosed
> in an italic style).

That exactly outlines my concern with calls for relegating superscript
as an abbreviation indicator to higher level protocols like HTML/CSS.
> 
> There's little risk: if the <combining abbreviation mark> is not
> mapped in fonts (or not recognized by text renderers to create
> synthetic superscripts from existing recognized clusters), it will
> render as a visible .notdef (tofu). But normally text renderers
> recognize the basic properties of characters in the UCD and can see
> that <combining abbreviation mark> has a combining mark general
> category (they also know that it has combining class 0, so canonical
> equivalences are not broken) and render a better symbol than the
> .notdef "tofu": they should rather render a dotted circle. Even if
> this tofu or dotted circle is rendered, it still explicitly marks
> the presence of the abbreviation mark, so there's less confusion
> about what precedes it (the combining sequence that was supposed to
> be superscripted).

The problem with the <combining abbreviation mark> you are proposing
is that it works against streamlined implementation as well as easy
input of current abbreviations like ordinal indicators in French and,
optionally, in English. Preformatted superscripts are already widely
implemented, and coding "4ᵉ" needs only two characters, input with
only three fingers in two steps (thumb on AltGr, press key E04, then
E12) with an appropriately programmed layout driver. I’m afraid that
the solution with <combining abbreviation mark> would be much less
straightforward.
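
For what it’s worth, a tiny Python sketch (my own illustration) of that
two-character spelling, and of the fact that it behaves as ordinary
plain text in searches:

    # DIGIT FOUR + U+1D49 MODIFIER LETTER SMALL E: the whole ordinal
    # "4ᵉ" is just two code points, no markup involved.
    ordinal = "4\u1D49"
    print([f"U+{ord(ch):04X}" for ch in ordinal])   # ['U+0034', 'U+1D49']
    print(ordinal in "le 4\u1D49 arrondissement")   # plain-text search: True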
> 
> The <combining abbreviation mark> can also have its own <variation
> selector> to select other styles when they are optional, such as
> adding underlines to the superscripted letter, or rendering the
> letter instead as an underscript, or as a small baseline letter with
> a dot after it: this is still an explicit abbreviation mark, and the
> meaning of the plain text is still preserved. The variation selector
> is only suitable to alter the rendering of a cluster when it has
> effectively several variants and the default rendering is not
> universal, notably across font styles initially designed for specific
> markets with their own local preferences: the variation selector
> still allows the same fonts to map all known variants distinctly,
> independently of the initial arbitrary choice of the default glyph
> used when the variation selector is missing.

I don’t think German users would welcome being directed to input a
<combining abbreviation mark> plus a <variation selector> instead of
a period.
> 
> Even if fonts (or text renderers) may map the <combining abbreviation
> mark> to variable glyphs, this is purely stylistic; the semantics of
> the plain text are not lost because the <combining abbreviation mark>
> is still there. There's no need of any rich text to encode it (the
> rich-text styles do not explicitly encode that a superscript is
> actually an abbreviation mark, so they cannot allow variations like
> rendering an underscript, or a baseline small glyph with an added
> dot). Typically a <combining abbreviation mark> used in an English
> style would render the letter (or cluster) before it as a "small"
> letter without any added dot.

The advantage of preformatted superscripts is that the English user
can decide whether he or she wishes the ordinal indicators to be
baseline or superscript, while being sure of stable rendering.
> 
> So I really think that <combining abbreviation mark> is far better 
> than:
>
> * using preencoded superscript letters (they don't cover all the
> necessary repertoire of clusters where the abbreviation is needed;
> the coverage is just basic Latin letters, the ten digits, the plus
> and minus signs, and the dot or comma, plus a few other letters like
> stops;

As seen in this thread, preformatted superscripts are standardized and
implemented so as to take combining diacritics, e.g. "ᵃ́", "ᵉ́". Encoding
any more precomposed letters that can be represented as combining
sequences is out of scope, and that is the reason why no accented
letters will ever be encoded as preformatted superscripts. Correct
display of the "Sᵗᵉ́" abbreviation for French "Société" ("Company") is
already working in browsers, depending on the fonts installed on the
machine and selected in the settings, unless a suitable webfont is
downloaded and installed ad hoc.
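
In code-point terms (a minimal Python sketch of my own), that
abbreviation is an ordinary combining sequence:

    import unicodedata

    # "Sᵗᵉ́": S + MODIFIER LETTER SMALL T + MODIFIER LETTER SMALL E
    # + COMBINING ACUTE ACCENT -- the acute simply stacks on the
    # preformatted superscript e.
    ste = "S\u1D57\u1D49\u0301"
    for ch in ste:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")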

> it's impossible to re-encode the full Unicode repertoire and its
> allowed combining sequences or extended default grapheme clusters!),

This persistent and passionate refrain boils down, as already pointed
out by others and by me in this thread, to a continuum-bias strawman
fight (i.e. the refrain is repeated to fight a strawman constructed
using the continuum bias, which consists in using a continuum to move
someone’s position to an extreme position that is ultimately
off-topic).
>
> * or using variation selectors to make them appear as a superscript
> (does not work with all clusters containing other diacritics like
> accents),
>
> * or using rich-text styling, from which you cannot safely infer any
> semantics (there is no guarantee that M<sup>r</sup> in HTML is
> actually an abbreviation of "Mister"; in HTML this is encoded
> elsewhere as <abbr title="Mister">M<sup>r</sup></abbr> or
> <abbr>M<sup>r</sup></abbr>; the semantics of the abbreviation have to
> be looked up in a possible <abbr> container element and the meaning
> of the abbreviation found inside its title attribute, so obviously
> this requires complex preprocessing before we can infer a plain-text
> version <M, r, combining abbreviation mark> (suitable for example in
> plain-text searches where you don't want to match a mathematical
> object M, like a matrix, elevated to the power r, or a single
> plain-text M followed by a footnote call noted by the letter "r")).

Indeed HTML is a powerful language for providing rich and meaningful
content with many features, so that in comparison, plain text could
seem unreadable because it contains all those abbreviations and
symbols you need to know. By contrast, plain text in any natural
language is meant to contain just enough information that it is
readable for a native reader, and that is the purpose of Unicode.

Therefore, relegating superscript abbreviation indicators to
higher-level protocols is like looking at a language from outside and
saying: “These are abbreviations anyway, so you probably should also
add tooltips for people to learn the meaning.”
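
To illustrate the extraction problem, here is a minimal Python sketch
of my own, using only the standard library:

    from html.parser import HTMLParser

    # Stripping the markup from <abbr title="Mister">M<sup>r</sup></abbr>
    # leaves a bare "Mr": nothing in the plain text says that the r was
    # a superscripted abbreviation mark.
    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.parts = []
        def handle_data(self, data):
            self.parts.append(data)

    p = TextExtractor()
    p.feed('<abbr title="Mister">M<sup>r</sup></abbr>')
    print("".join(p.parts))   # prints: Mr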
> 
> It solves all practical problems: legacy encoding using the 
> preencoded superscript Latin letters (aka "modifier letters") should
> have never been used or needed (not even for IPA usage which could 
> have used an explicit <combining IPA symbol mark> for its 
> superscripted symbols, or for its distinctive "a" and "g"). We should
> not have needed to encode the variants for "a" and "g": these were
> old hacks that broke the Unicode character encoding model since the
> beginning.

The principle of Unicode is to encode anything that is semantically
distinctive in plain text, so encoding IPA letters is totally OK.

> However, only round-trip compatibility with legacy non-UCS charsets
> militated for keeping the feminine or masculine ordinal mark, or the
> "Numero" cluster (actually made of two letters, the second one
> followed by an implicit abbreviation mark, but transformed in the
> legacy charset to be treated as a single unbreakable cluster
> containing only one symbol); even Unicode considers the abbreviated
> Numero as being only "compatibility equivalent" to the letter N
> followed by the masculine ordinal symbol, the latter being itself
> only "compatibility equivalent" to a letter o with an implicit
> superscript, but also with an optional combining underline.

These pre-Unicode charsets are proof that superscripts are required.
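
The compatibility mappings the quoted mail refers to can be read
straight out of the UCD, e.g. with Python (my own illustration):

    import unicodedata

    # NUMERO SIGN and the ordinal indicators carry only compatibility
    # (not canonical) decompositions:
    print(unicodedata.decomposition("\u2116"))  # '<compat> 004E 006F' (№ -> N o)
    print(unicodedata.decomposition("\u00BA"))  # '<super> 006F'       (º -> o)
    print(unicodedata.decomposition("\u00AA"))  # '<super> 0061'       (ª -> a)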
> 
> All these superscripts in Unicode (as well as the Mathematical
> "styled" letters, which were also completely unnecessary and will
> necessarily be incomplete for the intended usage) are now to be
> treated only as legacy practices; they should be deprecated in favor
> of the more semantic and logical character encoding model, completely
> deprecating the legacy visual encoding.

Mathematicians like them, and even without being a mathematician, I
feel that there are really a lot of styled “alphabets to choose from”,
as Ken Whistler advised on this list in 2015. What uncovered usages
are you referring to?
> 
> Only precombined characters recognized by canonical equivalences are
> part of the standard and may be kept as "non"-legacy: they still fit
> in the logical encoding. Likewise, the extended default grapheme
> clusters include the precomposed Hangul LVT and LV syllables, CGJ
> used before combining marks with non-zero combining class, and
> variation selectors used only after base letters with combining
> class zero that start the extended default grapheme clusters.
> 
> Let's return to the root of the far better logical encoding which
> remains the recommended practice. All the rest is legacy (some of it
> came from decisions taken to preserve round-trip compatibility with
> legacy charsets, including prepended letters in Thai, and so we have
> a few compatibility characters, which are not the recommended
> practice; but the rest was bad decisions made by Unicode and the ISO
> WG that broke the logical character encoding model).

That criticism only applies to presentation forms, which Unicode was
forced to take in at the outset, and whose use Unicode has always
discouraged, as also seen in this thread.

So all languages using superscript to indicate abbreviations are
still better served with preformatted superscript letters.

The new turn is that many languages, e.g. Italian, Polish, Portuguese
and Spanish, need variation sequences for single or double
underscoring, which will work with OpenType fonts having the
appropriate glyph sets, while the variation selector remains ignorable
for most other machine-processing purposes.

Best regards,

Marcel

