A sign/abbreviation for "magister"

Philippe Verdy via Unicode unicode at unicode.org
Fri Nov 2 11:45:58 CDT 2018


On Fri, 2 Nov 2018 at 16:20, Marcel Schneider via Unicode <
unicode at unicode.org> wrote:

> That seems to me a regression, after the front has moved in favor of
> recognizing Latin script needs preformatted superscript. The use case is
> clear, as we have ª, º, and n° with degree sign, and so on as already
> detailed in long e-mails in this thread and elsewhere. There is no point
> in setting up or maintaining a Unicode policy stating otherwise, as such
> a policy would be inconsistent with longlasting and extremely widespread
> practice.
>

Variation selectors are only appropriate for the existing (preencoded)
superscript letters ª and º, so that they display the appropriate glyph
(underlined or not underlined). They are not a solution for marking
arbitrary letters to be rendered as superscripts. Notably, the base letter
to be superscripted may carry its own combining diacritics, which must be
encoded explicitly, and if you use a variation selector it should also
allow variation in the presence or absence of the underline (which must
then be encoded explicitly as a combining character).

So what we finally get with variation selectors is:
  <baseline letter, variation selector, combining diacritic> and
  <baseline letter precombined with the diacritic, variation selector>
which are NOT canonically equivalent.

Using a combining character avoids this problem:
  <baseline letter, combining diacritic, combining abbreviation mark> and
  <baseline letter precombined with the diacritic, combining abbreviation
mark> which ARE canonically equivalent.
It also states the semantics explicitly. That is lost if we are forced to
use presentational superscripts in a higher-level protocol such as HTML/CSS
for rich text, and one then just extracts the plain text. Collation will
not help at all, unless collators are built with preprocessing that first
infers the presence of a <combining abbreviation mark> to insert after each
combining sequence of the plain text enclosed in an italic style.
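
The difference between the two pairs of sequences above can be checked with
any normalizer. Here is a minimal Python sketch, under two assumptions: it
uses U+FE00 VARIATION SELECTOR-1 as the variation selector, and, since no
<combining abbreviation mark> is encoded, it uses U+034F COMBINING GRAPHEME
JOINER purely as a stand-in for it (a combining mark with combining class 0,
which are the properties assumed for the proposed mark).

```python
import unicodedata

VS1 = "\uFE00"    # VARIATION SELECTOR-1 (combining class 0)
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT (combining class 230)
# Stand-in only: the proposed <combining abbreviation mark> is not encoded.
# U+034F COMBINING GRAPHEME JOINER has the assumed properties (Mn, ccc=0).
ABBR_MARK = "\u034F"


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Variation selector: it must directly follow the base letter it modifies,
# so it ends up before the accent in one form and after it in the other.
vs_decomposed = "e" + VS1 + ACUTE
vs_precomposed = "\u00E9" + VS1

# Abbreviation mark: it follows the whole combining sequence in both forms.
mark_decomposed = "e" + ACUTE + ABBR_MARK
mark_precomposed = "\u00E9" + ABBR_MARK

print(canonically_equivalent(vs_decomposed, vs_precomposed))      # False
print(canonically_equivalent(mark_decomposed, mark_precomposed))  # True
```

The stand-in only illustrates the ordering argument; it says nothing about
how a real abbreviation mark would be rendered.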

There's little risk: if the <combining abbreviation mark> is not mapped in
fonts (or not recognized by text renderers, which could synthesize
superscripts from existing recognized clusters), it will render as a
visible .notdef (tofu). But text renderers normally know the basic
properties of characters in the UCD and can see that <combining
abbreviation mark> has a combining-mark general category (and a combining
class of 0, so canonical equivalences are not broken). They can then render
a better symbol than the .notdef "tofu", namely a dotted circle. Even if
this tofu or dotted circle is rendered, it still explicitly marks the
presence of the abbreviation mark, so there's less confusion about what
precedes it (the combining sequence that was supposed to be superscripted).

The <combining abbreviation mark> can also take its own <variation
selector> to select other styles where they are optional, such as adding an
underline to the superscripted letter, rendering the letter as a subscript
instead, or rendering it as a small baseline letter with a dot after it.
It is still an explicit abbreviation mark, and the meaning of the plain
text is still preserved. A variation selector is only suitable for altering
the rendering of a cluster that really has several variants and whose
default rendering is not universal, notably across font styles initially
designed for specific markets with their own local preferences. The
variation selector then lets the same font map all known variants
distinctly, independently of the initial arbitrary choice of the default
glyph used when the variation selector is missing.

Even if fonts (or text renderers) map the <combining abbreviation mark> to
varying glyphs, this is purely stylistic: the semantics of the plain text
are not lost, because the <combining abbreviation mark> is still there.
No rich text is needed to encode it. Rich-text styles do not explicitly
encode that a superscript is actually an abbreviation mark, so they cannot
allow variations such as rendering a subscript, or a small baseline glyph
with an added dot. Typically a <combining abbreviation mark> used in an
English style would render the letter (or cluster) before it as a "small"
letter without any added dot.

So I really think that <combining abbreviation mark> is far better than:
* using preencoded superscript letters (they don't cover the whole
repertoire of clusters where the abbreviation is needed: for now just
Basic Latin, the ten digits, plus and minus signs, the dot or comma, plus a
few other letters like stops; it's impossible to re-encode the full Unicode
repertoire and its allowed combining sequences or extended default grapheme
clusters!),
* or using variation selectors to make them appear as superscripts (this
does not work with all clusters containing other diacritics like accents),
* or using rich-text styling, from which you cannot safely infer any
semantics. There is no guarantee that M<sup>r</sup> in HTML is actually an
abbreviation of "Mister"; in HTML that is encoded elsewhere, as <abbr
title="Mister">M<sup>r</sup></abbr> or <abbr>M<sup>r</sup></abbr>. The
abbreviation semantics have to be looked up in a possible <abbr> container
element, and the meaning of the abbreviation in its title attribute, so
this obviously requires complex preprocessing before we can infer a
plain-text version <M,r,combining abbreviation mark>. That version is
suitable, for example, for plain-text searches where you don't want to
match a mathematical object M, like a matrix, raised to the power r, or a
single plain-text M followed by a footnote call noted by the letter "r".
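
To give an idea of the preprocessing this involves, here is a rough Python
sketch that extracts plain text from HTML and inserts a mark after each
superscripted cluster, but only when the <sup> sits inside an <abbr>. The
mark itself is an assumption: U+E000 (a private-use code point) stands in
for the unencoded <combining abbreviation mark>, and the names
AbbrTextExtractor, mark_clusters and to_plain_text are invented for this
example. Cluster detection is simplified to "a character followed by any
characters whose general category is a mark".

```python
import unicodedata
from html.parser import HTMLParser

# Hypothetical stand-in for the proposed <combining abbreviation mark>.
ABBR_MARK = "\uE000"


def mark_clusters(text: str) -> str:
    """Append ABBR_MARK after each (simplified) combining sequence."""
    out = []
    for ch in text:
        # A non-mark character starts a new cluster: close the previous one.
        if out and not unicodedata.category(ch).startswith("M"):
            out.append(ABBR_MARK)
        out.append(ch)
    if out:
        out.append(ABBR_MARK)
    return "".join(out)


class AbbrTextExtractor(HTMLParser):
    """Collect plain text, marking <sup> content found inside an <abbr>."""

    def __init__(self):
        super().__init__()
        self.abbr_depth = 0
        self.sup_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            self.abbr_depth += 1
        elif tag == "sup":
            self.sup_depth += 1

    def handle_endtag(self, tag):
        if tag == "abbr":
            self.abbr_depth = max(0, self.abbr_depth - 1)
        elif tag == "sup":
            self.sup_depth = max(0, self.sup_depth - 1)

    def handle_data(self, data):
        if self.abbr_depth and self.sup_depth:
            data = mark_clusters(data)
        self.parts.append(data)


def to_plain_text(html: str) -> str:
    parser = AbbrTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Only the first form carries enough information to infer the mark.
print(ascii(to_plain_text('<abbr title="Mister">M<sup>r</sup></abbr>')))
print(ascii(to_plain_text("M<sup>r</sup>")))
```

A bare M<sup>r</sup> comes out as plain "Mr", which is exactly the ambiguity
described above: nothing says whether it was an abbreviation, a power, or a
footnote call.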

It solves all the practical problems. The legacy encoding using preencoded
superscript Latin letters (aka "modifier letters") should never have been
used or needed, not even for IPA usage, which could have used an explicit
<combining IPA symbol mark> for its superscripted symbols, or for its
distinctive "a" and "g". We should not have needed to encode the variants
of "a" and "g": these were old hacks that broke the Unicode character
encoding model from the beginning. Only roundtrip compatibility with
legacy non-UCS charsets argued for keeping the feminine and masculine
ordinal indicators, or the "Numero" cluster. The latter is actually made of
two letters, the second one followed by an implicit abbreviation mark, but
it was transformed in the legacy charset into a single unbreakable cluster
containing only one symbol. Even Unicode considers the abbreviated Numero
to be only "compatibility equivalent" to the letter N followed by the
masculine ordinal indicator, the latter itself being only "compatibility
equivalent" to a letter o with an implicit superscript (and an optional
combining underline).
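
These compatibility mappings can be inspected directly in the UCD data
shipped with Python. The short sketch below prints the decomposition field
of U+00AA, U+00BA and U+2116 and compares NFKD forms; the helper name
compat_equivalent is made up for this example. Note that the UCD maps
U+2116 straight to "N" + "o", so the equivalence with N followed by º only
shows up once both sides are fully decomposed.

```python
import unicodedata


def compat_equivalent(a: str, b: str) -> bool:
    """Compatibility equivalence: the NFKD forms of both strings match."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


for ch in ("\u00AA", "\u00BA", "\u2116"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "|",
        unicodedata.decomposition(ch),
        "|",
        unicodedata.normalize("NFKD", ch),
    )

# Numero sign versus N followed by the masculine ordinal indicator.
print(compat_equivalent("\u2116", "N\u00BA"))
# None of these has a canonical decomposition: NFD leaves them unchanged.
print(unicodedata.normalize("NFD", "\u2116") == "\u2116")
```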

All these superscripts in Unicode (as well as the mathematical "styled"
letters, which were also completely unnecessary and will necessarily be
incomplete for their intended usage) should now be treated only as legacy
practice. They should be deprecated in favor of the more semantic and
logical character encoding model, deprecating the legacy visual encoding
completely.

Only precombined characters recognized by canonical equivalences are part
of the standard and may be kept as "non"-legacy: they still fit in the
logical encoding. Likewise, the extended default grapheme clusters include
the precomposed Hangul LVT and LV syllables, CGJ used before combining
marks with a non-zero combining class, and variation selectors used only
after base letters that have combining class zero and start the extended
default grapheme clusters.

Let's return to the root: the far better logical encoding, which remains
the recommended practice. All the rest is legacy. Some of it came from
decisions taken to preserve roundtrip compatibility with legacy charsets,
including the prepended letters in Thai, which is why we have a few
compatibility characters (not the recommended practice). The rest came
from bad decisions made by Unicode and the ISO WG that broke the logical
character encoding model.