Encoding <combining abbreviation mark>

Marcel Schneider via Unicode unicode at unicode.org
Mon Nov 5 17:12:33 CST 2018


On 04/11/2018 20:19, Philippe Verdy via Unicode wrote:
[…]
> Even the mere fallback to render the <combining abbreviation mark> as
> a dotted circle (total absence of support) will not block completely
> reading the abbreviation:
> 
> * you'll see "2e◌" (which is still better than only "2e", with
> minimal impact) instead of
> 
> * "2◌" (which is worse ! this is still what already happens when you
> use the legacy encoded <superscript e> which is also semantically
> ambiguous for text processing), or
> 
> * "2e." (which is acceptable for rendering but ambiguous semantically
> for text processing)

I’m afraid the dotted circle instead of the .notdef box would be confusing.
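
For concreteness, here is a minimal Python sketch of the fallback
comparison quoted above. Both the code point standing in for the
proposed <combining abbreviation mark> (it is not encoded, so a Private
Use code point is used here) and the toy renderer model are assumptions
of mine, not anything specified by Unicode:

    import unicodedata

    CAM = "\uE000"   # hypothetical stand-in for <combining abbreviation mark>

    def render_fallback(text, font_repertoire):
        """Toy renderer: an unsupported combining mark (or the proposed
        mark) falls back to a dotted circle after its base letter; an
        unsupported spacing character falls back to a .notdef box."""
        out = []
        for ch in text:
            if ch in font_repertoire:
                out.append(ch)
            elif ch == CAM or unicodedata.combining(ch):
                out.append("\u25CC")   # dotted circle, base letter stays readable
            else:
                out.append("\uFFFD")   # .notdef box, the letter itself is lost
        return "".join(out)

    repertoire = set("0123456789abcdefghijklmnopqrstuvwxyz.")
    print(render_fallback("2e" + CAM, repertoire))   # -> 2e◌
    print(render_fallback("2\u1D49", repertoire))    # -> 2� (superscript e lost)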

> 
> So compare things fairly: the solution I propose is EVEN MORE
> INTEROPERABLE than using <superscript Latin letters> (which also
> makes it impossible to note all abbreviations, as it is limited to
> just a few letters, and most of the time limited to only the few
> lowercase IPA symbols). It puts an end to the pressure to encode
> superscript letters.

Actually, the encoded set already encompasses all Latin lowercase base
letters except q.
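
That coverage claim can be checked mechanically. The rough sketch below
(Python) scans the compatibility decompositions in the UCD shipped with
the interpreter; with the Unicode version current at the time of this
thread it reports q as the only gap (later versions of the standard
added a MODIFIER LETTER SMALL Q, so newer builds may report none):

    import sys, unicodedata

    covered = set()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if decomp.startswith("<super> "):
            base = chr(int(decomp.split()[1], 16))
            if "a" <= base <= "z":
                covered.add(base)

    # letters with no encoded superscript (modifier-letter) form
    print(sorted(set("abcdefghijklmnopqrstuvwxyz") - covered))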

As for putting an end to that pressure, that can also be achieved by
encoding the missing ones once and for all. As already stated, until
the opposite is posted authoritatively to this List, the Latin script
is deemed the only one making extensive use of superscripts to denote
abbreviations, owing to a strong and long-lasting medieval practice
that acted as a template for a few natural languages, namely those
enumerated so far, among them Polish.

> 
> If you want to support other notations (e.g. in chemical or
> mathematical notations, where both superscript and subscript must be
> present and stack together, and where variation using a dot or
> similar is allowed) you need another encoding, and the existing
> legacy <superscript Latin letters> are not suitable either.

I’m not lobbying for supporting mathematics with more superscripts,
but UnicodeMath would certainly be able to use them once the set is
complete. What I did for chemical notation was to point out that
chemistry seems to be disfavored compared to mathematics, because
instead of peculiar subscripts it uses subscript Greek small letters.
Three of them, as has been reported on this List. They are being
refused because they are letters of a script. If they were fancy
symbols, they would be encoded, as alchemical symbols and mathematical
symbols are.

Further, on 04/11/2018 20:51, Philippe Verdy via Unicode wrote:
[…]
> Once again you need something else for these technical notations, but
> NOT the proposed <combining abbreviation mark>, and NOT EVEN the
> existing "modifier letters" <superscript letter X>, which were in
> fact first introduced only for IPA […]
> […] these letters are NOT conveying any semantic of an abbreviation,
> and this is also NOT the case for their usage as IPA symbols).

They do convey that semantics when used in a natural language that
gives superscripting the semantics of an abbreviation.

Unicode does not encode semantics, as TUS specifies.

> 
> There's NO interoperability at all when **abusively** taking the
> existing "modifier letters" <superscript letter X> or <superscript
> digit> for use in abbreviations […].

The interoperability I mean is interoperability between formats and
environments. Interoperable in that sense is whatever belongs to the
plain-text backbone.

> Keep these "modifier letters" or <superscript digit> or <superscript
> punctuation> for use as plain letters or plain digits or plain
> punctuation or plain symbols (including IPA) in natural languages.

That is what I’m suggesting to do: superscript letters are plain
abbreviation indicators used in natural languages, notably ordinal
indicators and indicators in other abbreviations.

> Anything else is abusive and should be considered only as "legacy"
> encoding, not recommended at all in natural languages.

Put "traditional" in the place of "legacy", and you will come close
to what is actually going on when coding palaeographic texts is
achieved using purposely encoded Latin superscripts. The same
applies to living languages, because it is interoperable and fits
therefore Unicode quality standards about digitally representing
the world’s languages.

Finally, on 04/11/2018 21:59, Philippe Verdy via Unicode wrote:
>
> I can take another example of what I call "legacy encoding" (which
> really means that such encoding is just an "approximation" from which
> no semantics can be clearly inferred, except by using a
> non-deterministic heuristic, which can frequently make "false
> guesses").
> 
> Consider the case of the legacy Hangul "half-width" jamos: […]
>
> The same can be said about the heuristics that attempt to infer an
> abbreviation semantic from existing superscript letters (either
> encoded in Unicode, or encoded as plain letters modified by
> superscripting style in CSS or HTML, or in word processors for
> example): it fails to give the correct guess most of the time if
> there's no user to confirm the actual intended meaning

I don’t agree: as opposed to baseline fallbacks, Unicode superscripts
allow the reader to parse the string as an abbreviation, and machines
can be programmed to do likewise.
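
As a rough illustration of that programmability (a sketch, not a
production-grade heuristic), the <super> compatibility decomposition is
already enough for software to tell an encoded superscript from a plain
baseline letter:

    import unicodedata

    def has_encoded_superscript(token):
        """True if the token contains a character whose compatibility
        decomposition is tagged <super>, e.g. U+1D49 or U+02B3."""
        return any(unicodedata.decomposition(ch).startswith("<super>")
                   for ch in token)

    for token in ("2\u1D49", "M\u02B3", "2e", "Mr."):
        print(token, has_encoded_superscript(token))
    # 2ᵉ True / Mʳ True / 2e False / Mr. False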

> 
> Such confirmation is the job of spell correctors in word processors:
> […] the user may type "Mr." then the wavy line will appear under
> these 3 characters, the spell checker will propose to encode it as an
> abbreviation "Mr<combinining abbrevitation mark>" or leave "Mr."
> unchanged (and no longer signaled) in which case the dot remains a
> regular punctuation, and the "r" is not modified. Then the user may
> choose to style the "r" with superscripting or underlining, and a new
> wavy red underline will appear below the three characters "M<styled
> r>.", proposing to only transform the <styled r> as <superscript r>
> or <r,combining underline> and even when the user accepts one of
> these suggestions it will remain "M<superscript r>." or
> "M<r,combining underline>." where it is still possible to infer the
> semantics of an abbreviation (propose to replace or keep the dot
> after it), or doing nothing else and cancel these suggestions (to
> hide the wavy red underline hint, added by the spell checker), or
> instruct the spell checker that the meaning of the superscript r is
> that of a mathematical exponent, or a chemical notation.

That mainly illustrates why <combining abbreviation mark> is not
interoperable: the input process seems to be too complicated. And if a
base letter is to be transformed into a formatted superscript, you do
need OpenType, much like U+2044 FRACTION SLASH behaving as intended,
i.e. transforming the preceding digit string into formatted numerator
digits and the following one into denominator digit glyphs. In that,
U+2044 acts as a format control, and so would the <combining
abbreviation mark> that you are suggesting to encode.
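
To make the analogy concrete, here is a small Python sketch of the
plain-text side of U+2044: the slash alone carries the fraction
semantics for any scanner, while the numerator/denominator styling is
left to the font (the OpenType 'frac', 'numr' and 'dnom' features),
just as the proposed mark would carry the abbreviation semantics and
leave superscripting to the renderer:

    import re
    from fractions import Fraction

    # digit run, FRACTION SLASH, digit run
    PATTERN = re.compile("(\\d+)\u2044(\\d+)")

    def fractions_in(text):
        """Recover the fraction values signalled by U+2044 in plain text."""
        return [Fraction(int(n), int(d)) for n, d in PATTERN.findall(text)]

    print(fractions_in("3\u20444 cup plus 1\u20442 cup"))
    # [Fraction(3, 4), Fraction(1, 2)]
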
> 
> In all cases, the user/author has full control of the intended
> meaning of his text and an informed decision is made where all cases
> are now distinguished. "Legacy" encoding can be kept as is (in
> Unicode), even if it's no longer recommended, just like Unicode has
> documented that half-width Hangul is deprecated (it just offers a
> "compatibility decomposition" for NFKD or NFKC, but this is lossy and
> cannot be done automatically without a human decision).
> 
> And the user/author can now freely and easily compose any
> abbreviation he wishes in natural languages, without being limited by
> the reduced "legacy" set of <superscript letters> encoded in Unicode

Provided that the full Latin lowercase alphabet, and, for use in
all-caps settings only, possibly the full Latin uppercase alphabet,
are encoded, I can see no limitation, given that these letters have
the grapheme cluster base property and therefore work with all
combining diacritics. That already works with good font support, as
demonstrated in the parent thread.
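
The claim about the grapheme cluster base property can be checked
directly; a quick sketch (the rendered result of course depends on
font support):

    import unicodedata

    sup_e = "\u1D49"    # MODIFIER LETTER SMALL E
    acute = "\u0301"    # COMBINING ACUTE ACCENT

    print(unicodedata.category(sup_e))   # Lm: a letter, hence a grapheme cluster base
    print(unicodedata.combining(acute))  # 230: attaches to the preceding base
    print("2" + sup_e + acute)           # digit, superscript e and acute in one cluster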

> (which should no longer be extended, except for use as distinct plain
> letters needed in alphabets of actual natural languages, or as
> possibly new IPA symbols),

One should be able to overcome the habit of tagging superscripts as
not being “plain letters”, because that is irrelevant when they are
used as abbreviation indicators in natural languages, and as such they
are plain characters, like e.g. the Romance ordinal indicators U+00AA
and U+00BA; see also the DEGREE SIGN hijacked as a substitute for
<superscript o> because not superscripting the o in "nᵒ" is considered
unacceptable.
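
A quick property comparison (again only a sketch) shows why that
substitution hurts text processing even where the glyphs look alike:

    import unicodedata

    for ch in ("\u00BA", "\u00B0", "\u1D52"):   # º, °, ᵒ
        print("U+%04X  %-28s %s" % (ord(ch),
                                    unicodedata.name(ch),
                                    unicodedata.category(ch)))
    # U+00BA  MASCULINE ORDINAL INDICATOR  Lo
    # U+00B0  DEGREE SIGN                  So
    # U+1D52  MODIFIER LETTER SMALL O      Lm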

>  and without using the styling tricks (of
> HTML/CSS, or of word processor documents, spreadsheets, presentation
> documents allowing "rich text" formats on top of "plain text") which
> are best suitable for "free styling" of any human text, without any
> additional semantics, […]

Yes, I fully agree, provided “semantics” means what is required for
readability in accordance with the standard orthographies in use.

Best regards,

Marcel

