A last missing link for interoperable representation

Tue Jan 15 02:09:00 CST 2019

On 15/01/2019 03:02, Asmus Freytag via Unicode wrote:
> On 1/14/2019 5:41 PM, Mark E. Shoulson via Unicode wrote:
>> On 1/14/19 5:08 AM, Tex via Unicode wrote:
>>>
>>> This thread has gone on for a bit and I question if there is any more light that can be shed.
>>>
>>> BTW, I admit to liking Asmus definition for functions that span text being a definition or criteria for rich text.
>>>
>>>
>> Me too.  There are probably some exceptions or weird corner-cases, but it seems to be a really good encapsulation of the distinction which I had never seen before.
>>
> ** blush **
>
> A./
>
>
I liked it too, and I was really amazed that the issue could be boiled down to such a handy shibboleth. Only on looking harder did I come to see it as a mere rewording of current practice: if we’re using markup (which typically acts on spans and other elements), it’s rich text; if we’re using characters, it’s plain text. I changed my mind because the new shibboleth can be misused to relegate a feature of a writing system to the realm of rich text. Take superscript letters used as ordinal indicators (English "3ʳᵈ", French "2ᵉ" [order] or "2ⁿᵈ" [rank], Italian "1ᵃ" or, in Latin-1, "1ª", the latter being used in German as a narrow form of "prima" that has special semantics there ["top quality" or "great!"]). Such a feature could be relegated merely because it is currently emulated in rich text, by declaring that "ᵉ" is, or “should” be, a span with superscript markup, so that we end up with "2<sup>e</sup>".
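
To make the contrast concrete, here is a minimal Python sketch of the two representations of French "2ᵉ". The `strip_markup` helper is hypothetical, a crude stand-in for any process that discards markup (pasting into a plain-text field, indexing, and so on); only the `unicodedata` calls come from the standard library.

```python
import re
import unicodedata

# Hypothetical helper: a crude tag stripper standing in for any
# process that discards markup on the way to plain text.
def strip_markup(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)

plain = "2\u1d49"            # "2" + U+1D49 MODIFIER LETTER SMALL E
marked_up = "2<sup>e</sup>"  # the rich-text emulation

# The character-based form carries the superscript in the text itself.
print(unicodedata.name("\u1d49"))
print(unicodedata.decomposition("\u1d49"))

# Stripping markup leaves the character form intact,
# while the marked-up form collapses to a bare "2e".
print(strip_markup(plain) == plain)
print(strip_markup(marked_up))

# Canonical normalization (NFC) preserves the superscript letter;
# only compatibility normalization (NFKC) folds it to "e".
print(unicodedata.normalize("NFC", plain) == plain)
print(unicodedata.normalize("NFKC", plain))
```

The sketch only shows that the ordinal indicator survives when it is encoded as a character and is lost when it lives in markup; it says nothing about how any given font or renderer displays either form.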

As I pointed out (too briefly) in a previous reply, that is not what we should end up with. Abbreviation indicators in Latin script are a case for a per-character solution, even though several characters may be involved in a single instance. There can also be inner uppercase, aka camel case, which a titlecase attribute cannot handle. We’re clearly in the realm of plain text, and any other solution may be called an emulation, or a legacy workaround, but not a Unicode-conformant interoperable representation.

Also, please note the presence in Unicode of U+070F SYRIAC ABBREVIATION MARK, a format control… There are probably other format controls in other scripts doing much the same job. Remember when a similar solution was suggested for Latin script on this List…
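
The status of U+070F can be checked against the Unicode Character Database as shipped with Python. This is a small sketch using the standard `unicodedata` module; the `is_format_control` helper is a hypothetical name introduced here for illustration.

```python
import unicodedata

# Hypothetical helper: True when a character has General_Category Cf
# (Format), the category Unicode assigns to format controls.
def is_format_control(ch: str) -> bool:
    return unicodedata.category(ch) == "Cf"

sam = "\u070f"  # SYRIAC ABBREVIATION MARK

print(unicodedata.name(sam))
print(unicodedata.category(sam))
print(is_format_control(sam))

# For comparison: an ordinary Latin letter is not a format control.
print(is_format_control("e"))
```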

Best regards,

Marcel