Tag characters and in-line graphics (from Tag characters)

Asmus Freytag (t) asmus-inc at ix.netcom.com
Sun May 31 10:50:05 CDT 2015

On 5/31/2015 5:33 AM, Chris-as-John wrote:
> Yes, Asmus, good post. But I don’t really think HTML, even a subset, is 
> the right solution.

The longer I think about this, the more it seems that what is needed is 
something like an "abstract" format: a specification of the capabilities 
to be supported and the types of properties needed to support them in an 
extensible way. HTML and CSS would then possibly become one 
implementation of such a specification.

There would still be a place for a character set, that is Unicode, as an 
efficient way to implement the most basic and most standard features of 
text contents, but perhaps complemented by some mechanism that can 
handle various levels of extension.

The first level of extension is support for recent (or rare) code points 
in the character set (additional fonts, etc, as you mention).

The next level of extension could be support for collections of custom 
entities that are not available as character sets (stickers and the like).

And finally, there would have to be a way to deal with "one-offs", such 
as actual images that do not form categorizable sets, but are used in an 
ad-hoc manner and behave like custom characters.

And so on.

It should be possible to describe all of this in a way that allows it to 
be mapped to HTML and CSS or to any other rich text format -- the goal, 
after all, is to make such "inline text" as widely and effortlessly 
interchangeable as plain text is today (or at least nearly so).

By keeping the specification abstract, you could accommodate both 
SGML-like formats, where ASCII-string markup is intermixed with the 
text, and pure text buffers with placeholder code points and links to 
external data.
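
To make that last point concrete, here is a rough sketch (in Python) of 
one abstract sequence of inline units being written out both ways. 
Everything in it is an assumption made for illustration: the class 
names, the <ent> and <obj> element names, and the choice of U+FFFC 
(OBJECT REPLACEMENT CHARACTER) as the placeholder are hypothetical and 
not part of any existing specification.

```python
import html
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Char:
    """Ordinary Unicode text (including recent or rare code points)."""
    text: str


@dataclass(frozen=True)
class SetEntity:
    """Member of a custom collection, e.g. a sticker set (hypothetical)."""
    collection: str
    name: str


@dataclass(frozen=True)
class OneOff:
    """Ad-hoc image that behaves like a custom character (hypothetical)."""
    resource: str


Unit = Union[Char, SetEntity, OneOff]

# Assumed placeholder; a real design would also need a rule for text
# that already contains U+FFFC.
PLACEHOLDER = "\uFFFC"


def to_markup(units: List[Unit]) -> str:
    """SGML-like form: ASCII markup intermixed with the text."""
    out = []
    for u in units:
        if isinstance(u, Char):
            out.append(html.escape(u.text))
        elif isinstance(u, SetEntity):
            out.append('<ent set="%s" name="%s"/>' % (
                html.escape(u.collection), html.escape(u.name)))
        else:
            out.append('<obj src="%s"/>' % html.escape(u.resource))
    return "".join(out)


def to_buffer(units: List[Unit]) -> Tuple[str, Dict[int, Unit]]:
    """Pure text buffer: placeholders plus a side table of links,
    keyed by the placeholder's offset in the buffer."""
    parts = []
    links: Dict[int, Unit] = {}
    offset = 0
    for u in units:
        if isinstance(u, Char):
            parts.append(u.text)
            offset += len(u.text)
        else:
            links[offset] = u
            parts.append(PLACEHOLDER)
            offset += 1
    return "".join(parts), links


sample = [Char("a"), SetEntity("stickers.example/cats", "wave"),
          Char("<"), OneOff("img/1.png")]
markup = to_markup(sample)
buffer, links = to_buffer(sample)
```

Both functions walk the same list of units, which is the point: 
iteration by logical unit is defined on the abstract sequence, and the 
two wire formats are only different ways of writing it down.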

But, however bored you are with plain Unicode emoji, as long as there 
isn't an agreed-upon common format for rich "inline text" I see very 
little chance that those cute Facebook emoji will do anything other than 
firmly keep you in that particular ghetto.


> I’m reminded of the design for XML itself, it is supposed to start 
> with a header that defines what that XML will conform to. Those 
> definitions contain some unique identifiers of that XML schema, which 
> happens to be a URL. The URL is partly just a convenient unique 
> identifier, but also, the XML engine, if it doesn’t know about that 
> schema could go to that URL and download the schema, and check that 
> the XML  conforms to that schema.
> Similarly, imagine a text format that had a header with something like:
> \uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345
> Now all the characters following in the text will interpret characters 
> that start with 12345 with respect to that character set. What would 
> you find at facebook.com/charsets/pusheen-the-cat-emoji/? You might 
> find bitmaps, truetype fonts, vector graphics, etc. You might find 
> many many representations of that character set that your rendering 
> engine could cache for future use. The text format wouldn’t be reliant 
> on today’s favorite rendering technology, whether bitmap, truetype 
> fonts, or whatever. Right now, if you go to a website that references 
> unicode that your platform doesn’t know about, you see nothing. If a 
> format like this existed, character sets would be infinitely 
> extensible, everybody on earth could see characters, even if their 
> platform wasn’t previously aware of them, and the format would be 
> independent of today’s rendering technologies. Let’s face it, HTML5 
> changes every few years, and I don’t think anybody wants the 
> fundamental textual representation dependent on an entire layout 
> engine. And also the whole range of what HTML5 can do, even some 
> subset, is too much information. You don’t necessarily want your text 
> to embed the actual character set. Perhaps that might be a useful 
> option, but I think most people would want to uniquely identify the 
> character set, in a way that an engine can download it, but without 
> defining the actual details itself. Of course, certain charsets would 
> probably become pervasive enough that platforms would just include 
> them for convenience. Emojis by major messaging platforms. Maybe 
> characters related to specialised domains like, I don’t know, mapping 
> or specialised work domains or whatever. But without having to be 
> subservient to the central unicode committee.
> As someone who is a keen user of Facebook messenger, and who sees them 
> bring out a new set of emoji almost every week, I think the world will 
> soon be totally bored with the plain basic emoji that unicode has defined.
>> Chris
> On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) 
> <asmus-inc at ix.netcom.com> wrote:
>     reading this discussion, I agree with your reductio ad absurdum
>     of infinitely nested HTML.
>     But I think you are onto something with your hypothetical example
>     of the "subset that works in ALL textual situations".
>     There's clearly a use case for something like it, and I believe
>     many people would intuitively agree on a set of features for it.
>     What people seem to have in mind is something like "inline" text.
>     Something beyond a mere stream of plain text (with effectively
>     every character rendered visibly), but still limited in important
>     ways by general behavior of inline text: a string of it, laid out,
>     must wrap and line break, any objects included in it must behave
>     like characters (albeit of custom width, height and appearance),
>     and so on. Paragraph formatting, stacked layout, header levels and
>     all those good things would not be available.
>     With such a subset clearly defined, many quirky limitations might
>     no longer be necessary; any container that today only takes plain
>     text could be upgraded to take "inline text". I can see some
>     inline containers retaining a nesting limitation, but I could
>     imagine that it is possible to arrive at a consistent definition
>     of such inline format.
>     Going further, I can't shake the impression that without a clean
>     definition of an inline text format along those lines, any
>     attempts at making stickers and similar solutions "stick" are
>     doomed to failure.
>     The interesting thing in defining such a format is not how to
>     represent it in HTML or CSS syntax, but in describing what feature
>     sets it must (minimally) support. Doing it that way would free
>     existing implementations of rich text to map native formats onto
>     that minimally required subset and to add them to their format
>     translators for HTML or whatever else they use for interchange.
>     Only with a definition can you ever hope to develop a processing
>     model. It won't be as simple as for plain text strings, but it
>     should be able to support common abstractions (like iteration by
>     logical unit). It would have to support the management of external
>     resources - if the inline format allows images, custom fonts, etc.
>     one would need a way to manage references to them in the local
>     context.
>     If your skeptical position proves correct in that this is
>     something that turns out to not be tractable, then I think you've
>     provided conclusive proof why stickers won't happen and why
>     encoding emoji was the only sensible decision Unicode could have
>     taken.
>     A./
>     On 5/30/2015 7:14 AM, John wrote:
>>     Hmm, these "once entities" of which you speak, do they require
>>     javascript? Because I'm not sure what we are looking for here is
>>     static documents requiring a full programming language.
>>     But let's say for a moment that html5 can, or could do the job
>>     here. Then to make the dream come true that you could just cut
>>     and paste text that happened to contain a custom character to
>>     somewhere else, and nothing untoward would happen, would mean
>>     that everything in the computing universe should allow full blown
>>     html. So every Java Swing component, every Apple gui component,
>>     every .NET component, every windows component, every browser,
>>     every Android and IOS component would allow text entry of HTML
>>     entities. OK, so let's say everyone agrees with this course of
>>     action, now the universal text format is HTML.
>>     But in this new world where anywhere that previously you could
>>     input text, you can now input full blown html, does that actually
>>     make sense? Does it make sense that you can for example, put full
>>     blown HTML inside a H1 tag in html itself? That's a lot of
>>     recursion going on there. Or in a MS-Excel cell? Or interspersed
>>     in some otherwise fairly regular text in a Word document?
>>     I suppose someone could define a strict limited subset of HTML to
>>     be that subset that makes sense in ALL textual situations. That
>>     subset would be something like just defining things that act like
>>     characters, and not like a full blown rendering engine. But who
>>     would define that subset? Not the HTML groups, because their
>>     mandate is to define full blown rendering engines. It would be
>>     more likely to be something like the unicode group.
>>     And also, in this brave new world where HTML5 is the new standard
>>     text format, what would the binary format of it be? I mean, if I
>>     have the string of unicode characters <IMG, would that be an HTML5
>>     image definition that should be rendered as such? Or would it be
>>     text that happens to contain greater than symbol, I, M and G? It
>>     would have to be the former I guess, and thereby there would no
>>     longer be a unicode symbol for the mathematical greater than
>>     symbol. Rather there would be a unicode symbol for opening a HTML
>>     tag, and the text code for greater than would be &gt;. Never again
>>     would a computer store > to mean greater than. Do we want HTML to
>>     be so pervasive? Not sure it deserves that.
>>     And from a programmer's point of view, he wants to be able to
>>     iterate over an array of characters and treat each one the same
>>     way, regardless if it is a custom character or not. Without that
>>     kind of programmatic abstraction, the whole thing can never gain
>>     traction. I don't think fully blown HTML embedded in your text
>>     can fulfill that. A very strictly defined subset, possibly could.
>>     Sure HTML5 can RENDER stuff adequately, if the only aim of the
>>     game is to provide a correct rendering. But to be able to actually
>>     treat particular images embedded as characters, and have some
>>     programming library see that abstraction consistently, I'm not
>>     sure I'm convinced that is possible. Not without nailing down
>>     exactly what html elements in what particular circumstances
>>     constitute a "character".
>>     I guess in summary, yes we have the technology already to render
>>     anything. But I don't think the whole standards framework does
>>     anything to allow the computing universe to actually exchange
>>     custom characters as if they were just any other text. Someone
>>     would actually have to  work on a standard to do that, not just
>>     point to html5.
>>     On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy
>>     <verdy_p at wanadoo.fr>, wrote:
>>         2015-05-29 4:37 GMT+02:00 John <idou747 at gmail.com>:
>>             "Today the world goes very well with HTML(5) which is now
>>             the best markup language for documents (including for
>>             inserting embedded images that don’t require any external
>>             request)"
>>             If I had a large document that reused a particular
>>             character thousands of times, would this HTML markup
>>             require embedding that character thousands of times, or
>>             could I define the character once at the beginning of the
>>             sequence, and then refer back to it in a space efficient way?
>>         HTML(5) allows defining *once* entities for images that can
>>         then be reused thousands of times without repeating their
>>         definition. You can do this as well with CSS styles, just
>>         define a class for a small element. This element may still be
>>         an "image", but the semantic is carried by the class you
>>         assign to it. You are not required to provide an external
>>         source URL for that image if the CSS style provides the content.
>>         You may also use PUAs for the same purpose (however I have
>>         not seen how CSS allows styling individual characters in
>>         text elements as these characters are not elements, and
>>         there's no defined selector for pseudo-elements matching a
>>         single character). PUAs are perfectly usable in the situation
>>         where you have embedded a custom font in your document for
>>         assigning glyphs to characters (you can still do that, but I
>>         would avoid TrueType/OpenType for this purpose, but would use
>>         the SVG font format which is valid in CSS, for defining a
>>         collection of glyphs).
>>         If the document is not restricted to be standalone, of course
>>         you can use links to an external shared CSS stylesheet and to
>>         this SVG font referenced by the stylesheet. With such
>>         approach, you don't even need to use classes on elements, you
>>         use plain-text with very compact PUAs (it's up to you to
>>         decide if the document must be standalone, embedding
>>         everything it needs, or must use external references for
>>         missing definitions; HTML allows both, and SVG as well when
>>         it contains plain-text elements).
