Tag characters and in-line graphics (from Tag characters)

Chris idou747 at gmail.com
Sun May 31 18:33:49 CDT 2015


Of course, anyone can invent a character set. The difficult bit is having a standard way of combining custom character sets. That’s why a standard would be useful.

And while stuff like this can, to some extent, be recognised by magic numbers, and unique strings in headers, such things are unreliable. Just because example.net/mycharset/ <http://example.net/mycharset/> appears near the start of a document, doesn’t necessarily mean it was meant to define a character set. Maybe it was a document discussing character sets.

And while it is tempting to allow the “container” to define the “header” information, whether the container be html defining something in its HEAD tag, or some proprietary format (MS-Word), or whatever, that doesn’t really solve anybody’s problem in a standard way. For a start, what if you want to copy text to the clipboard? You want the thing receiving it to be able to interpret it in a self-contained way.

The 2 obvious implementations for a standard seem to be:

1) A standard (optional) header. Perhaps if the string starts with a special character, a header defining charsets follows before the text itself. The header would allocate character ranges for custom characters, and point to where their renderings can be found. Standard programming libraries on all platforms would invisibly act appropriately on these headers. If you concatenated strings with conflicting namespaces, standard libraries would seamlessly reallocate one of the custom namespaces and merge the headers.

2) Make a new character set, let’s call it UTF-64. 32 bits would be allocated for custom character sets. Anybody could apply to a central authority to be allocated a custom id (32 bits=4 billion ids). A central location, kind of like a domain name system, would map that id to the URL where the canonical definition for that character set is.
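As a rough illustration of the concatenation step in option 1, the sketch below models a string as a body plus a header that maps private-use blocks to charset URLs. The names `TaggedText` and `concat`, the fixed block size of 256 code points, and the choice of the BMP Private Use Area are all assumptions made for the example; none of this exists in any standard or library.

```python
from dataclasses import dataclass, field

PUA_START, PUA_END = 0xE000, 0xF8FF   # BMP Private Use Area
BLOCK = 0x100                          # assumed size of one custom range


@dataclass
class TaggedText:
    """Text plus a header mapping PUA block starts to charset URLs (hypothetical)."""
    body: str
    header: dict = field(default_factory=dict)   # block start -> charset URL


def _free_block(used):
    """Return the first PUA block start that is not in `used`."""
    for start in range(PUA_START, PUA_END + 1, BLOCK):
        if start not in used:
            return start
    raise ValueError("no free private-use block left")


def concat(a, b):
    """Concatenate two texts, moving b's blocks when they clash with a's."""
    header = dict(a.header)
    by_url = {url: start for start, url in header.items()}
    remap = {}                          # b's block start -> block start in the result
    for start, url in b.header.items():
        if url in by_url:               # same charset already present: reuse its block
            target = by_url[url]
        elif start in header:           # same block, different charset: reallocate
            target = _free_block(set(header))
        else:
            target = start
        header[target] = url
        by_url[url] = target
        remap[start] = target

    def move(ch):
        cp = ord(ch)
        if not PUA_START <= cp <= PUA_END:
            return ch
        base = cp - (cp - PUA_START) % BLOCK
        if base in remap:
            return chr(remap[base] + (cp - base))
        return ch

    return TaggedText(a.body + "".join(move(c) for c in b.body), header)
```

Two strings that both claim the block at U+E000 for different charsets would come out with the second charset moved to U+E100 and its characters rewritten to match, which is the "seamless reallocation" the option relies on.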

The 2nd option has the advantage that the file format is fixed width like normal plain text documents. Concatenating custom character set strings is no issue. The canonical location for a character set isn’t forevermore mapped to a particular domain owner. Nothing about the meaning of the characters is defined in the actual bits other than the unique id. The disadvantage is it needs a central authority to maintain the list of ids, and map them to domains.
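To make the fixed-width property concrete, here is a minimal sketch of what a "UTF-64" code unit could look like: 32 bits of charset id followed by 32 bits of code point within that charset. The big-endian layout, the convention that id 0 means ordinary Unicode, and the `REGISTRY` dictionary standing in for the central authority are all assumptions for illustration.

```python
import struct

UNICODE_ID = 0  # assumption: id 0 means "ordinary Unicode code point"

# Stand-in for the central registry that maps ids to canonical URLs.
REGISTRY = {12345: "http://example.net/mycharset/"}


def encode_utf64(chars):
    """Pack (charset_id, code_point) pairs as big-endian 64-bit units."""
    return b"".join(struct.pack(">II", cid, cp) for cid, cp in chars)


def decode_utf64(data):
    """Unpack 64-bit units back into (charset_id, code_point) pairs."""
    if len(data) % 8:
        raise ValueError("UTF-64 data must be a multiple of 8 bytes")
    return [struct.unpack_from(">II", data, i) for i in range(0, len(data), 8)]
```

Because every unit carries its own charset id, concatenation is plain byte concatenation, and the n-th character always sits at byte offset 8 * n.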



> On 1 Jun 2015, at 7:26 am, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 
> The "abstract format" already exists for HTML, in the MIME "charset" parameter of the media type "text/plain". It can also be embedded in a meta tag when the HTML source file is just stored in a filesystem, so that a webserver can parse it and provide the correct MIME header if it has no repository for metadata and must infer the media type from the file content itself with some guesser.
> 
> It also exists in various conventions for source code (recognized by editors such as vi(m) or Emacs) or for Unix shells, using embedded "magic" identifiers near the top of the file.
> 
> You can use it to send an identifier for a private charset without having to request registration of the charset in the IANA database (which is not intended for private encodings). The private charset can be named in a unique way (consider using a private charset name based on a domain name you own, such as "x-www.example.net-mycharset-1" if you own the domain name "example.net <http://example.net/>"). It will be enough for the initial experimentation for a few years (or more, provided that you renew this domain name). Your charset can contain various definitions: a mapping of your codepoints (including PUAs, or standard codepoints, or "hacked" codepoints if you have no other solution to get the correct character properties working with existing algorithms such as case mappings, collation, layout behavior in text renderers).
> 
> Such a solution would allow more predictable management of PUAs by allowing their scope of use to be controlled: they would be bound, only in some magic header of the document, to a private charset that remains reasonably unique. For example, "x-example.net-mycharset-1" would map to a URL like "//www.example.net/mycharset/1/ <http://www.example.net/mycharset/1/>" containing some schema. It could be the base address of an XML or JSON file, of a web font containing the relevant glyphs, and of a character properties database to override the default ones from the standard. If you already know this private charset in your application, you don't need to download any of these files: the URL is just an identifier and your file can still be used in standalone mode, just as you can parse many standard XML schemas by recognizing the URLs assigned to the XML namespaces, without even having to find a DTD or XML schema definition from an external resource. If needed, your app can contain a local repository in some cache folder where you can extend the number of private "charsets" that can be recognized.
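The binding from a private charset name to a URL that serves only as an identifier can be sketched as follows. The exact name grammar ("x-", then domain, charset name and version separated by hyphens) and the `KNOWN_CHARSETS` dictionary standing in for the local cache folder are assumptions; the message above gives only one example of the mapping, not a rule.

```python
KNOWN_CHARSETS = {
    # URL used purely as an identifier -> locally stored definition (placeholder data)
    "//www.example.net/mycharset/1/": {"U+E000": "hypothetical glyph 0"},
}


def charset_url(name):
    """Map 'x-<domain>-<charset>-<version>' to '//www.<domain>/<charset>/<version>/'."""
    if not name.startswith("x-"):
        raise ValueError(f"not a private charset name: {name!r}")
    # Split from the right so that hyphens inside the domain are preserved.
    parts = name[2:].rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed private charset name: {name!r}")
    domain, charset, version = parts
    return f"//www.{domain}/{charset}/{version}/"


def lookup(name):
    """Return the locally known definition, or None if it would have to be fetched."""
    return KNOWN_CHARSETS.get(charset_url(name))
```

No network access happens here: a charset the application already knows resolves from the local table, and `None` signals that a download from the URL would be the fallback.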
> 
> ----
> 
> Full interoperability will still not be possible if you need to mix, in the same document, texts encoded with different private charsets (there's always a risk of collision) without a way to re-encode some of them to a joined charset without the collisions, by inferring a new private charset. That is not impossible to do; after all, this is done already with XML schemas that you can mix together: you just need to rename the XML namespaces, keeping the URLs to which they are bound, when there's a collision on the XML namespace names, a situation that occurs sometimes because of versioning where some features of a schema are not fully upward compatible.
> 
> Yes, this complicates things a bit, but much less than using documents in which PUA assignments are not negotiated at all (even minimally, to make sure they are compatible when mixing sources), and for which there exists for now absolutely no protocol for such negotiation. TUS says that PUAs are usable and interchangeable under "private mutual agreement" but still provides no scheme for supporting such mutual agreement. For this reason, PUAs are almost always rejected, and people want true permanent assignments for characters that are very specific, badly documented, or insufficiently known to have reliable permanent properties.
> 
> So let's think about securing the use of PUAs with some identification scheme. For plain-text formats, it should just be allowed to negotiate a single charset for the whole, using the "magic" header tricks that have long been used by charset guessers (including for autodetecting UTF-8 encoded files).
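For reference, the kind of "magic" header sniffing mentioned above can be sketched in a few lines. This is a simplified guesser, not any particular tool's algorithm: it checks for a UTF-8 byte order mark and then for an Emacs-style or vim-style `coding` declaration in the first two lines (the same convention Python's PEP 263 uses). Accepting a private name such as "x-example.net-mycharset-1" here is an assumption about how such a scheme might be carried.

```python
import codecs
import re

# Matches "coding: name" or "coding=name", which also covers vim's "fileencoding=name".
DECLARATION = re.compile(rb"coding[:=]\s*([-\w.]+)")


def sniff_charset(data, default="utf-8"):
    """Guess a charset name from a BOM or a declaration in the first two lines."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    for line in data.split(b"\n", 2)[:2]:
        m = DECLARATION.search(line)
        if m:
            return m.group(1).decode("ascii")
    return default
```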
> 
> This would also solve the chicken-and-egg problem where we need more sources to attest effective usage before encoding new characters, but developing this usage is extremely difficult (and much slower) with our modern technologies, where most documents are now handled digitally. In the past it was possible to create a metal font and use it immediately to start editing books, and there were many more people using handwriting and drawings, so it was much less difficult to invent new characters than it is today, unless you're a big company that has enough resources to develop this usage alone, such as Japanese telcos or Google, Yahoo, Samsung or Microsoft introducing new sets of emoji for their instant messaging platforms, with tons of developers working for them to develop a wide range of services around it.
> 
> However, I'm not saying that Unicode should specify how such a private charset containing private assignments could be inserted in headers. I just think that it should describe a mechanism and give examples of how common text formats are already used to convey some "magic" identifiers near the top of the file; then we could describe a service for locating and retrieving the definitions associated with this identifier, and some interchangeable format for this information.
> 
> 
> 2015-05-31 17:50 GMT+02:00 Asmus Freytag (t) <asmus-inc at ix.netcom.com <mailto:asmus-inc at ix.netcom.com>>:
> On 5/31/2015 5:33 AM, Chris-as-John wrote:
>> 
>> Yes, Asmus good post. But I don’t really think HTML, even a subset, is really the right solution.
> 
> The longer I think about this, what would be needed would be something like an "abstract" format. A specification of the capabilities to be supported and the types of properties needed to support them in an extensible way. HTML and CSS would possibly become an implementation of such a specification.
> 
> There would still be a place for a character set, that is Unicode, as an efficient way to implement the most basic and most standard features of text contents, but perhaps with some extension mechanism that can handle various extensions.
> 
> The first level of extension is support for recent (or rare) code points in the character set (additional fonts, etc, as you mention).
> 
> The next level of extension could be support for collections of custom entities that are not available as character sets (stickers and the like).
> 
> And finally, there would have to be a way to deal with "one-offs", such as actual images that do not form categorizable sets, but are used in an ad-hoc manner and behave like custom characters.
> 
> And so on.
> 
> It should be possible to describe all of this in a way that allows it to be mapped to HTML and CSS or to any other rich text format -- the goal, after all, is to make such "inline text" as widely and effortlessly interchangeable as plain text is today (or at least nearly so).
> 
> By keeping the specification abstract, you could accommodate both SGML like formats where ascii-string markup is intermixed with the text, as well as pure text buffers with place holder code points and links to external data.
> 
> But, however bored you are with plain Unicode emoji, as long as there isn't an agreed upon common format for rich "inline text" I see very little chance that those cute facebook emoji will do anything other than firmly keep you in that particular ghetto.
> 
> A./
> 
>> I’m reminded of the design for XML itself, it is supposed to start with a header that defines what that XML will conform to. Those definitions contain some unique identifiers of that XML schema, which happens to be a URL. The URL is partly just a convenient unique identifier, but also, the XML engine, if it doesn’t know about that schema could go to that URL and download the schema, and check that the XML conforms to that schema.
>> 
>> Similarly, imagine a text format that had a header with something like:
>> \uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345 <http://facebook.com/charsets/pusheen-the-cat-emoji/,12345>
>> 
>> Now any character in the following text that starts with 12345 would be interpreted with respect to that character set. What would you find at facebook.com/charsets/pusheen-the-cat-emoji/ <http://facebook.com/charsets/pusheen-the-cat-emoji/>? You might find bitmaps, truetype fonts, vector graphics, etc. You might find many many representations of that character set that your rendering engine could cache for future use. The text format wouldn’t be reliant on today’s favorite rendering technology, whether bitmap, truetype fonts, or whatever. Right now, if you go to a website that references unicode that your platform doesn’t know about, you see nothing. If a format like this existed, character sets would be infinitely extensible, everybody on earth could see characters, even if their platform wasn’t previously aware of them, and the format would be independent of today’s rendering technologies. Let’s face it, HTML5 changes every few years, and I don’t think anybody wants the fundamental textual representation dependent on an entire layout engine. And also the whole range of what HTML5 can do, even some subset, is too much information. You don’t necessarily want your text to embed the actual character set. Perhaps that might be a useful option, but I think most people would want to uniquely identify the character set, in a way that an engine can download it, but without defining the actual details itself. Of course, certain charsets would probably become pervasive enough that platforms would just include them for convenience. Emojis by major messaging platforms. Maybe characters related to specialised domains like, I don’t know, mapping or specialised work domains or whatever, but without having to be subservient to the central unicode committee.
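A reader for the header line imagined above might look like the sketch below. The message only gives one example line, so the grammar is an assumption: a literal backslash-u followed by "CHARSET:", a URL without commas or whitespace, a comma, and a decimal prefix, with each header on its own line at the start of the text.

```python
import re

# Assumed grammar for one header line; not defined by any standard.
HEADER = re.compile(r"\\uCHARSET:(?P<url>[^,\s]+),(?P<prefix>\d+)")


def split_header(text):
    """Return ({prefix: charset URL}, remaining text) for leading header lines."""
    charsets = {}
    lines = text.split("\n")
    while lines:
        m = HEADER.fullmatch(lines[0])
        if m is None:
            break
        charsets[m["prefix"]] = m["url"]
        lines.pop(0)
    return charsets, "\n".join(lines)
```

The returned dictionary is what a rendering engine would consult (and cache) when it meets a character starting with one of the declared prefixes; text without any header passes through unchanged.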
>> 
>> As someone who is a keen user of Facebook messenger, and who sees them bring out a new set of emoji almost every week, I think the world will soon be totally bored with the plain basic emoji that unicode has defined.
>> 
>> 
>> Chris
>> 
>> 
>> On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) <asmus-inc at ix.netcom.com <mailto:asmus-inc at ix.netcom.com>> wrote:
>> Reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML.
>> 
>> But I think you are onto something with your hypothetical example of the "subset that works in ALL textual situations".
>> 
>> There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it.
>> 
>> What people seem to have in mind is something like "inline" text. Something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by general behavior of inline text: a string of it, laid out, must wrap and line break, any objects included in it must behave like characters (albeit of custom width, height and appearance), and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available.
>> 
>> With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take "inline text". I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such inline format.
>> 
>> Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions "stick" are doomed to failure.
>> 
>> The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HTML or whatever else they use for interchange.
>> 
>> Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit). It would have to support the management of external resources - if the inline format allows images, custom fonts, etc. one would need a way to manage references to them in the local context.
>> 
>> If your skeptical position proves correct in that this is something that turns out to not be tractable, then I think you've provided conclusive proof why stickers won't happen and why encoding emoji was the only sensible decision Unicode could have taken.
>> 
>> A./
>> 
>> On 5/30/2015 7:14 AM, John wrote:
>>> 
>>> Hmm, these "once entities" of which you speak, do they require javascript? Because I'm not sure what we are looking for here is static documents requiring a full programming language.
>>> 
>>> But let's say for a moment that html5 can, or could do the job here. Then to make the dream come true that you could just cut and paste text that happened to contain a custom character to somewhere else, and nothing untoward would happen, would mean that everything in the computing universe should allow full blown html. So every Java Swing component, every Apple gui component, every .NET component, every windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action, now the universal text format is HTML.
>>> 
>>> But in this new world where anywhere that previously you could input text, you can now input full blown html, does that actually make sense? Does it make sense that you can, for example, put full blown HTML inside an H1 tag in html itself? That's a lot of recursion going on there. Or in a MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document?
>>> 
>>> I suppose someone could define a strict limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full blown rendering engine. But who would define that subset? Not the HTML groups, because their mandate is to define full blown rendering engines. It would be more likely to be something like the unicode group.
>>> 
>>> And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, if I have the string of unicode characters <IMG would that be an HTML5 image definition that should be rendered as such? Or would it be text that happens to contain a less-than symbol, I, M and G? It would have to be the former I guess, and thereby there would no longer be a unicode symbol for the mathematical less-than symbol. Rather there would be a unicode symbol for opening an HTML tag, and the text code for less-than would be &lt;. Never again would a computer store < to mean less-than. Do we want HTML to be so pervasive? Not sure it deserves that.
>>> 
>>> And from a programmer's point of view, he wants to be able to iterate over an array of characters and treat each one the same way, regardless of whether it is a custom character or not. Without that kind of programmatic abstraction, the whole thing can never gain traction. I don't think fully blown HTML embedded in your text can fulfill that. A very strictly defined subset possibly could. Sure, HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a correct rendering. But to be able to actually treat particular images embedded as characters, and have some programming library see that abstraction consistently, I'm not sure I'm convinced that is possible. Not without nailing down exactly what html elements in what particular circumstances constitute a "character".
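The programmatic abstraction asked for above can be illustrated with a small model in which every element of a text is one logical character, whether it is a Unicode code point or an entry in a custom set. The `Char` type, the convention that `charset=None` means plain Unicode, and the bracketed placeholder used in place of a real glyph are all assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Char:
    """One logical character: a Unicode code point or an entry in a custom set."""
    code: int
    charset: Optional[str] = None   # None means plain Unicode (assumed convention)

    def display(self):
        if self.charset is None:
            return chr(self.code)
        return f"[{self.charset}#{self.code}]"   # placeholder for a fetched glyph


def from_str(s):
    """Wrap each code point of an ordinary string as a Char."""
    return [Char(ord(c)) for c in s]


text = from_str("a>b") + [Char(7, "example.net/mycharset/")] + from_str("!")

# Every element is handled the same way, custom or not.
rendered = "".join(ch.display() for ch in text)
```

Length, indexing and iteration all count the custom character as exactly one unit, and a literal ">" stays an ordinary character instead of becoming markup.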
>>> 
>>> I guess in summary, yes we have the technology already to render anything. But I don't think the whole standards framework does anything to allow the computing universe to actually exchange custom characters as if they were just any other text. Someone would actually have to work on a standard to do that, not just point to html5.
>>> 
>>> 
>>> On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy <verdy_p at wanadoo.fr <mailto:verdy_p at wanadoo.fr>>, wrote:
>>> 
>>> 2015-05-29 4:37 GMT+02:00 John <idou747 at gmail.com <mailto:idou747 at gmail.com>>:
>>> "Today the world goes very well with HTML(5) which is now the best markup language for document (including for inserting embedded images that don’t require any external request)"
>>> If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way?
>>> 
>>> HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an "image", but the semantics are carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content.
>>> 
>>> You may also use PUAs for the same purpose (however I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose and would use the SVG font format, which is valid in CSS, for defining a collection of glyphs).
>>> 
>>> If the document is not restricted to be standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet. With such an approach, you don't even need to use classes on elements: you use plain text with very compact PUAs. It's up to you to decide if the document must be standalone (embedding everything it needs) or must use external references for missing definitions; HTML allows both (and SVG as well when it contains plain-text elements).
>>> 
>> 
>> 
> 
> 


