Tag characters and in-line graphics (from Tag characters)

Tue Jun 2 20:50:19 CDT 2015

Martin, you seem to be labouring under the impression that HTML5 is a substitute for character encoding. If it is, why do we need unicode? We could just have documents laden with <IMG> tags, and restrict ourselves to ASCII.

It seems I need to spell out one more time why HTML is not character encoding:

1. HTML5 doesn’t separate one particular representation (font, size, etc) from the actual meaning of the character. So you can’t paste it somewhere and expect to increase its point size or change its font.
2. It’s highly inefficient in space to drop multi-kilobyte strings into a document to represent one character.
3. The entire design of HTML has nothing to do with characters. So there is no way to process a string of characters interspersed with HTML elements and know which of those elements is a “character”. This makes programmatic manipulation impossible, and means most computer applications simply will not allow HTML in scenarios where they expect a list of “characters”.
4. There is no way to compare 2 HTML elements and know they are talking about the same character. I could put some HTML representation of a character in my document, you could put a different one in, and there would be absolutely no way to know that they are the same character. Even if we are in the same community and agree on the existence of this character.
5. Similarly, there is no way to search or index HTML elements. If an HTML document contained an image of a particular custom character, there would be no way to ask Google or whatever to find all the documents with that character. Different documents would represent it differently. HTML is a rendering technology. It makes things LOOK a particular way, without actually ENCODING anything about it. The only part of HTML that is actually searchable in a deterministic fashion is the part that is encoded - the unicode part.
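
To make points 4 and 5 concrete, here is a minimal Python sketch. The two <img> strings, their file names and the helper shared_chars are made up for illustration; the only thing taken from reality is that U+1F4A9 is the encoded PILE OF POO character.

```python
import unicodedata

# Two documents that both use the encoded character U+1F4A9.
doc_a = "I stepped in \U0001F4A9 today"
doc_b = "Watch out for the \U0001F4A9"

# Two documents that "draw" the same symbol with markup instead.
# The file names and attributes are hypothetical.
html_a = 'I stepped in <img src="poo.png" height="16"> today'
html_b = 'Watch out for the <img src="/icons/pile.svg" class="emoji">'


def shared_chars(x, y):
    """Return the non-ASCII code points that occur in both strings."""
    return {c for c in set(x) & set(y) if ord(c) > 127}


# The encoded documents can be matched on the character itself,
# and the character can be identified by name.
print(shared_chars(doc_a, doc_b))
print(unicodedata.name("\U0001F4A9"))  # PILE OF POO

# The markup documents share no character that identifies the symbol.
print(shared_chars(html_a, html_b))  # set()
```

A search engine is in the same position as shared_chars here: with the encoded form it has one code point to index, with the markup form it has two unrelated strings.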

Unicode encodes symbols that have “reasonable popularity”. (a) that is not all of them. (b) how can a symbol attain reasonable popularity when it is not in unicode? Of course some can, but others have their popularity hindered by the very fact that they are not encoded!

Take the poop emoji that people have recently been talking about here. It gained popularity because the Japanese telecom companies decided to encode it. If they hadn’t encoded it, would it have become popular through normal culture, such that the unicode consortium would have adopted it? No, it wouldn’t! The Japanese telcos were able to do this because they controlled their entire user base, from hardware on up to encodings. That won’t be happening in the future, so new, interesting and potentially universal emojis won’t ever come into existence the way this one did, because of the control the unicode consortium exercises over this technology. But the problem isn’t restricted to emojis; many other potentially popular symbols can’t come into existence either. The internet *COULD* be the birthplace of lots of interesting new symbols, in the same way that Japanese telecom companies birthed the original emojis, but it won’t be, because the unicode consortium rules it from the top down.

Summary: 
1. HTML renders stuff, it encodes nothing. It addresses a completely different problem domain. If rendering and encoding were the same problem, unicode could disband now.
2. Unicode encodes stuff, but isn’t extensible in a way that is broadly useful, i.e. in a way that allows anybody (or any application) receiving a custom character to know what it is, or how to render it, or to combine it with other custom character sets.
3. The problem under discussion is not a rendering problem. HTML5 lacks nothing in terms of ability to render. Yet the problem remains. Because it’s an encoding problem. Encoding problems are in the unicode domain, not in the HTML5 domain.
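
For point 2, this is roughly what the receiving end sees today. A minimal Python sketch, assuming the extension mechanism in question is the Private Use Area; U+E000 is simply the first PUA code point, picked as a stand-in for a custom symbol, and the two "community" tables are hypothetical.

```python
import unicodedata

# Stand-in for a custom symbol; its meaning exists only by private agreement.
custom = "\uE000"

# The receiver learns that the code point is private use, and nothing more.
print(unicodedata.category(custom))            # Co
print(unicodedata.name(custom, "<no name>"))   # <no name>

# Two unrelated communities can assign the same code point to different
# symbols (hypothetical assignments), and the text itself cannot tell
# them apart or carry both at once.
community_a = {"\uE000": "company logo"}
community_b = {"\uE000": "chess variant piece"}
clash = community_a.keys() & community_b.keys()
print(clash)
```

Nothing in the string says which agreement applies or how to draw the symbol, which is the sense in which the mechanism isn't broadly useful.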

You say that character encodings work best when they are used widely and uniformly. But they can only be as wide or as uniform as reality itself. We could try to conform reality to technology and, for example, force all the world to use Latin characters and the 128 ASCII code points. OR we can conform technology to reality. Not all encodings need to be, or ought to be, so universal that one worldwide committee has to pass judgment on them.

> On 3 Jun 2015, at 11:09 am, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
> 
> On 2015/06/03 07:55, Chris wrote:
> 
>> As you point out, "The UCS will not encode characters without a demonstrated usage.”. But there are use cases for characters that don’t meet UCS’s criteria for a world wide standard, but are necessary for more specific use cases, like specialised regional, business, or domain specific situations.
> 
> Unicode contains *a lot* of characters for specialized regional, business, or domain specific situations.

> 
>> My question is, given that unicode can’t realistically (and doesn’t aim to) encode every possible symbol in the world, why shouldn’t there be an EXTENSIBLE method for encoding, so that people don’t have to totally rearchitect their computing universe because they want ONE non-standard character in their documents?
> 
> As has been explained, there are technologies that allow you to do (more or less) that. Information technology, like many other technologies, works best when finding common cases used by many people. Let's look at some examples:
> 
> Character encodings work best when they are used widely and uniformly. I don't know anybody who actually uses all the characters in Unicode (except the guys that work on the standard itself). So for each individual, a smaller set would be okay. And there were (and are) smaller sets, not for individuals, but for countries, regions, scripts, and so on. Originally (when memory was very limited), these legacy encodings were more efficient overall, but that's no longer the case. So everything is moving towards Unicode.
> 
> Most Website creators don't use all the features in HTML5. So having different subsets for different use cases may seem to be convenient. But overall, it's much more efficient to have one Hypertext Markup Language, so that's where everybody is converging.
> 
> From your viewpoint, it looks like having something in between character encodings and HTML is what you want. It would only contain the features you need, and nothing more, and would work in all the places you wanted it to work. Asmus's "inline" text may be something similar.
> 
> The problem is that such an intermediate technology only makes sense if it covers the needs of lots and lots of people. It would add a third technology level (between plain text and marked-up text), which would divert energy from the current two levels and make things more complicated.
> 
> Up to now, such a third level hasn't emerged, among other reasons because both existing technologies were good at absorbing the most important use cases from the middle. Unicode continues to encode whatever symbols gain reasonable popularity, so every time somebody has a "real good use case" for the middle layer with a symbol that isn't yet in Unicode, that use case gets taken away. HTML (or Web technology in general) also worked to improve the situation, with technologies such as SVG and Web Fonts.
> 
> No technology is perfect, and so there are still some gaps between character encoding and markup, some of which may in due time eventually be filled up, but I don't think a third layer in the middle will emerge soon.
> 
> Regards,   Martin.