HTML entities

Martin J. Dürst duerst at it.aoyama.ac.jp
Fri Mar 19 01:40:32 CDT 2021


Hello Jukka, others,

On 2021/03/18 17:20, Jukka K. Korpela via Unicode wrote:
> Tex (textexin at xencraft.com) wrote:

>> However, you are quoting a doc that has been withdrawn.

> It’s a pity that this well-written and useful document was withdrawn, for
> reasons I don’t understand.

Here are the main reasons, as far as I understand them. Unicode gets 
updated roughly once a year, and Web technology also changes over time. 
There was not enough manpower to keep the document up to date.

In addition, the document was always a kind of tug-of-war between those 
who pushed for more favorable descriptions of specific Unicode
characters (such as ⁴ in this discussion) and those who pushed for more
favorable descriptions of markup-based and style-based solutions (such
as <sup></sup>). That
meant that for each update, in addition to dealing with new characters, 
there was a tendency to re-negotiate already established text.

A consequence of this tug-of-war was that the document was written in a 
way that made clear that there was some choice between markup/styling 
and special-purpose Unicode characters, but allowed each side to 
interpret the document in the way they were seeing things.

On top of that, the document was also a joint publication of the Unicode 
Consortium and W3C. So there were cases where a tug-of-war happened
inside the W3C, inside Unicode, between the two organizations, or all
of these at the same time. Publication required approval by both sides, and
even a minor tweak from one side had to be approved by the other side. 
The schedules of both sides had otherwise no reason to be in sync, so 
the next version of Unicode was often around before the update for the 
previous version was beginning to settle. So at some point, some brave 
soul became aware of this situation and proposed a withdrawal, and 
nobody else had the energy to object.

> Yet, the statement I quoted is valid and relevant on its own. To take an
> even more understandable example, the use of 10<sup>4</sup> versus 10⁴
> means that when an HTML document is saved as plain text, or copied and
> pasted to a plain text environment, or rendered in Braille or speech, the
> expression denoting the number 10,000 suddenly becomes 104.


Well, and then somebody else uses 10<sup>3.5</sup> somewhere. How are you
going to express this so that it doesn't turn into 103.5 in plain text? 
The problem is that there is always a limit somewhere for plain text. 
There is also always a limit somewhere for markup and styled rendering, 
but it's in a quite different place.
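
To make the plain-text limit concrete, here is a minimal sketch in
Python. It assumes that "saving as plain text" simply drops all tags and
keeps the character data; real converters may behave differently. The
names TextOnly and to_plain_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps character data and drops every tag (a naive 'save as text')."""

    def __init__(self):
        # convert_charrefs=True resolves references such as &#x2074;
        # before the data reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(markup):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(to_plain_text("10<sup>4</sup>"))    # prints 104
print(to_plain_text("10&#x2074;"))        # prints 10⁴
print(to_plain_text("10<sup>3.5</sup>"))  # prints 103.5
```

The first and third lines lose the exponent once the markup is gone,
while the second keeps it because SUPERSCRIPT FOUR is part of the
character data. As far as I know there is no superscript full stop in
Unicode, so the 3.5 case cannot be rescued the same way.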


>> If there are issues with how <sup> is implemented and renders, they should
>> be fixed rather than adding what would be many stylized named entities,
>> which would require the same code fixes.
>>
> 
> The <sup> and <sub> elements have been in HTML well over 20 years, with no
> progress in implementations. I can imagine some of the reasons to this.

Out-of-the-box rendering of <sup> and <sub> may be rather crude, but I
guess it should be possible to do a lot better with some dose of CSS and 
possibly some Web fonts.


> But this is completely independent of the issue of named character
> reference. It does not affect the rendering the least whether SUPERSCRIPT
> FOUR appears in HTML source as such (as character data), as numeric
> reference &#x2074;, or as named reference &sup4;. The only differences
> between the latter two are that 1) the named reference is more mnemonic and
> therefore easier to write and 2) an HTML user agent needs to have an entry
> for it in its mapping table from names to numbers (so the implementation is
> extremely trivial, and the question would be how fast it would be made and
> how fast the installed browser base would be updated).

In theory, it could be made quite quickly. But it is a slippery slope. 
There are always more characters for which somebody may want additional 
named character entities. And so my guess would be that the browser 
makers would be very cautious.
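
The mapping table mentioned above can be looked at with Python's
standard library, which ships a copy of the HTML5 named character
reference table. This is only a sketch of how a lookup behaves, under
the assumption that html.unescape is a fair stand-in for a user agent;
browsers use their own implementations. The helper name resolve is
made up for this illustration.

```python
import html
from html.entities import html5  # name -> character table from HTML5


def resolve(reference):
    """Resolve a character reference the way html.unescape does."""
    return html.unescape(reference)


# The numeric reference works for any code point.
print(resolve("&#x2074;") == "\u2074")  # prints True

# &sup2; has an entry in the table, so it resolves to SUPERSCRIPT TWO.
print(resolve("&sup2;") == "\u00b2")    # prints True

# &sup4; has no entry, so it is left in the text unchanged.
print(resolve("&sup4;"))                # prints &sup4;
print("sup4;" in html5)                 # prints False
```

Adding sup4 would be one more entry in such a table, which is why the
change itself is trivial and the real question is where to stop.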

Regards,    Martin.

> Personally, I don’t see a problem in writing &#x2074; (and &#x2075; etc.)
> after I have learned to remember this. But the point is that when people
> complain that &sup4; does not work, then the answer should not be “use
> <sup>4</sup>”. It’s something very different, and there are ways to use
> SUPERSCRIPT FOUR even in circumstances where you cannot type it directly or
> as a named reference.
> 
> Yucca
> 


