Unicode encoding philosophy

Giacomo Catenazzi cate at cateee.net
Wed Oct 11 05:41:49 CDT 2023


On 11 Oct 2023 10:28, Piotr Karocki via Unicode wrote:
>> Additionally some HTML tags are about formatting <p> <h1>, etc.

>   I disagree.
>   HTML is about text structure, not about formatting/rendering.
> <h1> is rendered differently for different output devices: monitor, printer,
> Braille 'display', narrator (text to voice), etc.

Which font size? Which "version" of Braille? Which voice (male/female, 
accent)? And so on.

In any case, you have a different definition of "formatting". Maybe we 
should stop using that word and instead use "plain text", "structure", 
"style", and "rendering" (with less risk of misinterpretation). You 
consider only the last two to be "formatting". I consider everything 
above "plain text" to be "formatting". Two empty lines in an email mean 
/new paragraph/, and it should be displayed as such. Should I really 
distinguish that from italic or bold (written with slashes or 
asterisks)? So, depending on the application, we have different 
terminologies. We seldom need to distinguish so many steps, but this 
group is one where the distinction is important.

Note: Unicode General Category Cf (Other, format) includes various 
"structure" characters (so it is closer to HTML than to CSS "features").
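To make the Cf note concrete, here is a minimal sketch that asks Python's 
standard unicodedata module for the General Category of a few code 
points. The selection of samples and the helper name general_categories 
are my own choices for illustration, not anything from the thread.

```python
import unicodedata

# Hand-picked sample code points (an assumption for illustration):
# invisible characters that steer joining, direction, or line breaking.
SAMPLES = {
    "ZERO WIDTH NON-JOINER": "\u200c",
    "ZERO WIDTH JOINER": "\u200d",
    "LEFT-TO-RIGHT MARK": "\u200e",
    "SOFT HYPHEN": "\u00ad",
    "WORD JOINER": "\u2060",
}


def general_categories(samples):
    """Map each sample name to its two-letter Unicode General Category."""
    return {name: unicodedata.category(ch) for name, ch in samples.items()}


if __name__ == "__main__":
    for name, cat in general_categories(SAMPLES).items():
        print(f"{name}: {cat}")
```

All five samples should report Cf, although none of them is "text" in 
the everyday sense.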

Note: On Microsoft Windows, "Paste without Formatting" keeps mostly 
plain text plus some structure (new lines, lists), but not much more. 
That is yet a third definition of "formatting".


>   Unicode should be used to specify character/glyph/sign, HTML to add text
> structure, and CSS used only to force rendering (so it should be used very
> rarely).

I think this is Manichaean. It may be the aim, but human languages are 
too diverse and complex to allow a perfect split of domains. 
Thematically, too, it is difficult (and artificial) to split things in 
such a manner.

Final rendering requires an additional step after "CSS", usually done by 
different engines: layout, shaping, font rendering. And the Unicode 
Standard touches this part too. The interaction of characters is an 
important topic in the Unicode Standard: when to form ligatures and 
graphemes (and grapheme clusters), and how to request or avoid them 
(ZWJ/ZWNJ, variation selectors, etc.). Such rendering decisions are 
intrinsic to how scripts and languages are written (the topic of the 
Unicode Standard). So some styling decisions are made at the level of 
Unicode. On the other hand, some are not made at the Unicode level 
(ligatures: you may get them in a Roman font but not in a typewriter 
font, and obviously there are more, and different ones, in cursive 
fonts).
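As a rough illustration of how such hints travel inside the text itself, 
the sketch below builds two strings with Python: one with ZWNJ between 
"f" and "i" (asking a shaping engine not to join them), and one with 
VARIATION SELECTOR-16 after U+2764. Whether a renderer honours either 
hint depends on the font and engine; the helper strip_format_chars is 
hypothetical and only shows that the hint is an ordinary code point.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
VS16 = "\ufe0f"  # VARIATION SELECTOR-16

plain = "fi"                      # a font may draw this as one ligature
no_ligature = "f" + ZWNJ + "i"    # same letters, plus a "do not join" hint
heart_emoji = "\u2764" + VS16     # base character plus presentation hint


def strip_format_chars(text):
    """Drop every code point whose General Category is Cf."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    print(len(plain), len(no_ligature))          # code point counts
    print(strip_format_chars(no_ligature) == plain)
    print(unicodedata.name(VS16), unicodedata.category(VS16))
```

Note that the variation selector is a nonspacing mark (Mn), not Cf, so 
the helper above leaves it in place.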


Maybe we should see Unicode as the last step: HTML (structure), CSS 
(rendering), and Unicode as glyph selection. That is more in line with 
reality: words are just black boxes until rendering. We cannot format an 
accent in red on a black base character, because the formatting stage, 
in HTML too, happens before the Unicode "combining" characters are 
resolved. (Unicode doesn't mandate a glyph, but it describes possible 
ligatures and real-world cases. The decision is off-loaded to font 
designers, but the infrastructure is in Unicode.)
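The base-plus-accent point can be seen at the code point level. This is 
a small sketch, assuming Python's unicodedata module, that decomposes a 
precomposed letter with NFD and lists each resulting code point with its 
canonical combining class. The function name decompose is mine.

```python
import unicodedata


def decompose(text):
    """Return (code point, name, combining class) for the NFD form of text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in unicodedata.normalize("NFD", text)
    ]


if __name__ == "__main__":
    # U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
    for row in decompose("\u00e9"):
        print(row)
```

A nonzero combining class marks the accent as something that attaches to 
the preceding base; markup applied to the accent alone has to survive 
shaping for a red accent on a black letter to appear.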


Note: I see what you want to tell us. I just think HTML/CSS cannot be a 
generic markup/style language (for all languages and uses), and if we 
expand it for such a task, the outcome will become ugly. But again: a 
task for the future.


ciao
	cate


Appendix: some special cases about strict layering.

Unicode has "forms" (as blocks and with variation selectors). Is it 
wrong to have them? (They should be moved to CSS, but CSS has no notion 
of glyphs.)

Spaces and new lines are considered control characters (OK, "spaces" may 
have a double meaning). So that is already a strange case, but we can 
just interpret them as an "escape-like" "sequence" at a lower layer.
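The "double meaning" shows up if one looks at the General Category 
values. A minimal sketch, assuming Python's unicodedata module; the 
labels in WHITESPACE are my own, since control characters have no 
formal character name to look up.

```python
import unicodedata

# Labels are informal (an assumption); C0 controls have no Unicode name.
WHITESPACE = {
    "space": " ",
    "no-break space": "\u00a0",
    "tab": "\t",
    "line feed": "\n",
    "carriage return": "\r",
    "line separator": "\u2028",
    "paragraph separator": "\u2029",
}


def whitespace_categories(samples):
    """Map each informal label to its Unicode General Category."""
    return {label: unicodedata.category(ch) for label, ch in samples.items()}


if __name__ == "__main__":
    for label, cat in whitespace_categories(WHITESPACE).items():
        print(f"{label}: {cat}")
```

In this data the ordinary space is a separator (Zs), while tab, line 
feed and carriage return are controls (Cc), and Unicode additionally 
has dedicated line and paragraph separators (Zl, Zp).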

Some box characters and many technical symbols require some formatting 
(alignment of lines when other symbols are nearby 
[right/left/above/below, and possibly diagonals]). In such cases the 
semantics of the character place strong requirements on structure and 
rendering. (And you can change the charmap, but then you can also have a 
different font rendering/engine with very specific CSS.)

SHY (soft hyphen, U+00AD or HTML &shy;): is it structure? Style? Glyph 
semantics?

And semantics (which you just uncovered with your last paragraph) are 
also a problem. For now, no combination of Unicode/HTML/CSS can style 
currencies or numbers in my own way: CSS doesn't know what currencies 
are (which part of the text). HTML doesn't mandate tagging them 
differently, and Unicode can only help by providing a small space 
character (which is not that useful either).
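For what the "small space" amounts to in practice, here is a minimal 
sketch in Python that groups digits with U+202F NARROW NO-BREAK SPACE. 
The choice of that particular code point and the helper name 
group_digits are assumptions for illustration; nothing in Unicode tells 
the renderer that the result is a number or a currency amount.

```python
NARROW_NBSP = "\u202f"  # NARROW NO-BREAK SPACE (assumed separator choice)


def group_digits(value, sep=NARROW_NBSP):
    """Format an integer with thousands groups separated by sep."""
    # Python's "," format option groups by three; swap the comma afterwards.
    return f"{value:,}".replace(",", sep)


if __name__ == "__main__":
    print(group_digits(1234567))
    print(group_digits(1234567, sep="'"))
```

The spacing ends up baked into the text as plain characters, which is 
exactly why it cannot be restyled later per reader or per locale.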

