Unicode encoding philosophy
Giacomo Catenazzi
cate at cateee.net
Wed Oct 11 05:41:49 CDT 2023
On 11 Oct 2023 10:28, Piotr Karocki via Unicode wrote:
>> Additionally some HTML tags are about formatting <p> <h1>, etc.
> I disagree.
> HTML is about text structure, not about formatting/rendering.
> <h1> is rendered differently for different output devices: monitor, printer,
> Braille 'display', narrator (text to voice), etc.
Which font size? Which "version" of Braille? Which voice (male/female,
accent)? Etc.
In any case, you have a different definition of "formatting". Maybe we
should stop using that word and instead use "plain text", "structure",
"style", and "rendering" (with less risk of misinterpretation). You
consider only the last two to be "formatting". I consider everything
above "plain text" to be "formatting". Two empty lines in an email mean
/new paragraph/ and should be displayed as such. Should I really
distinguish that from italic or bold (written with slashes or
asterisks)? So, depending on the application, we have different
terminologies. We seldom need to distinguish so many steps. This group
is one where the distinction is important.
Note: Unicode General Category Cf (Other, format) includes various
"structure" characters (so more like HTML than like CSS "features").
Note: On Microsoft Windows, "Paste without Formatting" keeps mostly
plain text and some structure (new lines, lists) but not much more. So
that is a third definition of "formatting".
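On the Cf note above: a minimal sketch, assuming Python's standard
unicodedata module, that looks up the General Category of a few
characters. The selection of code points is mine, for illustration only;
the result depends on the Unicode version shipped with your Python.

```python
import unicodedata

# Code points picked for illustration; each is expected to be in Cf.
samples = ["\u00ad", "\u200c", "\u200d", "\u200f", "\u2060"]

# Map the official character name to its General Category.
categories = {unicodedata.name(ch): unicodedata.category(ch) for ch in samples}

for name, cat in categories.items():
    print(f"{cat}  {name}")
```

All of these report Cf, including the joiners and the soft hyphen that
come up later in this mail.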
> Unicode should be used to specify character/glyph/sign, HTML to add text
> structure, and CSS used only to force rendering (so it should be used very
> rarely).
I think that is Manichaean. It may be the aim, but human languages are
too diverse and complex to allow a perfect split of domains. It is also
thematically difficult (and artificial) to split things in such a
manner.
Final rendering requires an additional step after "CSS", usually done by
different engines: layout/shaping/font rendering. And the Unicode
Standard also touches this part. Interaction of characters is an
important topic in the Unicode Standard: when to form ligatures and
graphemes (and grapheme clusters), how to request or avoid them
(ZWJ/ZWNJ, variation selectors, etc.). Such rendering decisions are
intrinsic to how scripts and languages are written (the topic of the
Unicode Standard). So some styling decisions are made at the level of
Unicode. On the other hand, some are not made at the Unicode level
(ligatures: you may get them in a Roman font but not in a typewriter
font, and obviously we have more and different ones in cursive fonts).
Maybe we should see Unicode as the last step: HTML (structure), CSS
(rendering), and Unicode as glyph selection. That is more in line with
reality: words are just black boxes until rendering, and we cannot
format an accent in red with a black base character: the formatting
stage, in HTML too, happens before the Unicode "Combining" category
comes into play.
(Unicode doesn't mandate a glyph, but it describes possible ligatures
and real-world cases; the decision is off-loaded to font designers, but
the infrastructure is in Unicode.)
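To make the accent/base-character point concrete, here is a small
sketch, again assuming Python's unicodedata module. The example string
is my own choice. It shows that a base letter and a combining accent are
separate code points in the text, while normalization (and later the
shaping engine) treats them as one unit.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: two code points, one visible letter.
decomposed = "e\u0301"

# For each code point: (code point, General Category, canonical combining class).
info = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))
    for ch in decomposed
]

# NFC composes the pair into the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(info)
print(len(decomposed), len(composed))
```

Markup that wraps only the second code point is addressing something
that exists in the character stream but may not exist as a separate
glyph once the font has done its work.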
Note: I see what you want to tell us. I just think HTML/CSS cannot be a
generic (for all languages/uses) markup/style language (and if we
expand it for such a task, the outcome will become ugly). But again: a
task for the future.
ciao
cate
Appendix: some special cases about strict layering.
Unicode has "forms" (as blocks and with variation selectors). Is it wrong
to have them? (They should be moved to CSS, but CSS has no idea about
glyphs.)
Spaces and new lines are considered control characters (ok, "spaces" may
have a double meaning). So that is already a strange case, but we can
just interpret it as an "escape-like" "sequence" at a lower layer.
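For reference, a short sketch (Python's unicodedata module; the list of
characters is my selection) of how the General Category classifies
spaces and line breaks. As far as the category goes, the ASCII line
breaks and tab are controls (Cc), while SPACE and NO-BREAK SPACE are
space separators (Zs), so the "double meaning" is visible in the data.

```python
import unicodedata

# Characters picked for illustration only:
# SPACE, NO-BREAK SPACE, TAB, LINE FEED, CARRIAGE RETURN,
# LINE SEPARATOR, PARAGRAPH SEPARATOR.
chars = [" ", "\u00a0", "\t", "\n", "\r", "\u2028", "\u2029"]

whitespace_categories = {
    f"U+{ord(ch):04X}": unicodedata.category(ch) for ch in chars
}

for code_point, cat in whitespace_categories.items():
    print(code_point, cat)
```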
Some box characters and many technical symbols require some formatting
(alignment of lines when other symbols are nearby
[right/left/above/below and possibly diagonals]). In such cases the
semantics of the character place strong requirements on structure and
rendering. (And you can change the charmap, but then you can also have a
different font rendering/engine, even with very specific CSS.)
SHY (soft hyphen, U+00AD or HTML &shy;): is it structure? Style? Glyph
semantics?
And semantics (which you just uncovered with your last paragraph) is also
a problem. For now, no Unicode/HTML/CSS can style currency or numbers in
my personal way: CSS doesn't know what currencies are (which part of
the text). HTML doesn't mandate tagging them differently, and Unicode
may just help by providing a small space character (but that is also
not so useful).