<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On 12 Oct 2023, at 11:33, Oren Watson via Unicode <<a href="mailto:unicode@corp.unicode.org" class="">unicode@corp.unicode.org</a>> wrote:</div><br class="Apple-interchange-newline"><div class="">

  
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" class="">

  
  <div class="">

<div class="">Replied off list, whoops.<br class=""><br class="">> words are just black boxes until rendering, we cannot format an accent in red with a black base character<br class="">This depends on the layout system. XeLaTeX can absolutely do this, for example here in Greek. The Unicode text:<img style="max-width: 100%" src="cid:3iodb21959d" class=""><br class=""></div><div dir="auto" class="">Ο{\color{red}̔} πα{\color{red}́}ντα βιωφελε{\color{red}́}στατος Αι{\color{red}̓}{\color{red}́}σοπος, ο{\color{red}̔} λογοποιο{\color{red}́}ς, τη{\color{red}͂}{\color{red}ͅ} με{\color{red}̀}ν τυ{\color{red}́}χη{\color{red}ͅ} η{\color{red}̓}{\color{red}͂}ν δου{\color{red}͂}λος,</div></div></div></blockquote><div><br class=""></div>This is very much a grey area. That it is, or can be split into having, a combining character does not necessarily mean that the combining character should be separately stylable.</div><div><br class=""></div><div>To be nitpicky and extremely strict, the combining marks in the text above combine with } as the base character, not anything else, and } with a combining mark is not the end meta-bracket in (…)TeX… (You are fortunate that } does not canonically combine with any combining character; that is not the case with > (as used in HTML, XML), which does combine canonically with U+0338 COMBINING LONG SOLIDUS OVERLAY, giving U+226F ≯.)</div><div><br class=""></div><div>/Kent K</div><div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><div dir="auto" class="">Even though formatting and diacritics are interspersed, XeLaTeX has no problem.<br class=""><br class="">HTML renderers don't support this, but apparently this is sometimes done in Arabic printing to separate different kinds of diacritics.<br class=""><br class=""></div><div dir="auto" class="">---<br class=""></div><div dir="auto" class="">Oren Watson (he/him)<br class=""></div><div dir="auto" class=""><a target="_blank" rel="noopener noreferrer" 
href="mailto:orenwatson@tutanota.com" class="">orenwatson@tutanota.com</a><br class=""></div><blockquote class="tutanota_quote" style="border-left: 1px solid #93A3B8; padding-left: 10px; margin-left: 5px;"><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">11 Oct 2023, 11:51 by <a href="mailto:unicode@corp.unicode.org" class="">unicode@corp.unicode.org</a>:<br class=""></div><blockquote class="tutanota_quote" style="border-left: 1px solid #93A3B8; padding-left: 10px; margin-left: 5px;"><div class="">On 11 Oct 2023 10:28, Piotr Karocki via Unicode wrote:<br class=""></div><blockquote class=""><blockquote class="">Additionally some HTML tags are about formatting <p> <h1>, etc.<br class=""></blockquote></blockquote><blockquote class=""><div class="">I disagree.<br class=""></div><div class="">HTML is about text structure, not about formatting/rendering.<br class=""></div><div class=""><h1> is rendered differently for different output devices: monitor, printer,<br class=""></div><div class="">Braille 'display', narrator (text to voice), etc.<br class=""></div></blockquote><div class=""><br class=""></div><div class="">Which font size? Which "version" of Braille? Which voice (male/female, accent)? And so on.<br class=""></div><div class=""><br class=""></div><div class="">In any case, you have a different definition of "formatting". Maybe we should stop using that word and instead use "plain text", "structure", "style", "rendering" (with less risk of misinterpretation). You count only the last two as "formatting". I consider everything above "plain text" to be "formatting". Two empty lines in an email mean /new paragraph/ and should be displayed as such. Should I really distinguish that from italic or bold (marked with slashes or asterisks)? So, depending on the application, we have different terminologies. We seldom need to distinguish so many levels. 
This group is one where the distinction is important.<br class=""></div><div class=""><br class=""></div><div class="">Note: Unicode General Category Cf (Other, format) includes various "structure" characters (so closer to HTML than to CSS "features").<br class=""></div><div class=""><br class=""></div><div class="">Note: On Microsoft Windows, "Paste without Formatting" keeps mostly plain text plus some structure (new lines, lists), but not much more. That is a third definition of "formatting".<br class=""></div><div class=""><br class=""></div><blockquote class=""><div class="">Unicode should be used to specify character/glyph/sign, HTML to add text<br class=""></div><div class="">structure, and CSS used only to force rendering (so it should be used very<br class=""></div><div class="">rarely).<br class=""></div></blockquote><div class=""><br class=""></div><div class="">I think that is Manichaean. It may be the aim, but human languages are too diverse and complex to allow a perfect split of domains. Thematically, too, it is difficult (and artificial) to split things in such a manner.<br class=""></div><div class=""><br class=""></div><div class="">Final rendering requires an additional step after "CSS", usually done by different engines: layout/shaping/font rendering. The Unicode Standard touches this part too. Interaction of characters is an important topic in the Unicode Standard: when to form ligatures and graphemes (and grapheme clusters), and how to request or avoid them (ZWJ/ZWNJ, variation selectors, etc.). Such rendering decisions are intrinsic to how scripts and languages are written (the topic of the Unicode Standard). So some styling decisions are made at the level of Unicode. On the other hand, some are not made at the Unicode level. 
(Ligatures: you may get them in a Roman font but not in a typewriter font, and obviously we have more and different ones in cursive fonts.)<br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Maybe we should see Unicode as the last step: HTML (structure), CSS (rendering), and Unicode as glyph selection. That is more in line with reality: words are just black boxes until rendering, we cannot format an accent in red with a black base character: the formatting stage, in HTML too, happens before the Unicode "Combining" category comes into play. (Unicode doesn't mandate a glyph, but it describes possible ligatures and real-world cases; the decision is off-loaded to font designers, but the infrastructure is in Unicode.)<br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Note: I see what you want to tell us. I just think HTML/CSS cannot be a generic (for all languages/uses) markup language/style (and if we expand it for such a task, the outcome will become ugly). But again: a task for the future.<br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">ciao<br class=""></div><div class="">cate<br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Appendix: some special cases for strict layering.<br class=""></div><div class=""><br class=""></div><div class="">Unicode has "forms" (as blocks and with variation selectors). Is it wrong to have them? (They would have to move to CSS, but CSS has no notion of glyphs.)<br class=""></div><div class=""><br class=""></div><div class="">Spaces and new lines are considered control characters (OK, "spaces" may have a double meaning). 
So that is already a strange case, but we can just interpret it as an "escape-like" "sequence" at a lower layer.<br class=""></div><div class=""><br class=""></div><div class="">Some box characters and many technical symbols require some formatting (alignment of lines when other symbols are nearby [right/left/above/below and possibly diagonal]). In such cases the semantics of the character place strong requirements on structure and rendering. (And you can change the charmap, but then you can also have a different font rendering/engine with very specific CSS.)<br class=""></div><div class=""><br class=""></div><div class="">SHY (soft hyphen, U+00AD or HTML &shy;): is it structure? Style? Glyph semantics?<br class=""></div><div class=""><br class=""></div><div class="">And semantics (which you just uncovered with your last paragraph) is also a problem. For now, no Unicode/HTML/CSS can style currency or numbers in my personal way: CSS doesn't know what currencies are (that part of the text), HTML doesn't mandate tagging them differently, and Unicode can only help by providing a small space character (which is not so useful either).<br class=""></div></blockquote><div dir="auto" class=""><br class=""></div></blockquote><div dir="auto" class=""><br class=""></div>  </div>


</div></blockquote></div><br class=""></body></html>