Unicode encoding philosophy

orenwatson at tutanota.com
Thu Oct 12 04:33:32 CDT 2023


replied off list, whoops.

> words are just black boxes until rendering, we cannot format an accent in red with a black base character

This depends on the layout system. XeLaTeX can absolutely do this. For example, here it is in Greek. The Unicode text:
Ο{\color{red}̔} πα{\color{red}́}ντα βιωφελε{\color{red}́}στατος Αι{\color{red}̓}{\color{red}́}σοπος, ο{\color{red}̔} λογοποιο{\color{red}́}ς, τη{\color{red}͂}{\color{red}ͅ} με{\color{red}̀}ν τυ{\color{red}́}χη{\color{red}ͅ} η{\color{red}̓}{\color{red}͂}ν δου{\color{red}͂}λος,

Even though formatting and diacritics are interspersed, XeLaTeX has no problem.
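
For anyone who wants to produce markup like the sample above, here is a minimal Python sketch. It assumes the input is first decomposed with NFD so that each accent becomes its own combining code point, and that every character in a Mark category (Mn/Mc/Me) should be colored. The function name color_marks is hypothetical, and this is only one possible way to generate such text, not necessarily how the sample was made.

```python
import unicodedata


def color_marks(text, color="red"):
    """Wrap every combining mark in a LaTeX-style color group.

    Assumption: NFD decomposition is acceptable for the input, so
    precomposed letters are split into base letter + combining marks.
    """
    out = []
    for ch in unicodedata.normalize("NFD", text):
        # General categories Mn, Mc and Me are the combining marks.
        if unicodedata.category(ch).startswith("M"):
            out.append("{\\color{%s}%s}" % (color, ch))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Prints the base letters unchanged, with each accent in its own group.
    print(color_marks("ὁ πάντα"))
```

The base letters pass through untouched, so the layout engine still sees base and mark adjacent to each other in the text stream; only the mark sits inside a group.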

HTML renderers don't support this, but apparently this is done in Arabic printing sometimes to separate different kinds of diacritics.

---
Oren Watson (he/him)
orenwatson at tutanota.com

> 11 Oct 2023, 11:51 by unicode at corp.unicode.org:
>
>> On 11 Oct 2023 10:28, Piotr Karocki via Unicode wrote:
>>
>>>> Additionally some HTML tags are about formatting <p> <h1>, etc.
>>>>
>>> I disagree.
>>> HTML is about text structure, not about formatting/rendering.
>>> <h1> is rendered differently for different output devices: monitor, printer,
>>> Braille 'display', narrator (text to voice), etc.
>>>
>>
>> Which font size? Which "version" of Braille? Which voice (male/female, accent)?, etc.
>>
>> In any case, you have a different definition of "formatting". Maybe we should stop using that word and use instead "plain text", "structure", "style", "rendering" (with lower risk of misinterpretation). You consider only the last two to be "formatting". I consider everything above "plain text" to be "formatting". Two empty lines in an email mean /new paragraph/ and should be displayed as such. Should I really distinguish that from italic or bold (written with slashes or asterisks)? So depending on the application, we have different terminologies. Seldom must we distinguish so many steps. This group is one where the distinction is important.
>>
>> Note: Unicode Category Cf (Other, formatting) includes various "structure" characters (so as HTML and not CSS "features")
>>
>> Note: On Microsoft Windows: "Paste without Formatting" is mostly plaintext, and some structure (new lines, lists) but not much more. Also a third definition of "formatting".
>>
>>
>>> Unicode should be used to specify character/glyph/sign, HTML to add text
>>> structure, and CSS used only to force rendering (so it should be used very
>>> rarely).
>>>
>>
>> I think that is Manichaean. It may be the aim, but human languages are too diverse and complex to create a perfect split of domains. Thematically, too, it is difficult (and artificial) to split in such a manner.
>>
>> Final rendering requires an additional step after "CSS", usually done by different engines: layout/shaping/font rendering. And the Unicode Standard also touches this part. Interaction of characters is an important topic in the Unicode Standard: when to form ligatures and graphemes (and grapheme clusters), and how to avoid them (ZWJ/ZWNJ, variation selectors, etc.). Such rendering decisions are intrinsic to how scripts and languages are written (the topic of the Unicode Standard). So some styling decisions are made at the level of Unicode. On the other hand, some are not made at the Unicode level (ligatures: you may get them in a Roman font but not in a typewriter font, and obviously we have more and different ones in cursive fonts).
>>
>>
>> Maybe we should see Unicode as the last step: HTML (structure), CSS (rendering), and Unicode as glyph selection. That is more in line with reality: words are just black boxes until rendering, we cannot format an accent in red with a black base character: the formatting stage, also in HTML, happens before we get to the Unicode "Combining" category. (Unicode doesn't mandate a glyph, but it describes possible ligatures and real-world cases; the decision is off-loaded to font designers, but the infrastructure is in Unicode.)
>>
>>
>> Note: I see what you want to tell us. Just I think HTML/CSS cannot be a generic (for all languages/uses) markup language/style (and if we expand it for such task, the outcome will become ugly). But again: a task for future.
>>
>>
>> ciao
>> cate
>>
>>
>> Appendix: some special cases about strict layering.
>>
>> Unicode has "forms" (as blocks and with variant selectors). Is it wrong to have them? (should be moved to CSS, but they do not have ideas about glyphs).
>>
>> Spaces and new lines are considered control characters (ok, "spaces" may have double meaning). So already a strange case, but we can just interpret it as an "escape-like" "sequence" at a lower layer.
>>
>> Some box characters and many technical symbols require some formatting (alignment of lines when other symbols are nearby [right/left/above/below and possibly diagonal]). In such cases the semantics of the character place strong requirements on structure and rendering. (and you can change charmap, but so you can have different font rendering/engine also with very specific CSS).
>>
>> SHY (Soft hyphen, U+00AD or HTML &shy;): is it structure? style? Glyph semantic?
>>
>> And semantics (you just uncovered this with your last paragraph) is also a problem. For now, no Unicode/HTML/CSS can style currency or numbers in my personal way: CSS doesn't know what currencies are (that part of the text), HTML doesn't mandate tagging them differently, and Unicode may just help by providing a small space character (but that is also not so useful).
>>
>
>
