Unicode encoding philosophy

Giacomo Catenazzi cate at cateee.net
Wed Oct 11 03:02:55 CDT 2023


On 9 Oct 2023 02:38, Martin J. Dürst via Unicode wrote:
> On 2023-10-05 21:22, Giacomo Catenazzi via Unicode wrote:
> 
>> Note: formatting is important, but it should be done at a different 
>> level (we should not repeat the errors of the 1960s-1980s of mixing 
>> text and formatting, and putting formatting in "binary"/codepoints: we 
>> need a verbose and human-readable syntax). IMHO HTML is not good enough 
>> for all formatting needs, but I do not think it should be done in 
>> Unicode (or at least not at the codepoint level, but at the UAX level 
>> or in something more "independent" like ICU).
> 
> Please note that formatting (and in particular saying what's bold or 
> italic) isn't really the business of HTML, but CSS.


It is complex, and "formatting" is probably difficult to define. CSS is 
certainly used for the *realisation* of the formatting. The rest is 
complex.

CSS can do some things autonomously (e.g. :first-child), but in most 
cases you have to mark the boundaries of the formatting in HTML (tags, 
classes, ids). As an example, I do not think it is appropriate to use 
class='red' in HTML to tell CSS how to justify a box. Additionally, some 
HTML tags are themselves about formatting: <p>, <h1>, etc. (paragraph, 
chapter title, etc.). We can argue about the separation of semantics and 
style between HTML and CSS, but for Unicode both are on the other side 
of the line. Unicode lacks CSS-like styling, but also a way to express 
semantic separation (except for some ruby support). ASCII does provide 
different levels of separation (FS, GS, RS, US, but also EM, FF, CR/LF, 
and SPACE), and ECMA-48 is more about style (but as I found on 
Wikipedia, each terminal has its own interpretation of "red" and 
"highlight red", and users may redefine palettes, black on white vs 
white on black). But that is also outside Unicode (which just gives some 
support for doing it transparently on a different layer).
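
To make the ASCII separator levels concrete, here is a minimal Python 
sketch. Only the four control codes come from ASCII; the nesting I use 
(groups containing records containing units) and the name split_levels 
are my own assumptions for illustration.

```python
# ASCII information separators (C0 controls), widest to narrowest scope.
FS = "\x1c"  # File Separator (listed for completeness, unused below)
GS = "\x1d"  # Group Separator
RS = "\x1e"  # Record Separator
US = "\x1f"  # Unit Separator


def split_levels(text: str) -> list[list[list[str]]]:
    """Split text into groups -> records -> units (hypothetical layout)."""
    return [
        [record.split(US) for record in group.split(RS)]
        for group in text.split(GS)
    ]


# Two groups: the first has two records, the second has one.
sample = "a" + US + "b" + RS + "c" + GS + "d"
print(split_levels(sample))
```

The separators give structure without any styling, and a Unicode string 
carries them unchanged; what they mean is decided on another layer.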

We have a different definition of "formatting" compared to W3C, and 
sometimes that causes big problems, e.g. the problematic <sub> and <sup> 
(in a few contexts W3C seems to consider a subscript as text, hence the 
lack of support for e.g. units in drop-down menus when following the 
best practices of Unicode). Or there is overlap in some fields (e.g. 
HTML: text direction, ruby; CSS: Variant selector).
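
A small Python sketch of the drop-down case: as far as I know, the 
content of an HTML <option> is text only, so <sub> markup is not 
available there and a label has to fall back on Unicode subscript 
characters. The helper name to_subscript and the sample label are 
hypothetical; only the code points U+2080..U+2089 come from Unicode.

```python
# Map ASCII digits 0-9 to SUBSCRIPT ZERO..SUBSCRIPT NINE (U+2080..U+2089).
SUBSCRIPT_DIGITS = str.maketrans(
    {str(i): chr(0x2080 + i) for i in range(10)}
)


def to_subscript(digits: str) -> str:
    """Return the string with ASCII digits replaced by subscript digits."""
    return digits.translate(SUBSCRIPT_DIGITS)


# Plain-text label for a menu entry, no markup involved.
label = "CO" + to_subscript("2")
print(label)
```

This only covers the characters that happen to have subscript forms in 
Unicode, which is one reason it is not a general formatting mechanism.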


So: you cannot just combine plain Unicode strings with CSS to get nicely 
formatted text.

Personally, I feel that in the future we will need a generic markup 
language (as a Unicode-like project with a broad scope: for every living 
language, and in this case also for writing conventions: not just 
articles, but signs, wood inscriptions, etc., which sometimes have 
different rules). But it is a huge task, I think more complex than 
Unicode's own, so not for "today". And it is not something Unicode 
should do (but possibly a side entity).

But for italic: why not just use HTML/CSS (which has good support for 
the Latin script, and Western scripts in general, which are the ones 
that require it)? Or just ECMA, until we get the resources (and possibly 
after we have also solved the font problems).
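
For reference, this is roughly what italic looks like at the ECMA 
layer, assuming "ECMA" here means the ECMA-48 control sequences (SGR 3 
switches italic on, SGR 23 switches it off). The helper name italic is 
hypothetical, and whether a terminal really renders italic depends on 
the terminal and its font.

```python
ESC = "\x1b"
CSI = ESC + "["           # Control Sequence Introducer
ITALIC_ON = CSI + "3m"    # SGR 3: italicized
ITALIC_OFF = CSI + "23m"  # SGR 23: not italicized


def italic(text: str) -> str:
    """Wrap text in SGR sequences that switch italic on and off."""
    return f"{ITALIC_ON}{text}{ITALIC_OFF}"


print("Unicode stays " + italic("plain text") + " underneath.")
```

The styling lives in control sequences around the text, not in the 
characters themselves, so the Unicode string is the same with or 
without it.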

But this last point may also give us some hints: why do we no longer use 
ECMA for such formatting? Microsoft certainly knew it very well, e.g. 
for Microsoft Word (it originated as an MS-DOS, so console, program, and 
later had various versions in parallel with the GUI one). And the same 
goes for other cases. I really suspect that such formatting is in the 
wrong layer, so it is not easy to program with or to build file formats 
on. The same applies to this proposal: it is not expressive enough for 
all cases, so it would be a special case, just making code more complex, 
and for what gain?

ciao
	cate

