Re: Re: Is emoji +VC15, +VC16, without VC one or two columns with monospace font?🏝

Mon Apr 28 07:21:51 CDT 2025

On 28 April 2025 at 12:00, Giacomo Catenazzi via Unicode <unicode at corp.unicode.org> wrote:

> On 2025-04-26 12:51, Eli Zaretskii via Unicode wrote:
>>> From: Dilyan Palauzov <b at bapha.be>
>>> Cc: unicode at corp.unicode.org
>>> Date: Sat, 26 Apr 2025 12:00:50 +0300
>>>
>>>> From: Eli Zaretskii <eliz at gnu.org>
>>>> Subject: Re: Is emoji +VC15, +VC16, without VC one or two columns with monospace font?🏝️
>>>> Date: 26/04/25 10:02:48
>>>>
>>>> I think you have a very outdated mental model of how the Windows console works and how it represents and encodes characters. In particular, the width of a character is NOT determined by the length of its byte sequence, but by the font glyphs used to display those characters.
>>>
>>> I am confused. Does the width of an emoji/a character depend on the font (thus font designers decide this), or does it depend on EastAsianWidth.txt?
>>
>> It depends on the font, but the font is supposed to go by what EastAsianWidth.txt says.
>
> It is worse. We are discussing monospaced fonts, and so terminals may choose where to display each character (skipping all font hints), overwriting part of a character.

The concept of a character grid, as originally implemented in legacy systems, fundamentally implies non-overlapping columns and rows of equal width and height, with each character cell holding its own character code and, where supported, its background and foreground colors. Therefore, some terminals only allow monospaced fonts (otherwise the character cell width is undefined and the grid cannot be rendered), and as a special case they may allow duospaced fonts for CJK codepage compatibility. Some terminals may allow proportional fonts, but they distort the font to fit the character grid, not the other way around.

> Also note: I do not like the division Unix/non-Unix. "Unix" terminals have had different interpretations. E.g. if we look at the initial Unicode support of xterm (the mother of many "Unix pseudoterminals"), we learn that it supported only "Unicode level 1" (an obsolete term from old Unicode standards, or perhaps only from ISO). So it interpreted each codepoint independently (no combining codepoints).

Perhaps it would be better to use another term, such as 'non-random-access terminals' or 'variable-memory cell terminals', for the terminals that have departed from the original concept that each character cell has a constant amount of memory associated with it.

> Also, here is good documentation on the width of characters in terminals (problems, solutions, and how width is interpreted in many implementations), from Ghostty (the new kid on the block): https://mitchellh.com/writing/grapheme-clusters-in-terminals
>
> giacomo

The metric compatibility of terminals is generally a matter of backwards compatibility, so the behavior of legacy platforms is relevant. The use of wcwidth is specific to Unix-like environments. Fullwidth characters actually originate in legacy CJK encodings, where the two consecutive bytes of a double-byte character are placed in two consecutive character cells. On legacy non-Unix platforms the width precedent is set by legacy codepages, not by the wcwidth function. For example, ¨ (U+00A8) is 0xF9 in CP850 (single byte, hence halfwidth), but 0x81 0x4E in Shift JIS/CP932 (double byte, hence fullwidth). Therefore a backwards-compatible Unicode extrapolation of a legacy terminal would still have to vary its behavior depending on the system locale/codepage.

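The codepage dependence of U+00A8 can be checked with a minimal Python sketch. It assumes the legacy convention described above (one character cell per byte of the codepage encoding); the helper name legacy_cell_width is hypothetical and is not part of any terminal or library. The last line only reports the East Asian Width property that Python's unicodedata module records for the same character.

```python
import unicodedata


def legacy_cell_width(ch, codepage):
    """Hypothetical helper: cells occupied = bytes in the legacy encoding.

    This models the legacy convention only (one cell per byte);
    it is not how wcwidth or any particular terminal computes width.
    """
    return len(ch.encode(codepage))


diaeresis = "\u00a8"  # DIAERESIS

for codepage in ("cp850", "cp932"):
    encoded = diaeresis.encode(codepage)
    # Print the codepage, the encoded bytes in hex, and the implied cell count.
    print(codepage, encoded.hex(" "), legacy_cell_width(diaeresis, codepage))

# East Asian Width property of the same character, as known to this Python build.
print(unicodedata.east_asian_width(diaeresis))
```

The same code point thus yields a different cell count depending only on which legacy codepage is taken as the compatibility baseline.
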
The Win32 console seems to apply codepage-specific compatibility: in non-CJK codepages it simply maps each non-control character (or each UCS-2 code unit when using Unicode text) directly to its own character cell, but in CJK codepages it uses the appropriate CJK fonts and maps the codepage's fullwidth characters to two consecutive character cells to maintain compatibility (and for bidirectional codepages it might be doing something else entirely). However, the string "🧑‍🌾" is encoded as the five 16-bit code units 0xD83E 0xDDD1 0x200D 0xD83C 0xDF3E (two UTF-16 surrogate pairs joined by U+200D), and since none of those code units has any CJK codepage compatibility precedent, they are written directly into five character cells regardless of the system codepage. The content of those cells can be retrieved with the ReadConsoleOutput function (resulting in a random-access array of CHAR_INFO structures), and this itself sets a precedent for Win32 console compatibility. This is of course very different from wcwidth compatibility or mode 2027 compatibility, and yet it is not mentioned in that article. That article therefore describes widths only in Unix-like contexts.
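
The code unit sequence quoted above can be reproduced with a short Python sketch. It does not call the Win32 console API; it only models the "one 16-bit code unit per character cell" behavior described in this message, and the function name utf16_code_units is made up for the illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text (illustrative helper)."""
    raw = text.encode("utf-16-le")
    # Each code unit is two bytes, little-endian, unsigned.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# U+1F9D1 ADULT, U+200D ZERO WIDTH JOINER, U+1F33E EAR OF RICE
farmer = "\U0001F9D1\u200D\U0001F33E"

units = utf16_code_units(farmer)

# Under the model described above, each code unit lands in its own cell.
cells = ["0x%04X" % unit for unit in units]
print(len(cells), cells)
```

A grapheme-cluster-aware terminal would treat the same string as a single cluster, which is the gap between the two compatibility models.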