State of ECMA-48 in a Unicode age (was Re: “plain text styling”…)

Harriet Riddle harjitmoe at outlook.com
Sun Jan 8 15:45:01 CST 2023


Kent Karlsson via Unicode wrote:
> Well, yes... But the problem is that, IIUC, the ECMA-48 committee is 
> currently the empty set of people…
>
> /K

---

Tangentially to this, I very much believe that a new edition of ECMA-48
which clarifies and addresses its relationship both to Unicode and to
established convention would be of practical benefit, with the following
points standing out to me:

① The penultimate edition of ECMA-48 (fourth edition, December 1986, 
still archived at the bottom of ECMA's page for ECMA-48) deprecates a 
number of mode flags and control functions in Appendix E (pages 84–87).  
Notable amongst these deprecations is LF/NLM (E.1.3, bottom of page 84 / 
top of page 85), i.e. the mode flag that toggles whether linefeeds imply 
a carriage return.  The note (E.1) attached to that section explains the 
deprecation, stating essentially that full line breaks should use either 
NEL or CR+LF moving forward; the IND control for explicit bare linefeed 
is also in the deprecated features appendix as E.2.3.  In the fifth and 
current edition (June 1991), per the 1998 reprint available as PDF from 
ECMA's page, both LF/NLM and IND were removed altogether, announced in 
annex F.5.2 (page numbered 88 / 102nd PDF page) and annex F.8.2 (page 
numbered 89 / 103rd PDF page) respectively.  The definition of LF 
(section 8.3.74, page numbered 49 / 63rd PDF page) unambiguously 
specifies a move to the "corresponding character position" (as opposed 
to the start) of the following line.  Therefore, /most terminal 
emulators (which accept bare LF as a full newline in their default 
modes) are actually in violation of the current edition of ECMA-48 (and 
exhibit deprecated but permitted behaviour per the edition before), as 
are virtually all modern text editors, for example/.  Honestly, the 
elimination of the LF/NLM mode comes across as wishful thinking on the 
part of the committee, but hindsight is 20/20.  A mode such as LF/NLM
probably ought to be restored so as to align the standard with the 
by-now-set-in-stone reality.
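
To make the difference concrete, here is a minimal sketch in Python of a
cursor model.  It is not taken from ECMA-48 or from any real terminal
emulator: the function name cursor_after and its new_line_mode flag are
hypothetical, the latter standing in for the removed LF/NLM mode, and
line wrapping and scrolling are ignored.

```python
def cursor_after(data: str, new_line_mode: bool = False) -> tuple[int, int]:
    """Return the (row, column) cursor position after processing data.

    new_line_mode is a hypothetical flag modelling the old LF/NLM mode:
    when set, LF also returns to the start of the line.
    """
    row = col = 0
    for ch in data:
        if ch == "\r":        # CR: start of the current line
            col = 0
        elif ch == "\n":      # LF: next line, same column (fifth edition)
            row += 1
            if new_line_mode:
                col = 0       # what most software does in practice
        elif ch == "\x85":    # NEL: start of the next line
            row += 1
            col = 0
        else:                 # treat everything else as one printing cell
            col += 1
    return row, col


strict = cursor_after("ab\ncd")                       # bare LF, strict reading
lenient = cursor_after("ab\ncd", new_line_mode=True)  # bare LF as full newline
crlf = cursor_after("ab\r\ncd")                       # CR+LF
nel = cursor_after("ab\x85cd")                        # NEL
```

Under these assumptions only the strict reading of a bare LF leaves the
cursor in column 4 rather than column 2; CR+LF, NEL and the lenient mode
all agree with one another.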

② Speaking of ECMA-48 and the CR vs LF vs CRLF vs NEL vs LSEP issue, 
better coördination between UAX 14 and ECMA-48 might be in order.  This 
doesn't cause as much of an issue in practice, since the contexts where 
ECMA-48 is actually implemented (monospaced terminal emulators) are 
largely disjoint from those where UAX 14 is implemented, but it should
be clearer how an implementation can be concordant with both (for 
example, whether an ECMA-48 conformant implementation of CR or VT is 
sufficient to count as a line break for UAX 14 purposes).  This is 
particularly relevant should one wish to use ECMA-48 in a non-terminal 
context, as seems to be part of the present discussion.

③ Section 5 needs reworking to address how it interacts with Unicode 
Transformation Formats (other than the erstwhile abortive UTF-1, with
which it works fine, for all the difference that makes to anything).  The
representation of the C0 and C1 codes is given in 7-bit and 8-bit 
column/line bit combinations.  I believe ISO/IEC 10646 briefly addresses 
how these translate to UTF-16 or UTF-32 (padding to code unit width), 
but this would be ideal to have addressed in ECMA-48 itself in this day 
and age; furthermore, even with that provision, "bit combinations from 
08/00 to 09/15" in the context of UTF-8 arguably prescribes fragmentary 
or invalid UTF-8 sequences rather than the UTF-8 representations of the 
C1 code points.
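
The mismatch can be seen with nothing more than Python's built-in
codecs.  This is only an illustration using CSI (09/11, U+009B) as the
example C1 control; the variable names are mine.

```python
csi = "\u009b"                    # CSI as a C1 code point

# Padding to code unit width, as in UTF-16 and UTF-32 (big-endian here).
utf16 = csi.encode("utf-16-be")   # two bytes: 00 9B
utf32 = csi.encode("utf-32-be")   # four bytes: 00 00 00 9B

# UTF-8 represents the same code point as a two-byte sequence.
utf8 = csi.encode("utf-8")        # two bytes: C2 9B

# The literal 8-bit bit combination 09/11 is a lone continuation byte,
# which is not valid UTF-8 on its own.
raw = bytes([0x9B])
try:
    raw.decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False
```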

④ Also in section 5: command strings (DCS, OSC, PM and APC) are limited 
to 0x08–0D (the ASCII FEx format effectors) and 0x20–7E (the ASCII 
printing characters including space).  This is contrasted with character 
strings (SOS) which have no such restriction, with only SOS itself being 
forbidden (and ST not includable due to being the terminator).  In 
practice, not only ASCII printing characters but arbitrary Unicode 
characters—other than Cc control codes outside of the aforementioned 
0x08–0D range—are permitted in OSC sequences recognised by terminal 
emulators, which often contain text.  For example, 
"\u{9D}0;flambé\u{9C}" will set a terminal window title to "flambé", 
even though "é" is not an ASCII character.  This is another area which 
probably needs updating to align it with both industry practice and a 
Unicode age.
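
As a rough check of that restriction, the following Python sketch tests
whether a payload stays within the two byte ranges given above.  The
function name is hypothetical, and encoding the title as UTF-8 is an
assumption about what the terminal expects.

```python
def is_strict_command_string(payload: bytes) -> bool:
    """True if every byte is in 0x08-0x0D or 0x20-0x7E, the ranges
    that section 5 allows inside DCS, OSC, PM and APC strings."""
    return all(0x08 <= b <= 0x0D or 0x20 <= b <= 0x7E for b in payload)


ascii_title = "0;flambe".encode("utf-8")
accented_title = "0;flambé".encode("utf-8")   # "é" becomes bytes C3 A9

ascii_ok = is_strict_command_string(ascii_title)
accented_ok = is_strict_command_string(accented_title)
```

Here ascii_ok is true and accented_ok is false, so the window title that
terminal emulators accept in practice falls outside what section 5
permits.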

⑤ The list of characters affected by the FEAM mode in section 7.2.5
needs looking at: for instance, it lists BPH (equivalent to ZWSP) but not
its opposite NBH (equivalent to WJ).  It also lists CR and NEL but not 
LF, all of which are format effectors per section 8.2.4.  The 
interaction with Unicode general categories should also be addressed: 
presumably it would apply to the format category (Cf), and possibly also 
Zl and Zp, in addition to the specific listed Cc characters and CSI 
sequences, but this should be addressed in the FEAM definition, annex 
A.1, or both.
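
The general categories in question can be looked up with Python's
standard unicodedata module.  The selection of characters below is my
own, chosen to pair each ECMA-48 control mentioned above with its
nearest Unicode counterpart; it is not a list taken from either
standard.

```python
import unicodedata

characters = {
    "BPH": "\u0082",    # C1 BREAK PERMITTED HERE
    "NBH": "\u0083",    # C1 NO BREAK HERE
    "NEL": "\u0085",    # C1 NEXT LINE
    "CR": "\r",
    "LF": "\n",
    "ZWSP": "\u200b",   # ZERO WIDTH SPACE
    "WJ": "\u2060",     # WORD JOINER
    "LSEP": "\u2028",   # LINE SEPARATOR
    "PSEP": "\u2029",   # PARAGRAPH SEPARATOR
}

# Map each name to its Unicode general category.
categories = {name: unicodedata.category(ch) for name, ch in characters.items()}

for name, category in categories.items():
    print(f"{name:5} {category}")
```

The ECMA-48 controls all come out as Cc, whereas ZWSP and WJ are Cf and
the two separators are Zl and Zp, which is why a FEAM definition written
purely in terms of listed controls leaves their status open.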

⑥ Speaking of annex A, annex A.2 and the GCC sequence might deserve 
addressing as to their relation to Unicode.  Certainly, ECMA-43 
(conformed to by ISO 8859; and yes, the graphical resemblance between
"ECMA-43" and "ECMA-48" can be confusing whenever these two standards 
are discussed together) puts significant limitations on the ECMA-48 
codes used for composition, prohibiting any such composition that 
creates a new character rather than merely a ligature of existing 
characters (see annex C of ECMA-43, contrast with annex A.2 of 
ECMA-48).  This both bans backspace composition, and constrains the use 
of GCC to discretionary ligatures (which is not explicitly constrained 
by ECMA-48 itself—indeed, annex A seems to prescribe GCC as a migration 
path from backspace composition—although the note on the definition of 
GCC itself in section 8.3.54 mentions CJK square ligatures as the simple 
case, not diacritic composition or APL composition).  Backspace 
composition is similarly not really compatible with the Unicode model of 
base characters, combining characters and pre-composed diacritic-bearing 
characters (and composed-symbol APL operators without decompositions), 
although discretionary ligatures are manifestly compatible with the 
Unicode character model (see e.g. the OpenType dlig feature and the CSS 
font-variant-ligatures property).
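
The incompatibility of backspace composition with the Unicode model can
be illustrated with Python's unicodedata module.  The particular
backspace sequence below (letter, BS, spacing ACUTE ACCENT) is only an
assumed example of the technique, not one quoted from ECMA-48 or
ECMA-43.

```python
import unicodedata

precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"       # e + COMBINING ACUTE ACCENT
backspace = "e\b\u00b4"     # e + BS + spacing ACUTE ACCENT (assumed example)

# Unicode normalisation relates the combining sequence to the
# precomposed character...
combining_nfc = unicodedata.normalize("NFC", combining)

# ...but knows nothing about overstriking via BS, so the backspace
# sequence passes through NFC unchanged.
backspace_nfc = unicodedata.normalize("NFC", backspace)

combining_matches = combining_nfc == precomposed
backspace_matches = backspace_nfc == precomposed
```

Here combining_matches is true and backspace_matches is false: nothing
in the Unicode character model ties the overstruck sequence to the
accented letter.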

--Har.

