Teletext control codes

Thu Dec 16 18:06:36 CST 2021

William_J_G Overington via Unicode wrote:
> In relation to teletext control codes, my opinion is that they need to 
> be encoded separately from the C0 control range. This would ensure 
> that in interchange none of the teletext control codes is ever 
> misinterpreted as having the basic C0 character meaning.

What is the "basic C0 character meaning"?

In fact, the method of declaring when alternative C0 and C1 sets are in 
use is already /explicitly covered/ by ISO 10646 in section 13.4, which 
I excerpt as follows (I've added some explanations in hard brackets):

    For other C0 or C1 sets, the final octet F shall be obtained from
    the International Register of Coded Character Sets. The identifier
    sequences for these sets shall be

      * ESC 02/01 F /[i.e. 0x1B 0x21 then a byte from 0x40–7E (or
        0x30–3F for a private use C0 set)]/ identifies a C0 set

      * ESC 02/02 F /[i.e. 0x1B 0x22 then a byte from 0x40–7E (or
        0x30–3F for a private use C1 set)]/ identifies a C1 set

    If such an escape sequence appears within a code unit sequence
    conforming to ISO/IEC 2022 /[this strictly speaking includes e.g.
    ISO-8859-2, EUC-JP and ISO-2022-JP but not e.g. Windows-1252,
    Shift_JIS, EBCDIC or any Unicode encoding—although this specific
    provision is applicable to any ASCII‑ish encoding]/, it shall
    consist only of the sequences of bit combinations as shown above
    /[i.e. each byte value in the escape sequence shall be emitted as a
    single byte of the specified binary value without transcoding]/.

    If such an escape sequence appears within a code unit sequence
    conforming to this document /[i.e. in a Unicode string]/, it shall
    be padded in accordance with Clause 12 /[i.e. in UTF-16 or UTF-32, a
    whole code unit shall be emitted for each byte in the escape
    sequence, not just the single byte; this has no actual effect on
    UTF-8]/.

(end of excerpt)
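
To make the excerpt concrete, here is a minimal Python sketch of the two identifier sequences and of the padding rule. The function names `designate_set` and `pad_for_encoding` are invented for this illustration and do not come from any standard or library. The sketch assumes that "padding" simply means encoding each escape-sequence byte as the code point of the same value, as the bracketed explanation above describes.

```python
ESC = 0x1B

def designate_set(kind: str, final: int) -> bytes:
    """Build the identifier sequence for a C0 or C1 set (hypothetical helper)."""
    intermediate = {"C0": 0x21, "C1": 0x22}[kind]  # ESC 02/01 F or ESC 02/02 F
    if not 0x30 <= final <= 0x7E:
        raise ValueError("final byte must be in the range 0x30-0x7E")
    return bytes([ESC, intermediate, final])

def pad_for_encoding(seq: bytes, encoding: str) -> bytes:
    """Emit one whole code unit per escape-sequence byte."""
    # Every byte in the sequence is below 0x80, so decoding as ASCII maps
    # byte 0xNN to U+00NN; re-encoding then pads it to the code unit width.
    return seq.decode("ascii").encode(encoding)

seq = designate_set("C1", 0x40)
print(seq.hex(" "))                                  # 1b 22 40
print(pad_for_encoding(seq, "utf-8").hex(" "))       # 1b 22 40
print(pad_for_encoding(seq, "utf-16-be").hex(" "))   # 00 1b 00 22 00 40
print(pad_for_encoding(seq, "utf-32-be").hex(" "))   # 00 00 00 1b 00 00 00 22 00 00 00 40
```

The UTF-8 output is identical to the raw bytes, which is why the padding rule has no visible effect there.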

Here's the International Register; notice thirteen separate C0 sets in 
section 2.5 here, plus ten C1 sets in section 2.6:

https://web.archive.org/web/20190424200034/https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf

Now, the teletext control set, as it appears in the International 
Register, is actually IR-056, which is a C1 (rather than C0) set 
registration with the escape sequence ESC 0x22 0x40. This is because one 
of the ITU T.101 Videotex formats uses the Teletext control set as its 
C1 set, and the registration was referenced to ITU T.101. It also means 
that the registered set has the ECMA-48 CSI instead of a duplicate ESC. 
Note that Teletext's own use of ESC should not appear in Teletext data 
at all once it has been transcoded to Unicode, since it is used for 
character set switching, albeit in a manner incompatible with ISO 2022.

https://web.archive.org/web/20200614215855/https://www.itscj.ipsj.or.jp/iso-ir/056.pdf

So the /unambiguous/ way of representing it is to first emit the escape 
sequence U+001B U+0022 U+0040 at the start of the string (or before any 
Teletext control codes appear), and /then/ map the Teletext controls 
(except ESC, which should change transcoder state but not be emitted) to 
U+0080–9F. This is /already well defined by the relevant standards/, and 
no work needs to be done on that front.
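
A transcoder along these lines might look roughly like the following Python sketch. Several things here are assumptions made only for illustration: the function name `teletext_to_unicode`, the idea that the input is already 7-bit (parity stripped), the straight offset of 0x80 for the controls, and above all the pass-through of bytes 0x20 and up as if they were ASCII. A real transcoder would select the correct Teletext character table instead, which is where the ESC-driven state would actually be used.

```python
IR056_AS_C1 = "\x1b\x22\x40"  # ESC 0x22 0x40 announces IR-056 as the C1 set

def teletext_to_unicode(data: bytes) -> str:
    """Sketch: map 7-bit Teletext bytes to a Unicode string (hypothetical)."""
    out = [IR056_AS_C1]          # emitted once, before any control codes
    alternate_table = False      # assumed transcoder state toggled by ESC
    for b in data:
        if b == 0x1B:
            # Teletext ESC changes transcoder state and is not emitted.
            alternate_table = not alternate_table
        elif b < 0x20:
            out.append(chr(0x80 + b))   # control code -> U+0080-9F
        else:
            # Placeholder only: treats the byte as ASCII regardless of
            # alternate_table; a real table lookup would go here.
            out.append(chr(b))
    return "".join(out)

result = teletext_to_unicode(b"\x01Hi")
print([f"U+{ord(c):04X}" for c in result])
# ['U+001B', 'U+0022', 'U+0040', 'U+0081', 'U+0048', 'U+0069']
```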

As for adding /support/ for this to e.g. terminal emulators, that's 
another matter, but it's one which would need to be done for any other 
solution you might be inclined to propose too, so that's not really 
saying much.

> There was no ambiguity possible in a teletext system.

Well, yes, because the higher-level protocols were agreed upon as part 
of the system; it's only when this is mixed with another system (e.g. 
ECMA-48) that there need to be indicators as to which one is in use: 
ESC 0x22 0x43 for ECMA-48's C1 set, versus ESC 0x22 0x40 for the 
Teletext controls used in the C1 area (represented either as codepoints 
or escape sequences).
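
The "codepoints or escape sequences" equivalence can be sketched in Python as below. This assumes the usual ISO 2022 / ECMA-35 convention that a C1 control at 0x80+n can be written in a 7-bit environment as ESC followed by the byte 0x40+n. The helper names and the two constants are made up for this example.

```python
ESC = 0x1B

# Identifier sequences mentioned above (constant names are hypothetical).
ANNOUNCE_ECMA48_C1 = bytes([ESC, 0x22, 0x43])
ANNOUNCE_TELETEXT_C1 = bytes([ESC, 0x22, 0x40])

def c1_to_escape(cp: int) -> bytes:
    """Represent a C1 code point (0x80-0x9F) as a two-byte escape sequence."""
    if not 0x80 <= cp <= 0x9F:
        raise ValueError("not a C1 code point")
    return bytes([ESC, cp - 0x40])

def escape_to_c1(seq: bytes) -> int:
    """Inverse of c1_to_escape: ESC plus 0x40-0x5F back to 0x80-0x9F."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a C1 escape sequence")
    return seq[1] + 0x40

print(c1_to_escape(0x9B))              # b'\x1b['
print(hex(escape_to_c1(b"\x1bA")))     # 0x81
```

Either form means the same control function; which function that is depends on the C1 set announced beforehand.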

> One possibility is to encode the teletext control characters as a 
> block of 32 code points in plane 14, without closing up the unused 
> points. These characters in plane 14 would be displayable characters 
> and thus not control characters in non-teletext-emulating systems, 
> each displayed as a glyph specified in The Unicode Standard as two 
> small capital letters arranged one above the other, but not 
> overlapping. For example A above G for Alphanumerics Green.

The existing control pictures never function as format effectors in 
their own right, and it would be weird if others started to.

> I opine that it would be good for the proposal to be extended to include 
> encoding of the teletext control characters please.

Control "character" is a bit of a wooly term. The term "control /code/" 
refers specifically to the category Cc characters (a closed category), 
which don't have formal names (although they do have formal aliases) and 
mostly have behaviour defined by higher level protocols rather than by 
Unicode itself; some of them actually carry instructions for things 
other than the text renderer (BEL, for example). Category Cf, Zl and Zp 
characters are format effectors: nonprinting characters which 
semantically constitute part of the text itself and affect only how 
it's displayed, by triggering (effecting) a particular format behaviour 
(line break, RTL override, permitted or forbidden line break, 
superscript etc). Some Cc characters from particular C0 or C1 sets are 
also format effectors (LF for example), but the Cf/Zl/Zp ones are 
full-fledged Unicode characters with names and semantics defined by 
Unicode itself, not by a higher level protocol such as ECMA-48 or the 
aforementioned IR-056.
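
These category distinctions can be checked with Python's standard `unicodedata` module. The sample characters below are chosen only for illustration; note that the control picture U+2407 is an ordinary symbol (category So), not a control or format character.

```python
import unicodedata

samples = {
    "BEL": "\u0007",
    "LF": "\u000A",
    "C1 position 0x81": "\u0081",
    "RLO": "\u202E",
    "line separator": "\u2028",
    "paragraph separator": "\u2029",
    "control picture for BEL": "\u2407",
}

for label, ch in samples.items():
    category = unicodedata.category(ch)
    # Cc characters have no formal name, so supply a default for them.
    name = unicodedata.name(ch, "<no formal name>")
    print(f"{label}: U+{ord(ch):04X} {category} {name}")
```

The three Cc samples print `<no formal name>`, while the Cf, Zl, Zp and So samples each print a name defined by Unicode.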

The Teletext controls are control /codes/, in existing systems that 
incorporate them.

>
> Could we discuss this please?
>
> William Overington
>
> Thursday 16 December 2021
>
>
