<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">William_J_G Overington via Unicode

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com">In

      relation to teletext control codes, my opinion is that they need

      to be encoded separately from the C0 control range. This would

      ensure that in interchange that none of the teletext control codes

      is ever misinterpreted as having the basic C0 character meaning.

      <br>

    </blockquote>

    <br>

    What is the "basic C0 character meaning"?<br>

    <br>

    In fact, the method of declaring when alternative C0 and C1 sets are
    in use is already <i>explicitly covered</i> by ISO 10646 in section
    13.4, which I excerpt as follows (I've added some explanations in
    square brackets):<br>

    <blockquote>

      <p>For other C0 or C1 sets, the final octet F shall be obtained

        from the International Register of Coded Character

        Sets. The identifier sequences for these sets shall be

      </p>

      <ul>

        <li>ESC 02/01 F <i>[i.e. 0x1B 0x21 then a byte from 0x40–7E (or

            0x30–3F for a private use C0 set)]</i> identifies a C0 set </li>

      </ul>

      <ul>

        <li>ESC 02/02 F <i>[i.e. 0x1B 0x22 then a byte from 0x40–7E (or

            0x30–3F for a private use C1 set)]</i> identifies a C1 set </li>

      </ul>

      <p>If such an escape sequence appears within a code unit sequence

        conforming to ISO/IEC 2022 <i>[this strictly speaking includes

          e.g. ISO-8859-2, EUC-JP and ISO-2022-JP but not e.g.

          Windows-1252, Shift_JIS, EBCDIC or any Unicode

          encoding—although this specific provision is applicable to any

          ASCII‑ish encoding]</i>, it shall consist

        only of the sequences of bit combinations as shown above <i>[i.e.

          each byte value in the escape sequence shall be emitted as a

          single byte of the specified binary value without transcoding]</i>.

      </p>

      <p>If such an escape sequence appears within a code unit sequence

        conforming to this document <i>[i.e. in a Unicode string]</i>,

        it shall be padded

        in accordance with Clause 12 <i>[i.e. in UTF-16 or UTF-32, a

          whole code unit shall be emitted for each byte in the escape

          sequence, not just the single byte; this has no actual effect

          on UTF-8]</i>.

      </p>

    </blockquote>

    (end of excerpt)<br>

    <br>
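    To make the padding rule concrete, here is a minimal Python sketch
    (the helper name encoded_hex is my own, purely for illustration)
    that encodes the IR-056 identifier sequence in three encoding forms.
    The big-endian codecs are chosen only so that no byte order mark
    gets in the way:<br>
<pre>
```python
# ESC 02/02 04/00: the identifier sequence for IR-056 as the C1 set.
DESIGNATE_C1_TELETEXT = "\x1b\x22\x40"


def encoded_hex(text, codec):
    """Return the encoded bytes of text as space-separated hex."""
    return text.encode(codec).hex(" ")


# UTF-8 leaves the three bytes as they are; UTF-16 and UTF-32 pad each
# one out to a whole code unit.
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(codec, encoded_hex(DESIGNATE_C1_TELETEXT, codec))
```
</pre>
    In UTF-8 the sequence stays three bytes long, while UTF-16 and
    UTF-32 spend one full code unit per byte of the escape sequence.<br>
    <br>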

    Here's the International Register; notice thirteen separate C0 sets

    in section 2.5 here, plus ten C1 sets in section 2.6:<br>

    <br>

<a class="moz-txt-link-freetext" href="https://web.archive.org/web/20190424200034/https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf">https://web.archive.org/web/20190424200034/https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf</a><br>

    <br>

    Now, the teletext control set, as it appears in the International
    Register, is actually IR-056, which is a C1 (rather than C0) set
    registration with the escape sequence ESC 0x22 0x40. This is because
    one of the ITU T.101 Videotex formats uses the Teletext control set
    as its C1 set, and the registration was referenced to ITU T.101.
    It also means that the registered set has the ECMA-48 CSI in place
    of a duplicate ESC. It is important to note here that Teletext's own
    use of ESC should not even still appear in Teletext data once it's
    transcoded to Unicode, since it's used for character set switching,
    albeit in a manner incompatible with ISO 2022.<br>

    <br>

<a class="moz-txt-link-freetext" href="https://web.archive.org/web/20200614215855/https://www.itscj.ipsj.or.jp/iso-ir/056.pdf">https://web.archive.org/web/20200614215855/https://www.itscj.ipsj.or.jp/iso-ir/056.pdf</a><br>

    <br>

    So the <i>unambiguous</i> way of representing it is to first emit
    the escape sequence U+001B U+0022 U+0040 at the start of the string
    (or before any Teletext control codes appear), and <i>then</i> map
    the Teletext controls (except ESC, which should change transcoder
    state but not be emitted) to U+0080–U+009F. This is <i>already well
      defined by the relevant standards</i>, and no work needs to be
    done on that front.<br>

    <br>
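    As a rough sketch of such a transcoder in Python, under several
    assumptions of mine: the input is one Teletext row with parity
    already stripped, control n lands on U+0080 + n, printable bytes are
    treated as plain ASCII, and the function name and the alternate_g0
    flag are hypothetical. A real transcoder would also need the
    national option subsets and the mosaic sets:<br>
<pre>
```python
ESC = 0x1B
# ESC 02/02 04/00 designates IR-056 as the C1 set.
DESIGNATE_C1_TELETEXT = "\x1b\x22\x40"


def teletext_to_unicode(row):
    """Transcode one parity-stripped Teletext row (bytes) to a str."""
    out = [DESIGNATE_C1_TELETEXT]
    alternate_g0 = False  # hypothetical state toggled by Teletext ESC
    for byte in row:
        if byte == ESC:
            # State change only; nothing is emitted for ESC.
            alternate_g0 = not alternate_g0
        elif byte in range(0x20):
            # Assumed layout: Teletext control n maps to U+0080 + n.
            out.append(chr(0x80 + byte))
        else:
            # Simplification: treat G0 as ASCII. A real transcoder would
            # consult alternate_g0 and the national option subset here.
            out.append(chr(byte))
    return "".join(out)
```
</pre>
    For example, teletext_to_unicode(b"\x02Hi") yields the designation
    sequence followed by U+0082 and the letters "Hi".<br>
    <br>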

    As for adding <i>support</i> for this to e.g. terminal emulators,

    that's another matter, but it's one which would need to be done for

    any other solution you might be inclined to propose too, so that's

    not really saying much.<br>

    <br>

    <blockquote type="cite" cite="mid:43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com">

      There was no ambiguity possible in a teletext system.

      <br>

    </blockquote>

    <br>

    Well, yes, because the higher-level protocols were agreed upon as
    part of the system; it's only when this is mixed with another system
    (e.g. ECMA-48) that there need to be indicators as to which one is
    in use: ESC 0x22 0x43 for ECMA-48's C1 set, versus ESC 0x22 0x40 for
    the Teletext controls used in the C1 area (represented either as
    codepoints or as escape sequences).<br>

    <br>
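    A small Python sketch of how a receiver might tell the two apart.
    The function names and the informal labels are mine; the final bytes
    are the two mentioned above. The second helper shows the
    escape-sequence form of a C1 control, i.e. ESC followed by a byte
    0x40 lower than the codepoint:<br>
<pre>
```python
import re

# Final bytes as given in the message above; the labels are informal.
C1_DESIGNATIONS = {"\x40": "teletext", "\x43": "ecma-48"}


def active_c1_set(text):
    """Return the label of the last C1 designation in text, or None."""
    finals = re.findall("\x1b\x22([\x30-\x7e])", text)
    if not finals:
        return None
    return C1_DESIGNATIONS.get(finals[-1], "unknown")


def c1_as_escape(char):
    """Rewrite a C1 codepoint U+0080..U+009F as ESC plus 0x40..0x5F."""
    code = ord(char)
    if code not in range(0x80, 0xA0):
        raise ValueError("not a C1 codepoint")
    return "\x1b" + chr(code - 0x40)
```
</pre>
    Which meaning U+0082 (or its escape form, ESC 0x42) carries then
    depends on what active_c1_set reports for the preceding text.<br>
    <br>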

    <blockquote type="cite" cite="mid:43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com">

      One possibility is to encode the teletext control characters as a

      block of 32 code points in plane 14, without closing up the unused

      points. These characters in plane 14 would be displayable

      characters and thus not control characters in

      non-teletext-emulating systems, each displayed as a glyph

      specified in The Unicode Standard as two small capital letters

      arranged one above the other, but not overlapping. For example A

      above G for Alphanumerics Green.<br>

    </blockquote>

    <br>

    The existing control pictures never function as format effectors in

    their own right, and it would be weird if others started to.<br>

    <br>

    <blockquote type="cite" cite="mid:43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com">

      I opine that it would be good for the proposal be extended to

      include encoding of the teletext control characters please.

      <br>

    </blockquote>

    <br>

    Control "character" is a bit of a woolly term. The term "control <i>code</i>"
    refers specifically to the category Cc characters (a closed
    category), which don't have formal names (although they do have
    formal aliases) and mostly have behaviour defined by higher level
    protocols rather than by Unicode itself. Some of them actually
    carry instructions for things other than the text renderer (BEL, for
    example). Category Cf, Zl and Zp characters are format effectors:
    a type of nonprinting character that semantically constitutes part
    of the text itself and affects only how it's displayed, by
    triggering (effecting) a particular format behaviour (line break,
    RTL override, permitted or forbidden line break, superscript etc.).
    Some Cc characters from particular C0 or C1 sets are format
    effectors too (LF for example), but the Cf/Zl/Zp ones are
    full-fledged Unicode characters with names and semantics defined by
    Unicode itself, not by a higher level protocol such as ECMA-48 or
    the aforementioned IR-056.<br>

    <br>
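    The category distinction is easy to check with Python's unicodedata
    module. This is only an illustration: the sample characters and the
    describe helper are my own choices:<br>
<pre>
```python
import unicodedata

SAMPLES = {
    "BEL": "\u0007",                  # C0 control code
    "C1 0x82": "\u0082",              # C1 control code
    "RLO": "\u202e",                  # RIGHT-TO-LEFT OVERRIDE
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
    "SYMBOL FOR BELL": "\u2407",      # a control picture
}


def describe(char):
    """Return (general category, formal name or None) for one character."""
    return unicodedata.category(char), unicodedata.name(char, None)


for label, char in SAMPLES.items():
    print(label, *describe(char))
```
</pre>
    The Cc entries come back with no formal name, whereas the Cf, Zl and
    Zp entries (and the control picture, which is an ordinary symbol)
    each have one.<br>
    <br>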

    The Teletext controls are control <i>codes</i>, in existing systems

    that incorporate them.<br>

    <br>

    <blockquote type="cite" cite="mid:43ce7487.27f24.17dc532517e.Webtop.96@btinternet.com">

      <br>

      Could we discuss this please?

      <br>

      <br>

      William Overington

      <br>

      <br>

      Thursday 16 December 2021

      <br>

      <br>

      <br>

    </blockquote>

    <br>

  </body>

</html>