Teletext separated mosaic graphics

Harriet Riddle harjitmoe at outlook.com
Wed Oct 7 17:25:08 CDT 2020


Doug Ewell via Unicode wrote:
> Richard Wordingham wrote:
>
>> […]
>> That strikes me as a very good description of most of the 27 (as at
>> Version 12) characters with an Indic syllabic category of virama.
> A non-spacing mark (Mn) is not a control character (Cc). Whether it is rendered as a separate glyph or by modifying the glyph of a neighboring character is not the issue.
>
> There is no such thing in Unicode as a character which has more than one General_Category value. Either a character is a control character, or it is not.
>
> Of course, I can create a program or a protocol that takes ordinary graphic characters such as < and >, and handles them in some special way, but then I am creating a new layer on top of plain text.
>
> --
> Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
>
---

Some comparisons of type-Cc and non-type-Cc characters with comparable, 
although not necessarily identical, behaviours (provided that the 
type-Cc characters are interpreted in accordance with ECMA-48, as I 
shall come to later):

  * CR (U+000D), LF (U+000A) and NEL (U+0085) are all Cc, whereas
    LS/LSEP (U+2028) is Zl.
  * VT (U+000B) and FF (U+000C) are Cc, whereas PS/PSEP (U+2029) is Zp.
  * BPH (U+0082) is Cc, whereas SHY (U+00AD) and ZWSP (U+200B) are both Cf.
  * NBH (U+0083) is Cc, whereas WJ (U+2060) and ZWNBSP/BOM (U+FEFF) are
    both Cf.
  * PLU (U+008C) to start a superscript is Cc, whereas IAS (U+FFFA) to
    start a furigana section is Cf.
  * SSA (U+0086) and its terminator ESA (U+0087) are Cc, whereas for
    example RLO (U+202E), which similarly affects all following
    characters until further notice, is Cf.
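The General_Category values in the list above can be checked against the
Unicode Character Database. A minimal sketch using Python's standard
unicodedata module follows; the short names are only the labels used in
this mail (not formal Unicode names), the helper name is made up for
illustration, and the results reflect whichever Unicode version the
Python build ships with:

```python
import unicodedata

# (label used in this mail, code point) pairs taken from the list above.
PAIRS = [
    ("CR", 0x000D), ("LF", 0x000A), ("NEL", 0x0085), ("LS", 0x2028),
    ("VT", 0x000B), ("FF", 0x000C), ("PS", 0x2029),
    ("BPH", 0x0082), ("SHY", 0x00AD), ("ZWSP", 0x200B),
    ("NBH", 0x0083), ("WJ", 0x2060), ("ZWNBSP", 0xFEFF),
    ("PLU", 0x008C), ("IAS", 0xFFFA),
    ("SSA", 0x0086), ("ESA", 0x0087), ("RLO", 0x202E),
]

def categories(pairs):
    """Map each label to the General_Category of its code point."""
    return {name: unicodedata.category(chr(cp)) for name, cp in pairs}

if __name__ == "__main__":
    for name, cat in categories(PAIRS).items():
        print(f"{name:7} {cat}")
```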

That being said, not everything which is appropriate for a Cc character
is appropriate elsewhere: it would clearly be inappropriate for (say)
DC1 or BEL, both of which issue instructions to something very much
outside of the sandbox (so to speak) of the text renderer, to be
anything other than Cc characters. However, format effector functions
(such as the above), i.e. those which constitute instructions to the
text renderer and/or layout engine specifically, evidently do not have
to be possessed by Cc characters. Indeed, this is the entire purpose of
the Cf (format) category.

It is perhaps helpful, in short, to distinguish between a control code
in the vernacular sense (non-printing but does something) and a control
code in the much more restricted sense of a category Cc character. The
former may have functions defined by Unicode itself, whereas the latter
are the domain of a control code standard such as ECMA-48.

Anyway, regarding ECMA-48 versus not ECMA-48:

Interpretation of Cc characters seems to be treated as a higher-level
protocol, per section 23.1 of the Unicode core specification, which
names ISO 6429 (i.e. ECMA-48) as /one possible/ such protocol but not
the only one. It lists semantics only for HT, LF, VT, FF, CR, FS, GS,
RS, US and NEL (i.e. the format effectors and information separators),
and describes the basic concept of an ESC sequence without fully
specifying its higher-level syntax, expressly leaving escape sequences
and the interpretation of most control codes to higher-level protocols.

ISO 10646 similarly names ISO 6429 (i.e. ECMA-48) in section 11, but 
qualifies this with "or similarly structured standards". Section 12.4 
specifies the escape sequences to indicate use of ECMA-48 within UCS, 
but then (on the next page) specifies the general sequences to indicate 
use of other ISO-IR control code sets within UCS. Confusingly, this 
specification of how an ECMA-35 control code set designation is to be 
represented in UCS (i.e. padded to the word size of the encoding—a moot 
point in UTF-8) comes after section 11's statement of ISO 2022 (i.e. 
ECMA-35) designation escapes being forbidden in UCS. I personally 
understand this apparent contradiction in the standard as meaning that 
designation escapes for /graphical sets/ are forbidden per section 11 
(UCS being a monolithic graphical set in itself, they would be ambiguous 
and nonsensical in meaning were they used), but that those for /control 
code sets/ may be used with appropriate padding if required by higher 
level protocols per section 12.4, since the semantics of category Cc 
characters are left more open to higher protocols.
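To illustrate what "padded to the word size of the encoding" amounts to,
here is a rough sketch. It assumes (from my recollection of the ISO-IR
registration, so please check it against ISO 10646 itself) that the
sequence identifying the ECMA-48 C1 set is ESC 02/02 04/03; the function
name is made up for illustration:

```python
ESC = 0x1B

# Assumption: ESC 02/02 04/03 identifies the ECMA-48 (ISO 6429) C1 set.
C1_ECMA48 = bytes([ESC, 0x22, 0x43])

def pad_escape(seq: bytes, encoding: str) -> bytes:
    """Represent each byte of an ECMA-35 escape sequence as one code unit
    of the given UCS encoding form."""
    # Every byte of such a sequence is below 0x80, so decoding as ASCII
    # yields the code points U+001B, U+0022, U+0043 unchanged.
    return seq.decode("ascii").encode(encoding)

if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(enc, pad_escape(C1_ECMA48, enc).hex(" "))
```

In UTF-8 the sequence comes out byte-for-byte identical, which is why the
padding is a moot point there; in UTF-16 and UTF-32 each byte widens to
two or four octets respectively.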

I understand the sum of this to be that, while use of ECMA-48 for 
interpreting category Cc characters is recommended, this can be 
overridden by prior agreement on another higher level standard protocol.

However: MARC 21, the standard defining character encodings for
Library of Congress records, uses a subset of ISO 6630 with some
extensions (in positions not used by ISO 6630) as its C1 set within
MARC-8 (its 8-bit, somewhat ECMA-35-based encoding), but uses ECMA-48
as its C1 set within Unicode. This means that it resorts to using SOS
and ST instead of NSB and NSE (which mark up a range of characters to
be ignored during collation but nonetheless displayed). Notably,
MARC-8's extensions to the ISO 6630 C1 set are ZWJ and ZWNJ, which are
included in Unicode as non-Cc characters (U+200D and U+200C, both Cf).
So there is some precedent for considering it inappropriate to just
copy C0 and C1 codes from non-ECMA-48 sets into Unicode streams.

However: EBCDIC mappings (both UTF-EBCDIC and the Microsoft-supplied 
ones on Unicode.org) conventionally map the EBCDIC control codes to 
Unicode in a specific manner (well, two specific manners, differing only 
in LF→LF and NL→NEL versus NL→LF and LF→NEL) but, apart from aligning 
either LF or NL up with NEL, these make no attempt at any sort of 
partial compatibility with the ECMA-48 C1 set (e.g. putting SBS at 
U+0098 and SPS at U+008D, as opposed to aligning them with PLD and PLU 
at U+008B and U+008C respectively, which do the same thing). They do, 
however, match ASCII/ECMA-48 with their C0 mappings. So using C1 control 
mappings which pay little or no regard to ECMA-48 is not without 
precedent either.
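The conventional mapping can be observed with Python's built-in cp037
codec, which as far as I know is generated from the mapping file on
Unicode.org. The EBCDIC byte positions below (LF at 0x25, NL at 0x15,
SPS at 0x09, SBS at 0x38) are my own assumption about the usual EBCDIC
control assignments, not something stated above, and the helper name is
made up:

```python
# Assumed EBCDIC positions of the controls discussed above.
EBCDIC_CONTROLS = {"LF": 0x25, "NL": 0x15, "SPS": 0x09, "SBS": 0x38}

def ebcdic_to_unicode(byte: int, codec: str = "cp037") -> int:
    """Return the code point that one EBCDIC byte decodes to."""
    return ord(bytes([byte]).decode(codec))

if __name__ == "__main__":
    for name, byte in EBCDIC_CONTROLS.items():
        print(f"{name:3} 0x{byte:02X} -> U+{ebcdic_to_unicode(byte):04X}")
```

With cp037 this is the LF→LF, NL→NEL variant; SPS and SBS land at
U+008D and U+0098 rather than at the PLU and PLD positions.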

Final note: I previously linked the ISO-IR document for the Videotex
Data Syntax 2 (ITU-T T.101 Annex C) "Serial" variant C1 controls,
otherwise known as the "Attribute Control Set for UK Videotex". This is
registered with ISO-IR, and hence does also have an escape sequence to
declare it as stipulated in section 12.4 of ISO 10646 (the bit on page
20, specifically). The Teletext set, by contrast, is not registered.
However, the Data Syntax 2 Serial Videotex C1 controls are basically
the same as the ETS Teletext control set but with ESC removed, CSI
added in its place, and encoded over the C1 range rather than the C0
range as in Teletext. Since Teletext's unusual use of ESC for code
switching would presumably be handled in the process of transcoding to
Unicode, this would be one way of marshalling Teletext control data
through Unicode with a higher level protocol, provided that
interoperation with something using ECMA-48 codes besides CSI or its
sequences is not needed (e.g. DCS in terminals or OSC in terminal
emulators).
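As a rough sketch of that marshalling, and no more than that: the code
below assumes that each Teletext control at 0x00 to 0x1F corresponds to
the Data Syntax 2 Serial C1 control at the same offset within 0x80 to
0x9F (which is how I read the two sets), and that separated mosaic
graphics is Teletext 0x1A. The function name is hypothetical.

```python
TELETEXT_ESC = 0x1B  # used by Teletext for code switching, not as ECMA-35 ESC

def teletext_control_to_c1(byte: int) -> str:
    """Map one Teletext control byte onto the C1 range (assumed +0x80)."""
    if not 0x00 <= byte <= 0x1F:
        raise ValueError("not a Teletext control code")
    if byte == TELETEXT_ESC:
        # That position holds CSI in the Videotex set, so the code switch
        # has to be resolved by the transcoder instead of passed through.
        raise ValueError("ESC must be handled while transcoding")
    return chr(0x80 + byte)

if __name__ == "__main__":
    # Assumption: 0x1A is "separated mosaic graphics" in Teletext.
    print(f"U+{ord(teletext_control_to_c1(0x1A)):04X}")
```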

-- Har.

