Teletext separated mosaic graphics
Harriet Riddle
harjitmoe at outlook.com
Wed Oct 7 17:25:08 CDT 2020
Doug Ewell via Unicode wrote:
> Richard Wordingham wrote:
>
>> […]
>> That strikes me as a very good description of most of the 27 (as at
>> Version 12) characters with an Indic syllabic category of virama.
> A non-spacing mark (Mn) is not a control character (Cc). Whether it is rendered as a separate glyph or by modifying the glyph of a neighboring character is not the issue.
>
> There is no such thing in Unicode as a character which has more than one General_Category value. Either a character is a control character, or it is not.
>
> Of course, I can create a program or a protocol that takes ordinary graphic characters such as < and >, and handles them in some special way, but then I am creating a new layer on top of plain text.
>
> --
> Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
>
---
Some comparisons of type-Cc and non-type-Cc characters with comparable,
although not necessarily identical, behaviours (provided that the
type-Cc characters are interpreted in accordance with ECMA-48, as I
shall come to later):
* CR (U+000D), LF (U+000A) and NEL (U+0085) are all Cc, whereas
LS/LSEP (U+2028) is Zl.
* VT (U+000B) and FF (U+000C) are Cc, whereas PS/PSEP (U+2029) is Zp.
* BPH (U+0082) is Cc, whereas SHY (U+00AD) and ZWSP (U+200B) are both Cf.
* NBH (U+0083) is Cc, whereas WJ (U+2060) and ZWNBSP/BOM (U+FEFF) are
both Cf.
* PLU (U+008C) to start a superscript is Cc, whereas IAS (U+FFFA) to
start a furigana section is Cf.
* SSA (U+0086) and its terminator ESA (U+0087) are Cc, whereas for
example RLO (U+202E), which similarly affects all following
characters until further notice, is Cf.
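For anyone wanting to check the categories in the list above, here is a
minimal sketch using Python's standard unicodedata module. The
abbreviations are just labels for the listing (the names PAIRS and
general_categories are my own), and the results reflect whichever
Unicode version the Python build ships with.

```python
import unicodedata

# (abbreviation, code point) pairs taken from the list above.
PAIRS = [
    ("CR", 0x000D), ("LF", 0x000A), ("NEL", 0x0085), ("LS", 0x2028),
    ("VT", 0x000B), ("FF", 0x000C), ("PS", 0x2029),
    ("BPH", 0x0082), ("SHY", 0x00AD), ("ZWSP", 0x200B),
    ("NBH", 0x0083), ("WJ", 0x2060), ("ZWNBSP", 0xFEFF),
    ("PLU", 0x008C), ("IAS", 0xFFFA),
    ("SSA", 0x0086), ("ESA", 0x0087), ("RLO", 0x202E),
]


def general_categories(pairs=PAIRS):
    """Return {abbreviation: General_Category} for each listed character."""
    return {name: unicodedata.category(chr(cp)) for name, cp in pairs}


# Print one line per character, e.g. the abbreviation, code point and category.
for name, cp in PAIRS:
    print(f"{name:7} U+{cp:04X}  {unicodedata.category(chr(cp))}")
```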
That being said, not everything which is appropriate for a Cc character
is appropriate elsewhere: it would clearly be inappropriate for (say)
DC1 or BEL, both of which issue instructions to something very much
outside of the sandbox (so to speak) of the text renderer, to be
anything other than Cc characters. However, format effector functions
(such as the above), i.e. those which constitute instructions to the
text renderer and/or layout engine specifically, evidently do not have
to be possessed by Cc characters. Indeed, this is the entire purpose of
the Cf (format) category.
It is perhaps helpful to draw a distinction between a control code in
the vernacular sense (non-printing, but does something) and one in the
much more restricted sense of a category Cc character. The former may
have functions defined by Unicode itself, whereas the functions of the
latter are the domain of a control code standard such as ECMA-48.
Anyway, regarding ECMA-48 versus not ECMA-48:
Interpretation of Cc characters seems to be treated as a higher-level
protocol, per section 23.1 of the Unicode core specification, which
names ISO 6429 (i.e. ECMA-48) as /one possible/ such protocol but not
the only one. It lists semantics only for HT, LF, VT, FF, CR, FS, GS,
RS, US and NEL (i.e. the format effectors and information separators),
and describes the basic concept of an ESC sequence without fully
specifying its higher-level syntax, expressly leaving escape sequences
and the interpretation of most control codes to higher-level protocols.
ISO 10646 similarly names ISO 6429 (i.e. ECMA-48) in section 11, but
qualifies this with "or similarly structured standards". Section 12.4
specifies the escape sequences to indicate use of ECMA-48 within UCS,
but then (on the next page) specifies the general sequences to indicate
use of other ISO-IR control code sets within UCS. Confusingly, this
specification of how an ECMA-35 control code set designation is to be
represented in UCS (i.e. padded to the word size of the encoding—a moot
point in UTF-8) comes after section 11's statement of ISO 2022 (i.e.
ECMA-35) designation escapes being forbidden in UCS. I personally
understand this apparent contradiction in the standard as meaning that
designation escapes for /graphical sets/ are forbidden per section 11
(UCS being a monolithic graphical set in itself, they would be ambiguous
and nonsensical in meaning were they used), but that those for /control
code sets/ may be used with appropriate padding if required by higher
level protocols per section 12.4, since the semantics of category Cc
characters are left more open to higher protocols.
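To illustrate what "padded to the word size of the encoding" amounts to
in practice, here is a rough sketch. The ESC 2/2 F shape of a C1
designation is from ECMA-35; the helper name c1_designation is my own,
and the final byte 4/3 used in the example is what I understand to be
the registered one for the ECMA-48 C1 set, so treat that value as an
assumption and check it against the ISO-IR registry.

```python
ESC = "\x1b"


def c1_designation(final_byte: int) -> str:
    """Build an ECMA-35 C1 designation escape: ESC 2/2 F.

    The final byte is passed in by the caller; this sketch only checks
    that it lies in the 3/0..7/14 range used for final bytes.
    """
    if not 0x30 <= final_byte <= 0x7E:
        raise ValueError("final byte out of range")
    return ESC + "\x22" + chr(final_byte)


# Assumed example value: 4/3, believed to designate the ECMA-48 C1 set.
seq = c1_designation(0x43)

# Encoding the three characters pads each one to the code unit size:
# one byte each in UTF-8, two in UTF-16, four in UTF-32.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, seq.encode(encoding).hex(" "))
```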
I understand the sum of this to be that, while use of ECMA-48 for
interpreting category Cc characters is recommended, this can be
overridden by prior agreement on another higher level standard protocol.
However: MARC 21, the standard defining character encodings for
Library of Congress records, uses a subset of ISO 6630 with some
extensions (in positions not used by ISO 6630) as its C1 set within
MARC-8 (its 8-bit, somewhat ECMA-35-based encoding), yet uses ECMA-48
as its C1 set within Unicode. This means that it resorts to using
SOS and ST instead of NSB and NSE (marking up a range of characters to
be ignored during collation but nonetheless displayed). Notably,
MARC-8's extensions to the ISO 6630 C1 set are ZWJ and ZWNJ, which are
included in Unicode as non-Cc characters (U+200D and U+200C, both Cf).
So there is some precedent for considering it inappropriate to just copy
C0 and C1 codes from non-ECMA-48 sets into Unicode streams.
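As a sketch of how the SOS/ST usage described above might be consumed
(this is my own illustration, not anything taken from the MARC 21
specification; the function name split_nonsorting is hypothetical):
characters between SOS (U+0098) and ST (U+009C) are kept for display but
left out of the collation key.

```python
SOS = "\u0098"  # START OF STRING, standing in for NSB
ST = "\u009c"   # STRING TERMINATOR, standing in for NSE


def split_nonsorting(text: str):
    """Return (display_text, sort_key) for a string using SOS/ST markup.

    Assumes well-formed, non-nested markup; the delimiters themselves
    are dropped from both results.
    """
    display = []
    key = []
    ignoring = False
    for ch in text:
        if ch == SOS:
            ignoring = True
        elif ch == ST:
            ignoring = False
        else:
            display.append(ch)
            if not ignoring:
                key.append(ch)
    return "".join(display), "".join(key)


display, key = split_nonsorting("\u0098The \u009cBeatles")
print(display)  # the full title, without the delimiters
print(key)      # the title minus the non-sorting range
```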
On the other hand: EBCDIC mappings (both UTF-EBCDIC and the
Microsoft-supplied ones on Unicode.org) conventionally map the EBCDIC
control codes to Unicode in a specific manner (well, two specific
manners, differing only in LF→LF and NL→NEL versus NL→LF and LF→NEL).
Apart from aligning either LF or NL with NEL, these make no attempt at
any sort of partial compatibility with the ECMA-48 C1 set (e.g. they put
SBS at U+0098 and SPS at U+008D, as opposed to aligning them with PLD
and PLU at U+008B and U+008C respectively, which do the same thing).
They do, however, match ASCII/ECMA-48 with their C0 mappings. So using
C1 control mappings which pay little or no regard to ECMA-48 is not
without precedent either.
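The EBCDIC mappings mentioned above can be inspected with a small
sketch like the following, using Python's built-in cp037 codec, which as
far as I know follows the same convention (the LF→LF, NL→NEL variant).
The EBCDIC byte positions quoted in the comments (LF at 0x25, NL at
0x15, SBS at 0x38, SPS at 0x09) are as I understand them; the helper
name control_mappings is my own.

```python
import unicodedata


def control_mappings(codec: str = "cp037") -> dict:
    """Map each byte of a single-byte codec that decodes to a Cc
    character onto the code point it decodes to."""
    out = {}
    for b in range(256):
        ch = bytes([b]).decode(codec)
        if unicodedata.category(ch) == "Cc":
            out[b] = ord(ch)
    return out


mappings = control_mappings()
for label, byte in (("LF", 0x25), ("NL", 0x15), ("SBS", 0x38), ("SPS", 0x09)):
    print(f"EBCDIC {label:3} (0x{byte:02X}) -> U+{mappings[byte]:04X}")
```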
Final note: I previously linked the ISO-IR document for the Videotex
Data Syntax 2 (ITU T.101 Annex C) "Serial" variant C1 controls,
otherwise known as the "Attribute Control Set for UK Videotex". This is
registered with ISO-IR, and hence does also have an escape sequence to
declare it as stipulated in section 12.4 of ISO 10646 (the bit on page
20, specifically). The Teletext set, by contrast, is not. However, the
Data Syntax 2 Serial Videotex C1 controls are basically the same as the
ETS Teletext control set, but with ESC removed, CSI added in its place,
and the codes encoded over the C1 range rather than the C0 range as in
Teletext. Since Teletext's unusual use of ESC for code switching would
presumably be handled in the process of transcoding to Unicode, this
would be one way of marshalling Teletext control data through Unicode
with a higher level protocol, provided that interoperation with
something using ECMA-48 codes besides CSI or its sequences is not needed
(e.g. DCS in terminals or OSC in terminal emulators).
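A very rough sketch of that marshalling idea, under several assumptions
of my own: the input is Teletext data with parity already stripped, ESC
has already been resolved by the transcoder (so it is rejected here),
and graphic characters are passed through unchanged as a placeholder for
whatever character set mapping the transcoder actually does. The
function name is hypothetical.

```python
ESC = 0x1B


def marshal_teletext_controls(data: bytes) -> str:
    """Shift Teletext C0-range attribute controls into the C1 range.

    Each control byte 0x00..0x1F becomes U+0080..U+009F, matching the
    layout of the Data Syntax 2 Serial C1 set. ESC is not handled here.
    """
    out = []
    for b in data:
        if b == ESC:
            raise ValueError("ESC should be resolved before marshalling")
        if b < 0x20:
            out.append(chr(b + 0x80))
        else:
            # Placeholder only: real transcoding of graphic characters
            # (including mosaics) is outside the scope of this sketch.
            out.append(chr(b))
    return "".join(out)


print(ascii(marshal_teletext_controls(b"\x01AB\x1d")))
```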
-- Har.