<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Doug Ewell via Unicode wrote:<br>
</div>
<blockquote type="cite" cite="mid:000001d69cec$0d4b1e50$27e15af0$@ewellic.org">
<pre wrap="">Richard Wordingham wrote:
</pre>
<blockquote type="cite">[…]<br>
<pre wrap="">
That strikes me as a very good description of most of the 27 (as at
Version 12) characters with an Indic syllabic category of virama.
</pre>
</blockquote>
<pre wrap="">
A non-spacing mark (Mn) is not a control character (Cc). Whether it is rendered as a separate glyph or by modifying the glyph of a neighboring character is not the issue.
There is no such thing in Unicode as a character which has more than one General_Category value. Either a character is a control character, or it is not.
Of course, I can create a program or a protocol that takes ordinary graphic characters such as < and >, and handles them in some special way, but then I am creating a new layer on top of plain text.
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
</pre>
</blockquote>
---<br>
<br>
Some comparisons of type-Cc and non-type-Cc characters with
comparable, although not necessarily identical, behaviours (provided
that the type-Cc characters are interpreted in accordance with
ECMA-48, as I shall come to later):<br>
<ul>
<li>CR (U+000D), LF (U+000A) and NEL (U+0085) are all Cc, whereas
LS/LSEP (U+2028) is Zl.</li>
<li>VT (U+000B) and FF (U+000C) are Cc, whereas PS/PSEP (U+2029)
is Zp.<br>
</li>
<li>BPH (U+0082) is Cc, whereas SHY (U+00AD) and ZWSP (U+200B) are
both Cf.</li>
<li>NBH (U+0083) is Cc, whereas WJ (U+2060) and ZWNBSP/BOM
(U+FEFF) are both Cf.</li>
<li>PLU (U+008C) to start a superscript is Cc, whereas IAS
(U+FFFA) to start a furigana section is Cf.</li>
<li>SSA (U+0086) and its terminator ESA (U+0087) are Cc, whereas
for example RLO (U+202E), which similarly affects all following
characters until further notice, is Cf.<br>
</li>
</ul>
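For anyone wanting to verify the category assignments cited above, a few
lines of Python against the standard unicodedata module suffice (my own
throwaway check, not anything drawn from the standards under discussion):<br>
<pre>
import unicodedata

# Print the General_Category of a selection of the characters compared above.
for name, cp in [("LF", 0x000A), ("NEL", 0x0085), ("LSEP", 0x2028),
                 ("PSEP", 0x2029), ("SHY", 0x00AD), ("ZWSP", 0x200B),
                 ("WJ", 0x2060), ("ZWNBSP", 0xFEFF), ("IAS", 0xFFFA),
                 ("RLO", 0x202E)]:
    print(f"{name:7} U+{cp:04X} {unicodedata.category(chr(cp))}")
# LF and NEL report Cc; LSEP is Zl and PSEP is Zp; the remainder report Cf.
</pre>
<br>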
That being said, not everything which is appropriate for a Cc
character is appropriate elsewhere: it would clearly be
inappropriate for (say) DC1 or BEL, both of which issue instructions
to something very much outside the sandbox (so to speak) of the
text renderer, to be anything other than Cc characters. However,
format effector functions (such as those above), i.e. those which
constitute instructions to the text renderer and/or layout engine
specifically, evidently do not have to be carried by Cc
characters. Indeed, this is the entire purpose of the Cf (format)
category.<br>
<br>
It is perhaps helpful, in fine, to draw a distinction between a
control code in the vernacular sense (non-printing, but does
something) and a control code in the much more restricted sense of a
category Cc character. The former may have its functions defined by
Unicode itself, whereas the latter is the domain of a control code
standard such as ECMA-48.<br>
<br>
Anyway, regarding ECMA-48 versus not ECMA-48:<br>
<br>
Interpretation of Cc characters seems to be treated as a matter for
higher-level protocols, per section 23.1 of the Unicode core
specification, which names ISO 6429 (i.e. ECMA-48) as <i>one
possible</i> such protocol but not the only one. That section lists
semantics only for HT, LF, VT, FF, CR, FS, GS, RS, US and NEL
(i.e. the format effectors and information separators), and
describes the basic concept of an ESC sequence without fully
specifying the higher-level syntax, expressly leaving escape
sequences and the interpretation of most control codes to
higher-level protocols.<br>
<br>
ISO 10646 similarly names ISO 6429 (i.e. ECMA-48) in section 11, but
qualifies this with "or similarly structured standards". Section
12.4 specifies the escape sequences indicating use of ECMA-48 within
UCS, but then (on the next page) specifies the general sequences
indicating use of other ISO-IR control code sets within UCS.
Confusingly, this specification of how an ECMA-35 control code set
designation is to be represented in UCS (i.e. padded out to the code
unit size of the encoding form, a moot point in UTF-8; a sketch of
what such padding might look like follows after this paragraph)
comes after the statement in section 11 that ISO 2022 (i.e. ECMA-35)
designation escapes are forbidden in UCS. I personally read this
apparent contradiction in the standard as meaning that designation
escapes for <i>graphical sets</i> are forbidden per section 11 (UCS
being a monolithic graphical set in itself, they would be ambiguous
and nonsensical in meaning were they used), but that those for <i>control
code sets</i> may be used, with appropriate padding, if required by
higher-level protocols per section 12.4, since the semantics of
category Cc characters are left more open to such protocols.<br>
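<br>
Purely by way of illustration of that reading (my own sketch, with an
assumed final byte; nothing here is prescribed verbatim by ISO 10646): in
UTF-16, each byte of such a control set designation escape would occupy
its own 16-bit code unit, along these lines:<br>
<pre>
# Sketch only: pad each byte of an ECMA-35 control set designation escape so
# that it occupies one UTF-16 code unit, per the reading of section 12.4 above.
# The final byte 0x43 is assumed here purely for illustration.

def pad_escape_utf16(escape_bytes: bytes, big_endian: bool = True) -> bytes:
    """Represent each byte of an escape sequence as a 16-bit code unit."""
    order = "big" if big_endian else "little"
    return b"".join(b.to_bytes(2, order) for b in escape_bytes)

designation = b"\x1b\x22\x43"   # ESC 02/02 plus an assumed final byte
print(pad_escape_utf16(designation).hex(" "))
# "00 1b 00 22 00 43": three code units, one per byte of the escape sequence
</pre>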
<br>
I understand the sum of this to be that, while use of ECMA-48 for
interpreting category Cc characters is recommended, this can be
overridden by prior agreement on some other standard as the
higher-level protocol.<br>
<br>
However: MARC 21, the standard defining character encodings
for Library of Congress records, uses a subset of ISO 6630 with some
extensions (in positions not used by ISO 6630) as its C1 set within
MARC-8 (its 8-bit, somewhat ECMA-35-based encoding), yet it uses
ECMA-48 as its C1 set within Unicode, which means that it resorts to
using SOS and ST instead of NSB and NSE to mark up a range of
characters to be ignored during collation but nonetheless
displayed. Notably, MARC-8's extensions to the ISO 6630 C1 set are
ZWJ and ZWNJ, which are included in Unicode as non-Cc characters
(U+200D and U+200C, both Cf). So there is some precedent for
considering it inappropriate to simply copy C0 and C1 codes from
non-ECMA-48 sets into Unicode streams.<br>
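<br>
As a toy illustration of that convention (my own sketch, not anything
lifted from the MARC 21 documentation): a collation routine along these
lines would display the field as-is but drop anything bracketed by SOS and
ST when building the sort key:<br>
<pre>
import re

# Strip ranges bracketed by U+0098 (SOS) and U+009C (ST) before sorting,
# while leaving the displayed form untouched.
SOS, ST = "\u0098", "\u009c"
NON_SORTING = re.compile(re.escape(SOS) + ".*?" + re.escape(ST), re.DOTALL)

def sort_key(field: str) -> str:
    return NON_SORTING.sub("", field)

title = SOS + "The " + ST + "Unicode Standard"
print(sort_key(title))   # "Unicode Standard", i.e. the article is ignored
print(title)             # displayed form still begins with "The "
</pre>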
<br>
However: EBCDIC mappings (both UTF-EBCDIC and the Microsoft-supplied
ones on Unicode.org) conventionally map the EBCDIC control codes to
Unicode in a specific manner (well, in two specific manners, differing
only in LF→LF and NL→NEL versus NL→LF and LF→NEL). Apart from
aligning either LF or NL with NEL, these make no attempt at any
sort of partial compatibility with the ECMA-48 C1 set (e.g. they put
SBS at U+0098 and SPS at U+008D, as opposed to aligning them with
PLD and PLU at U+008B and U+008C respectively, which do the same
thing). They do, however, match ASCII/ECMA-48 with their C0
mappings. So using C1 control mappings which pay little or no regard
to ECMA-48 is not without precedent either.<br>
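<br>
A minimal sketch of the two newline conventions just described, assuming
the usual EBCDIC positions NL = 0x15 and LF = 0x25 (the assumption is mine,
and the rest of the control mapping is omitted):<br>
<pre>
# Two mapping tables differing only in which of NL and LF lands on U+000A.
CONVENTION_A = {0x25: "\u000a", 0x15: "\u0085"}   # LF→LF, NL→NEL
CONVENTION_B = {0x15: "\u000a", 0x25: "\u0085"}   # NL→LF, LF→NEL

def map_controls(data: bytes, table: dict) -> str:
    # Placeholder fallback; a real mapping would cover the full code page.
    return "".join(table.get(b, "\ufffd") for b in data)

print(repr(map_controls(b"\x15", CONVENTION_A)))  # '\x85' (NEL)
print(repr(map_controls(b"\x15", CONVENTION_B)))  # '\n'
</pre>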
<br>
Final note: I previously linked the ISO-IR document for the Videotex
Data Syntax 2 (ITU T.101 Annex C) "Serial" variant C1 controls,
otherwise known as the "Attribute Control Set for UK Videotex". This
is registered with ISO-IR, and hence also has an escape sequence to
declare it, as stipulated in section 12.4 of ISO 10646 (the bit on
page 20, specifically). The Teletext set, by contrast, is not
registered. However, the Data Syntax 2 Serial Videotex C1 controls
are basically the same as the ETS Teletext control set, but with ESC
removed, CSI added in its place, and the controls encoded over the
C1 range rather than the C0 range as in Teletext. Since Teletext's
unusual use of ESC for code switching would presumably be handled in
the process of transcoding to Unicode, this would be one way of
marshalling Teletext control data through Unicode with a higher-level
protocol, provided that interoperation with something using
ECMA-48 codes besides CSI or its sequences is not needed (e.g. DCS
in terminals or OSC in terminal emulators).<br>
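<br>
To make that last point concrete, a rough sketch (my own, on the
assumption, implied above, that the Videotex "Serial" set places each
control at its Teletext value plus 0x80, with ESC excluded):<br>
<pre>
# Hoist Teletext C0 attribute bytes into the C1 range, assuming a plain
# value-plus-0x80 correspondence with the Videotex "Serial" C1 set. ESC
# (0x1B) is excluded, since its Teletext code-switching role is presumed
# to have been resolved earlier in the transcoding process.

def teletext_attr_to_c1(byte: int) -> str:
    if byte == 0x1B:
        raise ValueError("ESC should already have been consumed by code switching")
    if byte in range(0x20):
        return chr(0x80 + byte)   # e.g. 0x07 becomes U+0087
    return chr(byte)              # placeholder for the graphic repertoire

print(hex(ord(teletext_attr_to_c1(0x07))))  # 0x87
</pre>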
<br>
-- Har.<br>
<br>
<br>
</body>
</html>