<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Doug Ewell via Unicode wrote:<br>
</div>
<blockquote type="cite" cite="mid:000001d69cec$0d4b1e50$27e15af0$@ewellic.org">
<pre wrap="">Richard Wordingham wrote:
</pre>
<blockquote type="cite">[…]<br>
<pre wrap="">
That strikes me as a very good description of most of the 27 (as at
Version 12) characters with an Indic syllabic category of virama.
</pre>
</blockquote>
<pre wrap="">
A non-spacing mark (Mn) is not a control character (Cc). Whether it is rendered as a separate glyph or by modifying the glyph of a neighboring character is not the issue.
There is no such thing in Unicode as a character which has more than one General_Category value. Either a character is a control character, or it is not.
Of course, I can create a program or a protocol that takes ordinary graphic characters such as < and >, and handles them in some special way, but then I am creating a new layer on top of plain text.
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
</pre>
</blockquote>
---<br>
<br>
Some comparisons of type-Cc and non-type-Cc characters with
comparable, although not necessarily identical, behaviours (provided
that the type-Cc characters are interpreted in accordance with
ECMA-48, as I shall come to later):<br>
<ul>
<li>CR (U+000D), LF (U+000A) and NEL (U+0085) are all Cc, whereas
LS/LSEP (U+2028) is Zl.</li>
<li>VT (U+000B) and FF (U+000C) are Cc, whereas PS/PSEP (U+2029)
is Zp.<br>
</li>
<li>BPH (U+0082) is Cc, whereas SHY (U+00AD) and ZWSP (U+200B) are
both Cf.</li>
<li>NBH (U+0083) is Cc, whereas WJ (U+2060) and ZWNBSP/BOM
(U+FEFF) are both Cf.</li>
<li>PLU (U+008C) to start a superscript is Cc, whereas IAS
(U+FFFA) to start a furigana section is Cf.</li>
<li>SSA (U+0086) and its terminator ESA (U+0087) are Cc, whereas
for example RLO (U+202E), which similarly affects all following
characters until further notice, is Cf.<br>
</li>
</ul>
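For anyone wanting to verify the category assignments cited above, a few
lines of Python against the standard unicodedata module suffice (my own
throwaway check, not anything drawn from the standards under discussion):<br>
<pre>
import unicodedata

# Print the General_Category of a selection of the characters compared above.
for name, cp in [("LF", 0x000A), ("NEL", 0x0085), ("LSEP", 0x2028),
                 ("PSEP", 0x2029), ("SHY", 0x00AD), ("ZWSP", 0x200B),
                 ("WJ", 0x2060), ("ZWNBSP", 0xFEFF), ("IAS", 0xFFFA),
                 ("RLO", 0x202E)]:
    print(f"{name:7} U+{cp:04X} {unicodedata.category(chr(cp))}")
# LF and NEL report Cc; LSEP is Zl and PSEP is Zp; the remainder report Cf.
</pre>
<br>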
That being said, not everything which is appropriate for a Cc
character is appropriate elsewhere: it would clearly be
inappropriate for (say) DC1 or BEL, both of which issue instructions
to something very much outside the sandbox (so to speak) of the
text renderer, to be anything other than Cc characters. However,
format effector functions (such as those above), i.e. those which
constitute instructions to the text renderer and/or layout engine
specifically, evidently do not have to be carried by Cc
characters. Indeed, this is the entire purpose of the Cf (format)
category.<br>
<br>
It is perhaps helpful, in fine, to draw a distinction between a
control code in the vernacular sense (non-printing, but does
something) and a control code in the much more restricted sense of a
category Cc character. The former may have its functions defined by
Unicode itself, whereas the latter is the domain of a control code
standard such as ECMA-48.<br>
<br>
Anyway, regarding ECMA-48 versus not ECMA-48:<br>
<br>
Interpretation of Cc characters seems to be treated as a matter for
higher-level protocols, per section 23.1 of the Unicode core
specification, which names ISO 6429 (i.e. ECMA-48) as <i>one
possible</i> such protocol but not the only one. That section lists
semantics only for HT, LF, VT, FF, CR, FS, GS, RS, US and NEL
(i.e. the format effectors and information separators), and
describes the basic concept of an ESC sequence without fully
specifying the higher-level syntax, expressly leaving escape
sequences and the interpretation of most control codes to
higher-level protocols.<br>
<br>
ISO 10646 similarly names ISO 6429 (i.e. ECMA-48) in section 11, but
qualifies this with "or similarly structured standards". Section
12.4 specifies the escape sequences indicating use of ECMA-48 within
UCS, but then (on the next page) specifies the general sequences
indicating use of other ISO-IR control code sets within UCS.
Confusingly, this specification of how an ECMA-35 control code set
designation is to be represented in UCS (i.e. padded out to the code
unit size of the encoding form, a moot point in UTF-8; a sketch of
what such padding might look like follows after this paragraph)
comes after the statement in section 11 that ISO 2022 (i.e. ECMA-35)
designation escapes are forbidden in UCS. I personally read this
apparent contradiction in the standard as meaning that designation
escapes for <i>graphical sets</i> are forbidden per section 11 (UCS
being a monolithic graphical set in itself, they would be ambiguous
and nonsensical in meaning were they used), but that those for <i>control
code sets</i> may be used, with appropriate padding, if required by
higher-level protocols per section 12.4, since the semantics of
category Cc characters are left more open to such protocols.<br>
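<br>
Purely by way of illustration of that reading (my own sketch, with an
assumed final byte; nothing here is prescribed verbatim by ISO 10646): in
UTF-16, each byte of such a control set designation escape would occupy
its own 16-bit code unit, along these lines:<br>
<pre>
# Sketch only: pad each byte of an ECMA-35 control set designation escape so
# that it occupies one UTF-16 code unit, per the reading of section 12.4 above.
# The final byte 0x43 is assumed here purely for illustration.

def pad_escape_utf16(escape_bytes: bytes, big_endian: bool = True) -> bytes:
    """Represent each byte of an escape sequence as a 16-bit code unit."""
    order = "big" if big_endian else "little"
    return b"".join(b.to_bytes(2, order) for b in escape_bytes)

designation = b"\x1b\x22\x43"   # ESC 02/02 plus an assumed final byte
print(pad_escape_utf16(designation).hex(" "))
# "00 1b 00 22 00 43": three code units, one per byte of the escape sequence
</pre>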
<br>
I understand the sum of this to be that, while use of ECMA-48 for
interpreting category Cc characters is recommended, this can be
overridden by prior agreement on some other standard as the
higher-level protocol.<br>
<br>
However: MARC 21, the standard defining character encodings
for Library of Congress records, uses a subset of ISO 6630 with some
extensions (in positions not used by ISO 6630) as its C1 set within
MARC-8 (its 8-bit, somewhat ECMA-35-based encoding), yet it uses
ECMA-48 as its C1 set within Unicode, which means that it resorts to
using SOS and ST instead of NSB and NSE to mark up a range of
characters to be ignored during collation but nonetheless
displayed. Notably, MARC-8's extensions to the ISO 6630 C1 set are
ZWJ and ZWNJ, which are included in Unicode as non-Cc characters
(U+200D and U+200C, both Cf). So there is some precedent for
considering it inappropriate to simply copy C0 and C1 codes from
non-ECMA-48 sets into Unicode streams.<br>
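<br>
As a toy illustration of that convention (my own sketch, not anything
lifted from the MARC 21 documentation): a collation routine along these
lines would display the field as-is but drop anything bracketed by SOS and
ST when building the sort key:<br>
<pre>
import re

# Strip ranges bracketed by U+0098 (SOS) and U+009C (ST) before sorting,
# while leaving the displayed form untouched.
SOS, ST = "\u0098", "\u009c"
NON_SORTING = re.compile(re.escape(SOS) + ".*?" + re.escape(ST), re.DOTALL)

def sort_key(field: str) -> str:
    return NON_SORTING.sub("", field)

title = SOS + "The " + ST + "Unicode Standard"
print(sort_key(title))   # "Unicode Standard", i.e. the article is ignored
print(title)             # displayed form still begins with "The "
</pre>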
<br>
However: EBCDIC mappings (both UTF-EBCDIC and the Microsoft-supplied
ones on Unicode.org) conventionally map the EBCDIC control codes to
Unicode in a specific manner (well, in two specific manners, differing
only in LF→LF and NL→NEL versus NL→LF and LF→NEL). Apart from
aligning either LF or NL with NEL, these make no attempt at any
sort of partial compatibility with the ECMA-48 C1 set (e.g. they put
SBS at U+0098 and SPS at U+008D, as opposed to aligning them with
PLD and PLU at U+008B and U+008C respectively, which do the same
thing). They do, however, match ASCII/ECMA-48 with their C0
mappings. So using C1 control mappings which pay little or no regard
to ECMA-48 is not without precedent either.<br>
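<br>
A minimal sketch of the two newline conventions just described, assuming
the usual EBCDIC positions NL = 0x15 and LF = 0x25 (the assumption is mine,
and the rest of the control mapping is omitted):<br>
<pre>
# Two mapping tables differing only in which of NL and LF lands on U+000A.
CONVENTION_A = {0x25: "\u000a", 0x15: "\u0085"}   # LF→LF, NL→NEL
CONVENTION_B = {0x15: "\u000a", 0x25: "\u0085"}   # NL→LF, LF→NEL

def map_controls(data: bytes, table: dict) -> str:
    # Placeholder fallback; a real mapping would cover the full code page.
    return "".join(table.get(b, "\ufffd") for b in data)

print(repr(map_controls(b"\x15", CONVENTION_A)))  # '\x85' (NEL)
print(repr(map_controls(b"\x15", CONVENTION_B)))  # '\n'
</pre>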
<br>
Final note: I previously linked the ISO-IR document for the Videotex
Data Syntax 2 (ITU T.101 Annex C) "Serial" variant C1 controls,
otherwise known as the "Attribute Control Set for UK Videotex". This
is registered with ISO-IR, and hence also has an escape sequence to
declare it, as stipulated in section 12.4 of ISO 10646 (the bit on
page 20, specifically). The Teletext set, by contrast, is not
registered. However, the Data Syntax 2 Serial Videotex C1 controls
are basically the same as the ETS Teletext control set, but with ESC
removed, CSI added in its place, and the controls encoded over the
C1 range rather than the C0 range as in Teletext. Since Teletext's
unusual use of ESC for code switching would presumably be handled in
the process of transcoding to Unicode, this would be one way of
marshalling Teletext control data through Unicode with a higher-level
protocol, provided that interoperation with something using
ECMA-48 codes besides CSI or its sequences is not needed (e.g. DCS
in terminals or OSC in terminal emulators).<br>
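<br>
To make that last point concrete, a rough sketch (my own, on the
assumption, implied above, that the Videotex "Serial" set places each
control at its Teletext value plus 0x80, with ESC excluded):<br>
<pre>
# Hoist Teletext C0 attribute bytes into the C1 range, assuming a plain
# value-plus-0x80 correspondence with the Videotex "Serial" C1 set. ESC
# (0x1B) is excluded, since its Teletext code-switching role is presumed
# to have been resolved earlier in the transcoding process.

def teletext_attr_to_c1(byte: int) -> str:
    if byte == 0x1B:
        raise ValueError("ESC should already have been consumed by code switching")
    if byte in range(0x20):
        return chr(0x80 + byte)   # e.g. 0x07 becomes U+0087
    return chr(byte)              # placeholder for the graphic repertoire

print(hex(ord(teletext_attr_to_c1(0x07))))  # 0x87
</pre>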
<br>
-- Har.<br>
<br>
<br>
</body>
</html>