Concise term for non-ASCII Unicode characters
richard.wordingham at ntlworld.com
Mon Sep 21 17:04:16 CDT 2015
On Mon, 21 Sep 2015 20:54:23 +0100
"Tony Jollans" <Tony at jollans.com> wrote:
> Windows code pages and their ilk predate Unicode, and I would only
> ever expect to see them used in environments where legacy support is
> needed, and would not expect a significant amount of new
> documentation about them to be written.
So at what version did Windows ditch 'ANSI code pages' as the default
for users' 'plain text'?
> Nor, as
> far as I'm aware, do the 0x80 to 0xFF octets have any special meaning
> in Unicode that would require there to be a recognisable term to
> describe them.
Such 8-bit *code units* are unambiguous indicators that one code
unit = one code point no longer applies. The 16-bit analogue to ASCII
v. non-ASCII in scalar values, namely the BMP v. supplementary planes,
has a fair amount of terminology. Indeed, there is a special
terminology for the 16-bit analogue of octets with high bit set, the
surrogate 'code points'. The analogy breaks down because of the
existence of the Latin-1 Supplement block - the number 0xC2 serves a
double rôle as U+00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX and as a
UTF-8 lead byte.
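A short Python sketch (my own illustration, not from the thread) makes
the double role of the number 0xC2, and the 16-bit surrogate analogy,
concrete:

```python
import struct

# 0xC2 as a scalar value: U+00C2 itself needs *two* bytes in UTF-8.
assert "\u00C2".encode("utf-8") == b"\xc3\x82"

# 0xC2 as a code unit: it is the UTF-8 lead byte for U+0080..U+00BF,
# e.g. U+00A0 NO-BREAK SPACE.
assert "\u00A0".encode("utf-8") == b"\xc2\xa0"

# The 16-bit analogue: a UTF-16 code unit in 0xD800..0xDFFF is a
# surrogate, signalling that one code unit != one code point.
units = struct.unpack(">2H", "\U0001F600".encode("utf-16-be"))
assert all(0xD800 <= u <= 0xDFFF for u in units)
```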
> Code that processes arbitrary *character* sequences (for legibility
> or any other reason) should, surely, work with characters, which may
> be sequences of code points, each of which may be a sequence of
> bytes. I can think of no reason for chopping up byte sequences except
> where they are going to be recombined later, by the reverse
> treatment, and code, if required, that does so probably has no idea
> of, and need not have any idea of, meaning, and can only, surely,
> work with bytes.
In the case I have in mind, the catch is that the chopped up sequences
are being stored in an intentionally human readable intermediate file.
The reason for the file being readable is to allow debugging, and in
extreme cases, correction. Now, the application is fairly old, and was
created when lines longer than 132 characters caused problems.
However, lines many thousands of characters long can still cause
problems, and are not amenable to line-by-line differencing. In
principle, one might rewrite the presentation part of the package to be
aware of Unicode characters (or even grapheme clusters), and that would
cause havoc if the text chopped up contained multibyte characters and
the reading program assumed that each chunk contained no broken
characters.
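To illustrate the hazard (my own sketch, not code from the package
under discussion): chopping a UTF-8 byte stream at an arbitrary offset
can split a multibyte character, whereas chunking on character
boundaries keeps every piece decodable.

```python
# Byte-level chopping can split a multibyte character.
text = "naïve"                       # 'ï' is U+00EF: b'\xc3\xaf' in UTF-8
data = text.encode("utf-8")          # b'na\xc3\xafve', 6 bytes
chunk = data[:3]                     # ends in the middle of 'ï'
try:
    chunk.decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True
assert broken

# Chunking on character boundaries instead keeps each piece decodable.
safe_chunks = [text[i:i + 3].encode("utf-8") for i in range(0, len(text), 3)]
assert b"".join(safe_chunks) == data
assert all(c.decode("utf-8") for c in safe_chunks)
```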
> The actual octets are, of course, used in combinations, but not
> singly in any way that requires them to be described in Unicode
> terms. Or am I missing something fundamental?
I believe the relevant distinction is simply that such octets are
associated with Unicode characters. They do not occur in ASCII text.
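This property is easy to check mechanically; the helper below is a
hypothetical illustration of mine, not part of any discussed code:

```python
# Octets 0x80..0xFF never occur in ASCII text, so their presence is a
# reliable sign that the bytes encode something beyond ASCII.
def has_non_ascii_octets(data: bytes) -> bool:
    return any(b >= 0x80 for b in data)

assert not has_non_ascii_octets(b"plain ASCII")
assert has_non_ascii_octets("café".encode("utf-8"))
```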