Corrigendum #9

Philippe Verdy verdy_p at wanadoo.fr
Sat May 31 06:09:16 CDT 2014


2014-05-30 20:49 GMT+02:00 Asmus Freytag <asmusf at ix.netcom.com>:

> This might have been possible at the time these were added, but now it is
> probably not feasible. One of the reasons is that block names are exposed
> (for better or for worse) as character properties and as such are also
> exposed in regular expressions. While not recommended, it would be really
> bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)"
> were to fail, because we split the block into three (with the middle one
> being the noncharacters).
>

If you think about pseudocode testing for properties, then nothing forbids
the test IsInArabicPresentationFormB(x) from checking two ranges instead of
just one. Almost all character properties already cover multiple ranges of
characters (including the more useful properties needed in many places in
code), so updating the property to cover two ranges is not a major change.
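
For example, a minimal sketch in Python of such a two-range property test,
assuming the split were made around the noncharacter run U+FDD0..U+FDEF of
Arabic Presentation Forms-A (the constant and function names here are mine
for illustration, not actual UCD property names):

    # Sketch: a block-membership property over multiple ranges, as it
    # would look if Arabic Presentation Forms-A (U+FB50..U+FDFF) were
    # split around its noncharacter run U+FDD0..U+FDEF.
    ARABIC_PRESENTATION_RANGES = (
        (0xFB50, 0xFDCF),  # before the noncharacter run
        (0xFDF0, 0xFDFF),  # after the noncharacter run
    )

    def is_in_arabic_presentation_forms_a(ch):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in ARABIC_PRESENTATION_RANGES)

    print(is_in_arabic_presentation_forms_a("\uFB50"))  # True
    print(is_in_arabic_presentation_forms_a("\uFDD0"))  # False (noncharacter)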

But anyway, I have never seen the non-characters in the Arabic presentation
forms used anywhere other than within legacy Arabic fonts, which use these
code points to map... Arabic presentation forms.

OK, text documents do not need to encode these legacy forms in order to use
these fonts (text renderers don't need them with modern OpenType fonts, but
will still use them with legacy non-OpenType TTF fonts, as a fallback for
rendering these contextual forms).

So basically there's no interchange of *text*, but the fonts using these
code points are still interchanged.

I think it would be better to just reassign these characters as
compatibility characters (or even as PUA) and not as non-characters. I see
no rationale for keeping them illegal, when it just causes unnecessary
complications for document validation.

After all, most C0 and C1 controls also don't have any interchangeable
semantic except being "controls", which are always application- and
protocol-dependent. They are not meant for encoding texts, except in legacy
more or less "rich" encodings: e.g. for storing escape sequences (not
standardized, and fully dependent on the protocol or terminal type, or on
various legacy standards that did not separate text from style), or for the
many protocols that need them for special purposes, such as tagging content,
switching code pages, changing colors and font styles, positioning on a
screen or input form, adding formatting metadata, implementing out-of-band
commands, starting/stopping records, pacing bandwidth use,
starting/ending/redirecting/splitting/merging sessions, embedding non-text
content such as bitmap images or structured data, changing transport
protocol options such as compression schemes, exchanging
encryption/decryption keys, adding checksums or error-correction data,
marking redundant data copies, or inserting resynchronization points for
error recovery...

So these "non-characters" in Arabic presentation forms are to be treated
more or less like most C1 controls that have undefined behavior. Saying
that there's a need for a "prior agreement" the agreement may be explicit
by the fact that they are used in some old font formats (the same is true
about old fonts using PUA assignments: the kind of agreement is basically
the same, and in both cases, fonts are not plain-text documents).

So the real question for us is only to be able to answer this:
"is this document valid and conforming plain text?"

If:
  * (1) your document contains any of:
  - most of the C0 or C1 controls (except CR, LF, VT, and FF from C0, and
NEL from C1)
  - any PUA code point
  - any non-character
  - any unpaired surrogate
  * or (2) your document does not validate under its encoding scheme,
then it is not plain text (to be interchangeable it also needs a recognized
standard encoding, which in turn requires an agreement or a specification in
the protocol or file format used to transport it); a sketch of such a check
follows.
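
A minimal reading of those criteria in Python (the allowed-controls set and
the PUA/non-character ranges encode my list above, not a normative Unicode
definition):

    # Sketch: classify a decoded string against the criteria above.
    ALLOWED_CONTROLS = {0x0A, 0x0B, 0x0C, 0x0D, 0x85}  # LF, VT, FF, CR, NEL

    def is_noncharacter(cp):
        # U+FDD0..U+FDEF, plus the last two code points of every plane.
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    def is_pua(cp):
        return (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
                or 0x100000 <= cp <= 0x10FFFD)

    def is_plain_text(s):
        for ch in s:
            cp = ord(ch)
            if (cp < 0x20 or 0x7F <= cp <= 0x9F) and cp not in ALLOWED_CONTROLS:
                return False  # disallowed C0/C1 control (or DEL)
            if 0xD800 <= cp <= 0xDFFF:
                return False  # lone surrogate in the decoded sequence
            if is_pua(cp) or is_noncharacter(cp):
                return False
        return True

    print(is_plain_text("Hello\r\n"))   # True
    print(is_plain_text("Oops\uFDD0"))  # False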

Personally I think that surrogates are also non-characters. They are not
assigned to any character, even if some encoding forms use them internally
as code units (not directly as code points: supplementary code points are
first converted into two code units); this means that some documents are
valid UTF-16 or UTF-32 documents even though they are not plain text under
the current system (I don't like this situation, because UTF-16 and UTF-32
documents are supposed to be interchangeable, even if they are not all
convertible to UTF-8).
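
To illustrate (Python 3, whose str type tolerates a lone surrogate in
memory even though its strict codecs reject it on interchange):

    # A lone surrogate can sit in an in-memory Python string...
    s = "\ud800"
    # ...but the strict UTF-8 codec refuses to interchange it:
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as e:
        print("not convertible to UTF-8:", e.reason)
    # "surrogatepass" shows the raw UTF-16 code unit it occupies:
    print(s.encode("utf-16-be", "surrogatepass").hex())  # d800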

But with the non-characters in the Arabic presentation forms, everything is
handled as if they were reserved for a possible future encoding that could
use them internally to represent some text using sequences of code units
containing or starting with them, or for some still mysterious encoding
under a PUA agreement with an unspecified protocol (exactly the same
situation as with most C1 controls), or as possible replacements for code
units that could collide with the internal use of some standard controls in
some protocols (e.g. to re-encode a NULL, or to delimit the end of a
variable-length escape sequence, when all other C0 and C1 controls are
already used in a terminal protocol). But even in this case, it would be
difficult to consider documents containing them as "plain text".

----

Note: I am not discussing the 34 non-characters at positions U+xxFFFE and
U+xxFFFF: keep them as non-characters; they are sufficient for all possible
internal uses (in fact only U+FFFE and U+FFFF are needed: the first for
determining the byte order in streams that accept either big-endian or
little-endian ordering, the second to mark the end of a stream), and I have
still never seen any application needing more non-characters from the
Arabic presentation forms for such use.

The non-character U+FFFE can be used to detect the byte order in UTF-16 and
UTF-32, but not the bit order within bytes: reversing the bits of each byte
turns 0xFE into 0x7F and leaves 0xFF unchanged, so a bit-swapped BOM reads
as U+7FFF or U+FF7F, and neither of those is a non-character.
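
By contrast, byte-order detection from the BOM is straightforward; a
minimal sketch, assuming the stream actually begins with U+FEFF:

    def detect_utf16_byte_order(buf):
        # U+FEFF leads the stream; its bytes reveal the endianness,
        # because the swapped reading U+FFFE is a noncharacter.
        if buf[:2] == b"\xfe\xff":
            return "utf-16-be"
        if buf[:2] == b"\xff\xfe":
            return "utf-16-le"
        return "unknown"

    print(detect_utf16_byte_order(b"\xff\xfeA\x00"))  # utf-16-le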

This is a problem in some protocols that can accept both orders without an
explicit prior declaration: they would need another non-character to help
determine the bit order, in which case the encoding of the non-character
U+1FFFE could be used. If the bit order is swapped in UTF-16, we get 0xDFFE
as the second UTF-16 code unit, from which we can determine the bit order
from the position of the clear bit with value 0x20000 in the code point.
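
For reference, the standard UTF-16 surrogate arithmetic that yields that
0xDFFE code unit (the bit-order use itself is only my speculation):

    def utf16_pair(cp):
        # Standard UTF-16 encoding of a supplementary code point.
        v = cp - 0x10000
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    hi, lo = utf16_pair(0x1FFFE)
    print(hex(hi), hex(lo))  # 0xd83f 0xdffe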

But unlike the current code unit 0xFEFF used as the BOM (which is
considered a valid character and in-band, stripped conditionally only in
the leading position), 0x1FFFE could be treated as a non-character and its
presence always as out-of-band; so once the bit and byte order has been
detected (or changed with it within a stream of code units), it can always
be stripped from the plain-text output of code points.

Maybe in the future we'll need more distinctive order marks for bits,
bytes, and code units, but I am convinced that the 34 code points U+xxFFFE
and U+xxFFFF will be more than enough **without** ever needing to use the
non-characters in the Arabic presentation forms block.