Surrogates and noncharacters

Hans Aberg haberg-1 at telia.com
Mon May 11 16:53:02 CDT 2015


> On 11 May 2015, at 21:25, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 
> Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit) code unit in "32-bit strings", even if it is not a valid code point with a valid scalar value in any legacy or standard version of UTF-32.

The reason I did it was to avoid needing a check that throws an exception. It merely means that the check for valid Unicode code points must, in such a context, be done elsewhere.

> The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned differences in 32-bit integers (if they were ever in fact converted to larger integers, such as 64-bit, which would exhibit differences in APIs returning individual code units).

Indeed, so I use uint32_t combined with uint8_t, because char can be signed at the will of the C/C++ compiler implementer.
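Something along these lines (just a sketch, with made-up type names), using fixed-width unsigned types so that the implementation-defined signedness of plain char never enters the picture:

    #include <cstdint>
    #include <vector>

    // Plain char may be signed or unsigned at the compiler's discretion,
    // so fixed-width unsigned types serve as code units and code points.
    using utf8_unit  = std::uint8_t;   // UTF-8 code unit (byte)
    using code_point = std::uint32_t;  // abstract code point / 32-bit code unit

    using utf8_string  = std::vector<utf8_unit>;
    using utf32_string = std::vector<code_point>;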

> It's true that in 32-bit integers (signed or unsigned) you cannot differentiate 0xFFFFFFFF from -1 (which is generally the value chosen in C/C++ standard libraries for representing the EOF condition returned by functions or macros like getchar()). But EOF conditions do not need to be differentiated when you are scanning positions in a buffer of 32-bit integers (instead you compare the relative index in the buffer with the buffer length, or the buffer object includes a separate method to test this condition).

It is a good point - perhaps that was the reason for not allowing the highest bit to be set. But it would not be a problem in C++, should it get UTF-32 streams, as they can throw an exception.
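For example, a sketch of such a reader (hypothetical names, not from any existing library): the end of the buffer is detected by comparing the position with the length, and an exception replaces any reserved EOF value:

    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Returns the next 32-bit code unit; throws instead of returning a
    // sentinel such as -1/EOF when the end of the buffer is reached.
    std::uint32_t next_unit(const std::vector<std::uint32_t>& buf, std::size_t& pos) {
        if (pos >= buf.size())
            throw std::out_of_range("end of buffer");
        return buf[pos++];
    }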

> But today, when programming environments are going to 64-bit by default, the APIs that return an integer when reading individual code positions will return them as 64-bit integers, even if the inner storage uses 32-bit code units: 0xFFFFFFFF will then be returned as a positive integer and not as the -1 used for EOF.

Right, the C/C++ language specifications say that size_t and friends must be able to hold any object size, and similarly for differences. So this forces signed and unsigned 64-bit integral types on a 64-bit platform.
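As an illustration (an observation about common 64-bit ABIs such as LP64/LLP64, not a strict guarantee of the language standards), a 32-bit code unit then fits in both types without any sign trouble:

    #include <cstddef>

    // On typical 64-bit platforms both the unsigned size type and the
    // signed difference type are 64-bit, so 0xFFFFFFFF stays positive.
    static_assert(sizeof(std::size_t) == 8 && sizeof(std::ptrdiff_t) == 8,
                  "expected on a typical 64-bit platform");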

> This was not yet true when the legacy UTF-32 encoding was created, when a majority of environments were still only running 32-bit or 16-bit code; for the 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had to be assigned to a noncharacter to limit problems of confusion with the EOF condition in C/C++ or similar APIs in other languages (when they cannot throw an exception instead of returning a distinct EOF value).

Right, it might be a non-issue today.

> Well, there are still a lot of devices running 32-bit code (notably in guest VMs, and in small devices) and written in C/C++ with the old standard C library, but without OOP features (such as exceptions, or methods for buffer objects). In Java, the "int" datatype (which is 32-bit and signed) has not been extended to 64-bit, even on platforms where 64-bit integers are the internal datatype used by the JVM in its natively compiled binary code.

Legacy is a problem.

> Once again, "code units" and "x-bit strings" are not bound by any Unicode or ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs or the legacy (obsoleted) UTFs.
> 
> And I still don't see any productive need for "Unicode x-bit strings" in TUS D80-D83, when what is needed for conformance is NOT the whole range of valid code units, but only the allowed range of scalar values (for this, the code units need only be drawn from a large enough set of distinct values:
> 
> The exact cardinality of this set does not matter, and there can always exist additional valid "code units" not bound to any valid "scalar value", beyond the minimal set of distinct "Unicode code units" needed to support the standard Unicode encoding forms).
> 
> Even the Unicode scalar values, or the implied values of "Unicode code units", do not have to be aligned with the effective native values of "code units" used at the lower level... except for the standard encoding schemes for 8-bit interchange, where byte order matters... but still not the lower-level bit order and the native hardware representation of individually addressable bytes, which may sometimes be larger than 8 bits, with some other control bits or framing bits, and sometimes even with variable bit sizes depending on their relative position in transport frames!

It is perfectly fine to consider the Unicode code points as abstract integers, with the UTF-32 and UTF-8 encodings translating them into byte sequences in a computer. The code points that conflict with UTF-16 (the surrogates) might merely have been declared not in use until UTF-16 had fallen out of use, replaced by UTF-8 and UTF-32. One is going to check that the code points are valid Unicode values somewhere, so it is hard to see the point of restricting UTF-8 to align it with UTF-16.
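That check is simple wherever it is placed; a sketch (the excluded range is the UTF-16 surrogate block):

    #include <cstdint>

    // True if v is a Unicode scalar value: at most 0x10FFFF and not a
    // surrogate code point (0xD800..0xDFFF), which exist only for UTF-16.
    constexpr bool is_scalar_value(std::uint32_t v) {
        return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
    }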




