Proposal to update the UTF-18 specification (RFC 4042)

Sławomir Osipiuk sosipiuk at gmail.com
Sat Apr 1 04:45:00 CDT 2023


Background and Motivation:

UTF-18, the UCS Transformation Format ― 18-bit, is specified in IETF
RFC 4042. UTF-18, along with UTF-9, is intended to provide efficient
storage and processing of UCS/Unicode text in nonet-based
environments. Of the two formats, UTF-18 is notably much simpler, as
it represents any UCS code point with exactly two nonets; that is,
exactly 18 bits. A curious limitation of UTF-18 is that it is
incapable of representing all potential UCS code points.

At the time RFC 4042 was written, the ranges of code points
representable by UTF-18 included all then-existing non-private assigned
characters. This situation changed in 2020, with the release of
ISO/IEC 10646:2020 and Unicode 13.0, which assigned characters to code
points in the U+30000 to U+3FFFF range, known as Plane 3, or the
Tertiary Ideographic Plane (TIP).

The obvious values for representing these new characters in UTF-18
were defined by RFC 4042 to instead represent characters in Plane 14,
the Supplementary Special Purpose Plane (SSP). This design decision
was reasonable at the time, driven by the necessity of representing
existing SSP characters. Unfortunately, it makes TIP characters
impossible to encode in UTF-18 as it is currently specified. RFC 4042
explicitly forbids the use of any surrogate-pair mechanism to increase
the number of representable code points.

However, with a minor modification, UTF-18 can be made to represent
not only all currently assigned characters, but also all code points
which have been roadmapped for future assignment. This change can
ensure the practical viability of UTF-18 as a reliable storage format
for the foreseeable future in all environments where nonet-based or
octodectet-based storage and/or processing is ideal. Furthermore, this
proposed updated UTF-18 definition holds to the fundamental design and
spirit of the original, and presents a clear and straightforward
upgrade path for existing UTF-18 implementations.


Technical Details:

The only amendments to the existing specification as given by RFC 4042 are:

- Codepoint values in the range U+0000 - U+3FFFF are copied as the
same value into a UTF-18 value.
- Codepoint values in the range U+E0000 - U+E03FF are copied as values
0xdc00 - 0xdfff; that is, these values are shifted by 0xd2400.

This strategy allows the representation of all code points in Planes
0, 1, 2, and now 3 (TIP), as well as the portion of Plane 14 (SSP)
encompassing all its currently assigned characters.
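
The two rules above can be sketched in Python as follows. This is an
illustration only, not part of RFC 4042 or of this proposal: the
function and constant names are hypothetical, and rejecting surrogate
code points (U+D800 to U+DFFF) outright is an assumption made so that
the mapping stays one-to-one.

```python
# Sketch of the proposed UTF-18 mapping. All names are illustrative.
DIRECT_LAST = 0x3FFFF   # U+0000..U+3FFFF are copied unchanged
SSP_FIRST = 0xE0000     # first SSP code point covered by the proposal
SSP_LAST = 0xE03FF      # last SSP code point covered by the proposal
SSP_OFFSET = 0xD2400    # 0xE0000 - 0xD2400 == 0xDC00
UNIT_SSP_FIRST = SSP_FIRST - SSP_OFFSET  # 0xDC00
UNIT_SSP_LAST = SSP_LAST - SSP_OFFSET    # 0xDFFF


def encode(code_point: int) -> int:
    """Return the 18-bit code unit for a code point, or raise ValueError."""
    if SSP_FIRST <= code_point <= SSP_LAST:
        return code_point - SSP_OFFSET
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption: surrogate code points are never encoded.
        raise ValueError(f"surrogate code point U+{code_point:04X}")
    if 0 <= code_point <= DIRECT_LAST:
        return code_point
    raise ValueError(f"U+{code_point:04X} is not representable")


def decode(unit: int) -> int:
    """Return the code point for an 18-bit code unit, or raise ValueError."""
    if not 0 <= unit <= 0x3FFFF:
        raise ValueError(f"{unit:#x} does not fit in 18 bits")
    if UNIT_SSP_FIRST <= unit <= UNIT_SSP_LAST:
        return unit + SSP_OFFSET
    if 0xD800 <= unit < UNIT_SSP_FIRST:
        # Assumption: units 0xD800..0xDBFF are left unused.
        raise ValueError(f"unused code unit {unit:#x}")
    return unit
```

For example, encode(0x30000) returns 0x30000 (a TIP code point copied
as-is), while encode(0xE0001) returns 0xDC01.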

For greater clarity, it is noted that this strategy does not make use
of a surrogate mechanism as in UTF-16 (there is still a one-to-one
correspondence between code points and code units), nor does it assign
code points from the reserved U+D800 to U+DFFF range (though it does
make use of code units in that range). The fundamental definition of
UTF-18 is not changed. Only some values involved in the range mapping
between code points and code units are altered from the original
definition in order to more efficiently cover assigned code points.

The full updated specification may be obtained by running the UNIX command:

wget -q -O - https://www.rfc-editor.org/rfc/rfc4042.txt \
| sed -E 's/EFF/E03/;s/0x3[0f]([0f]{3})/0xd\1/g;
s/d0/dc/;s/2F/3F/;s/600/156/;s/700/d24/'


Upgrade Considerations:

Remarkably, and very fortunately, a recent extensive search of all
publicly available UTF-18 data has found no instances of UCS values in
the U+E0000 to U+EFFFF range; that is, there is currently zero use of
SSP characters in publicly-visible UTF-18 environments.

While an accurate measure of SSP character usage in private UTF-18
environments cannot be determined, it can reasonably be presumed to
correspond closely to the public usage, and thus to be extremely low
or zero.

This presents an extraordinary, but time-sensitive, opportunity to
upgrade UTF-18 systems with the enhancement herein described with
minimal disruption and effort.

All upgradable software which processes UTF-18 data can be upgraded to
the new range mappings quickly and simply. Reasonable program code
implementations would most likely store the range limits and offset as
constant values which can be adjusted in source code, with software
re-compiled and re-deployed at low risk. This likewise applies to
strategies which use bit-masking to effect the range mapping. The
trivial nature of the change means that it is suitable for a point
release, and can be easily back-ported to previous major releases
should that be necessary.

Where existing stored user data lacks any instances of SSP characters,
there is nothing required to bring the data into compliance with the
updated standard, and such data may be considered de facto
“forwards-compatible”. This is the crux of the current opportunity,
and why time is of the essence. As the corpus of UTF-18 data grows
with time, SSP characters will almost certainly be introduced in more
environments, complicating the upgrade process.
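
A check for this "forwards-compatible" condition might look like the
sketch below. It assumes, from my reading of RFC 4042, that existing
data stores SSP characters as code units 0x30000 to 0x3ffff, and that
the data is otherwise well-formed (no surrogate code points). The
function name is hypothetical.

```python
from typing import Iterable

# Assumption: under RFC 4042, SSP characters occupy these code units.
LEGACY_SSP_UNITS = range(0x30000, 0x40000)


def is_forward_compatible(units: Iterable[int]) -> bool:
    """True if no code unit falls in the legacy SSP range.

    Such data reads identically under the original and the updated
    range mappings, so it needs no conversion.
    """
    return not any(u in LEGACY_SSP_UNITS for u in units)
```

Data for which this returns False contains legacy SSP code units and
would need to be converted before being treated as updated UTF-18.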

In environments where existing UTF-18 data already includes such
characters, a suitable upgrade strategy can still be developed, of
course, and should be developed without delay to ensure compatibility
with the latest standard going forward. It is most desirable that
UTF-18 data be cleanly interchangeable between UTF-18 compliant
systems. As the change is simply an offset applied to a specific
subset of UCS values, an algorithm to update data will in general be
simple to develop, test, and execute.
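
As a rough illustration, such an algorithm could be sketched as below.
It assumes that legacy data stores U+E0000 to U+EFFFF as code units
0x30000 to 0x3ffff (my reading of RFC 4042), and it raises an error
for legacy SSP values beyond U+E03FF, which the updated mapping cannot
represent. The names are hypothetical, and a real migration might
prefer to log or substitute rather than fail.

```python
from typing import Iterable, Iterator

OLD_SSP_FIRST_UNIT = 0x30000  # assumed legacy code unit for U+E0000
OLD_SSP_LAST_UNIT = 0x3FFFF   # assumed legacy code unit for U+EFFFF
NEW_SSP_FIRST_UNIT = 0xDC00   # updated code unit for U+E0000
NEW_SSP_SIZE = 0x400          # U+E0000..U+E03FF


def migrate(units: Iterable[int]) -> Iterator[int]:
    """Convert legacy UTF-18 code units to the updated mapping."""
    for u in units:
        if OLD_SSP_FIRST_UNIT <= u <= OLD_SSP_LAST_UNIT:
            index = u - OLD_SSP_FIRST_UNIT
            if index >= NEW_SSP_SIZE:
                # Legacy value above U+E03FF: no updated equivalent.
                raise ValueError(f"U+{0xE0000 + index:X} is not representable")
            yield NEW_SSP_FIRST_UNIT + index
        else:
            # Planes 0, 1, and 2 are stored identically in both mappings.
            yield u
```

Because migrate is a generator, a large file can be converted one code
unit at a time without being loaded into memory in full.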

Implementors are urged to examine their existing environments,
software, and user data to determine the best course, and strongly
consider an in-place upgrade of the software (and, if required, data)
wherever feasible.

UTF-18 systems which cannot feasibly be upgraded will continue to
function as expected with any data that does not include SSP or TIP
code points. They should be carefully monitored for potential errors
whenever data interchange with upgraded systems occurs.
Data-sanitization strategies may be required, depending on the
potential severity of mishandling SSP and TIP characters.
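
One possible sanitization step, applied to updated UTF-18 data before
it is handed to a system that cannot be upgraded, is sketched below.
Substituting U+FFFD REPLACEMENT CHARACTER is only one hypothetical
policy, and the function name is illustrative. The two affected
ranges are the code units whose meaning differs between the original
and the updated mappings.

```python
from typing import Iterable, List

REPLACEMENT = 0xFFFD  # U+FFFD, stored identically in both mappings


def sanitize_for_legacy(units: Iterable[int]) -> List[int]:
    """Replace code units that a non-upgraded system would misread."""
    cleaned = []
    for u in units:
        is_tip = 0x30000 <= u <= 0x3FFFF  # TIP here, SSP to a legacy reader
        is_ssp = 0xDC00 <= u <= 0xDFFF    # SSP here, not SSP to a legacy reader
        cleaned.append(REPLACEMENT if (is_tip or is_ssp) else u)
    return cleaned
```

For instance, sanitize_for_legacy([0x41, 0x30000, 0xDC01]) yields
[0x41, 0xFFFD, 0xFFFD].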


Conclusion:

The modification proposed here to the UTF-18 specification is an
easy-to-implement enhancement that allows UTF-18 to cover the entire
present non-private UCS character repertoire, ensuring that UTF-18
remains as technically viable and relevant as it ever was in the face
of continued development of the UCS and Unicode.


