Unicode String Models

Henri Sivonen via Unicode unicode at unicode.org
Tue Sep 11 05:12:40 CDT 2018

On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
<unicode at unicode.org> wrote:
> I recently did some extensive revisions of a paper on Unicode string models (APIs). Comments are welcome.
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

* The Grapheme Cluster Model seems to have a couple of disadvantages
that are not mentioned:
  1) The subunit of a string is also a string (a short string
conforming to particular constraints). There's a need for *another*,
more atomic mechanism for examining the internals of the grapheme
cluster string.
  2) The way an arbitrary string is divided into units when iterating
over it changes when the program is executed on a newer version of the
language runtime that is aware of newly-assigned codepoints from a
newer version of Unicode.
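To illustrate point 1 (a Rust sketch; the cluster below is hand-picked
rather than produced by a segmenter):

```rust
// A single extended grapheme cluster is itself a string: examining its
// internals needs a more atomic iteration, e.g. by code point (char).
fn code_points(cluster: &str) -> Vec<char> {
    cluster.chars().collect()
}

fn main() {
    // 'e' + U+0301 COMBINING ACUTE ACCENT: one user-perceived
    // character, but two code points inside the cluster.
    let cluster = "e\u{0301}";
    assert_eq!(code_points(cluster).len(), 2);
}
```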

 * The Python 3.3 model mentions the disadvantages of memory usage
cliffs but doesn't mention the associated performance cliffs. It would
be good to also mention that when a string manipulation causes the
storage to expand or contract, there's a performance impact that's not
apparent from the nature of the operation if the programmer's
intuition works on the assumption of a single fixed-width
representation.
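As a sketch of the storage decision (in Rust, not Python's actual C
implementation; `storage_width` is a hypothetical helper name): the
widest code point in the string picks 1-, 2-, or 4-byte storage, so
appending a single wide character can force a whole-string re-encode.

```rust
// Hedged sketch of the Python 3.3 flexible representation's width
// choice: the maximum code point determines bytes per stored unit.
fn storage_width(s: &str) -> usize {
    match s.chars().map(|c| c as u32).max() {
        Some(m) if m > 0xFFFF => 4, // non-BMP forces 4 bytes/unit
        Some(m) if m > 0xFF => 2,   // non-Latin-1 forces 2 bytes/unit
        _ => 1,                     // Latin-1 range fits one byte
    }
}

fn main() {
    assert_eq!(storage_width("ascii only"), 1);
    assert_eq!(storage_width("café"), 1);      // U+00E9 still fits a byte
    assert_eq!(storage_width("price: €5"), 2); // U+20AC widens the string
    assert_eq!(storage_width("🎉"), 4);
}
```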

 * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
text node storage in Gecko, (I believe but am not 100% sure) V8 and,
optionally, HotSpot. That is, text has UTF-16 semantics, but if the
high half of every code unit in a string is zero, only the lower half
is stored. This has properties analogous to the Python 3.3 model,
except non-BMP text doesn't expand to UTF-32 but uses UTF-16 surrogate
pairs.
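The storage decision can be sketched like this (Rust; `fits_in_latin1`
is a hypothetical helper name, not any engine's actual API):

```rust
// Sketch of the decision in a hypothetical UTF-16/Latin1 string type:
// if the high half of every UTF-16 code unit is zero, only the low
// halves need to be stored.
fn fits_in_latin1(units: &[u16]) -> bool {
    units.iter().all(|&u| u <= 0xFF)
}

fn main() {
    let narrow: Vec<u16> = "Tür".encode_utf16().collect();
    let wide: Vec<u16> = "Tür€".encode_utf16().collect();
    assert!(fits_in_latin1(&narrow)); // store one byte per code unit
    assert!(!fits_in_latin1(&wide));  // U+20AC needs the high half
}
```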

 * I think the fact that systems that chose UTF-16 or UTF-32 have
implemented models that try to save storage by omitting leading
zeros, gaining complexity and performance cliffs as a result, is a
strong indication that UTF-8 should be recommended for newly-designed
systems that don't suffer from a forceful legacy need to expose UTF-16
or UTF-32 semantics.

 * I suggest splitting the "UTF-8 model" into three substantially
different models:

 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
UTF-8-related operations are performed when ingesting byte-oriented
data. Byte buffers and text buffers are type-wise ambiguous. Only
iterating over byte data by code point gives the data the UTF-8
interpretation. Unless the data is cleaned up as a side effect of such
iteration, malformed sequences in input survive into output.
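To illustrate the effect (a Rust sketch of the behavior, since Rust is
used for the other examples here; Go's actual types differ):

```rust
fn main() {
    // Garbage In, Garbage Out: bytes are ingested without validation,
    // so a malformed sequence passes straight through to output.
    let input: Vec<u8> = vec![0x68, 0x69, 0xFF]; // "hi" + stray 0xFF
    let output = input.clone(); // no UTF-8 operation on the I/O path
    assert_eq!(output, input);

    // Only iteration/decoding applies the UTF-8 interpretation; a
    // lossy decode is where the malformed byte becomes U+FFFD.
    assert_eq!(String::from_utf8_lossy(&input), "hi\u{FFFD}");
}
```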

 2) UTF-8 without full trust in ability to retain validity (the model
of the UTF-8-using C++ parts of Gecko; I believe this to be the most
common UTF-8 model for C and C++, but I don't have evidence to back
this up): When data is ingested with text semantics, it is converted
to UTF-8. For data that's supposed to already be in UTF-8, this means
replacing malformed sequences with the REPLACEMENT CHARACTER, so the
data is valid UTF-8 right after input. However, iteration by code
point doesn't trust the ability of other code to retain UTF-8 validity
perfectly and has "else" branches in order not to blow up if invalid
UTF-8 creeps into the system.
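A minimal sketch of that "else branch" style in Rust (hypothetical
helper; real C++ code would do this inline in its decoder):

```rust
// The data should already be valid UTF-8 (malformed input was replaced
// at ingestion), but decoding still copes if invalid bytes crept in.
fn decode_defensively(mut rest: &[u8]) -> String {
    let mut out = String::new();
    while !rest.is_empty() {
        match std::str::from_utf8(rest) {
            Ok(s) => {
                out.push_str(s);
                break;
            }
            Err(e) => {
                // The defensive branch: salvage the valid prefix, emit
                // U+FFFD, and skip past the malformed sequence.
                let valid = e.valid_up_to();
                out.push_str(std::str::from_utf8(&rest[..valid]).unwrap());
                out.push('\u{FFFD}');
                let skip = e.error_len().unwrap_or(rest.len() - valid);
                rest = &rest[valid + skip..];
            }
        }
    }
    out
}

fn main() {
    assert_eq!(decode_defensively("ok".as_bytes()), "ok");
    assert_eq!(decode_defensively(&[0xC3, 0xA9, 0xFF]), "é\u{FFFD}");
}
```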

 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
have a different type in the type system than byte buffers. To go from
a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
has been tagged as valid UTF-8, the validity is trusted completely so
that iteration by code point does not have "else" branches for
malformed sequences. If data that the type system indicates to be
valid UTF-8 wasn't actually valid, it would be nasal demon time. The
language has a default "safe" side and an opt-in "unsafe" side. The
unsafe side is for performing low-level operations in a way where the
responsibility of upholding invariants is moved from the compiler to
the programmer. It's impossible to violate the UTF-8 validity
invariant using the safe part of the language.
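In concrete terms, this is the boundary between Vec<u8> and String in
Rust's standard library:

```rust
fn main() {
    // Crossing the type boundary from Vec<u8> to String checks
    // validity once; afterwards the type guarantees valid UTF-8.
    let bytes = vec![0xE2, 0x82, 0xAC]; // "€" in UTF-8
    let s: String = String::from_utf8(bytes).expect("valid UTF-8");
    assert_eq!(s, "€");

    // Iteration by code point can therefore omit malformed-sequence
    // branches entirely.
    assert_eq!(s.chars().count(), 1);

    // Invalid bytes are rejected at the boundary instead of being
    // handled during iteration.
    assert!(String::from_utf8(vec![0xFF]).is_err());
}
```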

 * After working with different string models, I'd recommend the Rust
model for newly-designed programming languages. (Not because I work
for Mozilla but because I believe Rust's way of dealing with Unicode
is the best I've seen.) Rust's standard library provides Unicode
version-independent iterations over strings: by code unit and by code
point. Iteration by extended grapheme cluster is provided by a library
that's easy to include due to the nature of Rust package management
(https://crates.io/crates/unicode_segmentation). Viewing a UTF-8
buffer as a read-only byte buffer has zero run-time cost and allows
for maximally fast guaranteed-valid-UTF-8 output.
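The standard-library iterations and the byte view look like this:

```rust
fn main() {
    let s = "héllo";
    // Unicode-version-independent iteration by code unit and by code
    // point.
    assert_eq!(s.bytes().count(), 6); // UTF-8 code units
    assert_eq!(s.chars().count(), 5); // code points

    // Viewing the UTF-8 buffer as read-only bytes: no copy, no check.
    let view: &[u8] = s.as_bytes();
    assert_eq!(view.len(), 6);
}
```

(Iteration by extended grapheme cluster comes from the
unicode_segmentation crate mentioned above, so it is omitted here.)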

Henri Sivonen
hsivonen at hsivonen.fi
