Unicode String Models

Tue Sep 11 06:13:03 CDT 2018

> Date: Tue, 11 Sep 2018 13:12:40 +0300
> From: Henri Sivonen via Unicode <unicode at unicode.org>
> 
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
> 
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
> 
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust the ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
> 
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.
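
To make the difference at the ingestion boundary concrete, here is a
minimal Python sketch.  It is only an illustration: the function names
are made up for this message, and Python's str is not stored as UTF-8,
so this shows what happens to malformed input under each model, not
how Go, Gecko, or Rust actually implement it.

```python
def ingest_gigo(data: bytes) -> bytes:
    """Model 1 (hypothetical helper): no UTF-8 work on input.

    The bytes are kept as they are, so malformed sequences survive
    into whatever is written out later.
    """
    return data


def ingest_lossy(data: bytes) -> str:
    """Model 2 (hypothetical helper): repair on input.

    Malformed sequences are replaced with U+FFFD REPLACEMENT CHARACTER,
    so the result is valid text but the original bytes are lost.
    """
    return data.decode("utf-8", errors="replace")


def ingest_checked(data: bytes) -> str:
    """Model 3 (hypothetical helper): validate once at the boundary.

    Invalid input raises UnicodeDecodeError instead of producing text,
    so everything past this point can assume validity.
    """
    return data.decode("utf-8", errors="strict")


if __name__ == "__main__":
    bad = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8
    print(ingest_gigo(bad))
    print(ascii(ingest_lossy(bad)))
    try:
        ingest_checked(bad)
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

The point of interest is the middle function: once U+FFFD has been
substituted, the stray byte cannot be recovered from the text.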

There's another model, the one used by Emacs.  AFAIU, it is different
from all three you describe above.  In Emacs, each raw byte belonging
to a byte sequence which is invalid under UTF-8 is represented as a
special multibyte sequence.  IOW, Emacs's internal representation
extends UTF-8 with multibyte sequences it uses to represent raw bytes.
This allows mixing stray bytes and valid text in the same buffer,
without risking lossy conversions (such as those one gets under model
2 above).
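
The round-trip property can be sketched in Python.  Note that this is
not Emacs's actual internal representation: as a stand-in, the sketch
uses Python's "surrogateescape" error handler (PEP 383), which maps
each undecodable byte to a lone surrogate code point in the range
U+DC80..U+DCFF and maps it back on output.  The function names are
hypothetical.

```python
def ingest_preserving(data: bytes) -> str:
    """Decode UTF-8, keeping each stray byte as a distinct escape.

    Valid sequences decode normally; every byte that is not part of a
    valid sequence becomes one lone surrogate (U+DC80..U+DCFF).
    """
    return data.decode("utf-8", errors="surrogateescape")


def emit_preserving(text: str) -> bytes:
    """Encode back to bytes, restoring the escaped stray bytes."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"caf\xc3\xa9 \xff\xfe ok"  # valid text mixed with stray bytes

    # The preserving model returns the original bytes unchanged.
    print(emit_preserving(ingest_preserving(raw)) == raw)

    # Model 2 style replacement does not: the stray bytes become U+FFFD.
    lossy = raw.decode("utf-8", errors="replace").encode("utf-8")
    print(lossy == raw)
```

The analogy is only in the behavior: stray bytes and valid text live
in the same string, and writing the string back out reproduces the
input byte for byte.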