Unicode String Models

Henri Sivonen via Unicode unicode at unicode.org
Wed Sep 12 00:38:21 CDT 2018


On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii <eliz at gnu.org> wrote:
>
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode <unicode at unicode.org>
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).
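
To make the "lossy conversion" point concrete, here is a minimal Rust
sketch of ingestion under model 2 versus model 3. The helper names
ingest_lossy and ingest_checked are hypothetical and exist only for
illustration; the standard library calls they wrap
(String::from_utf8_lossy and std::str::from_utf8) are real.

```rust
use std::borrow::Cow;

/// Model 2 style ingestion (hypothetical helper): malformed sequences are
/// replaced with U+FFFD, so the result is always valid UTF-8, but the
/// original stray bytes cannot be recovered afterwards.
fn ingest_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Model 3 style ingestion (hypothetical helper): validity is checked once
/// at the boundary. On success the bytes are re-typed as `&str`; on failure
/// the caller gets an error and keeps the untouched byte buffer.
fn ingest_checked(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}
```

For the input b"a\xFFb", ingest_lossy yields "a\u{FFFD}b" (the 0xFF
byte is gone for good), whereas ingest_checked returns an error whose
valid_up_to() is 1 and leaves the bytes as they were.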

I think extensions of UTF-8 that expand the value space beyond Unicode
scalar values, and the problems these extensions are designed to
solve, are a worthwhile topic to cover, but I think that is a slightly
adjacent topic rather than the topic of the document.

On that topic, these two are relevant:
https://simonsapin.github.io/wtf-8/
https://github.com/kennytm/omgwtf8

The former is used in the Rust standard library in order to provide a
Unix-like view to Windows file paths in a way that can represent all
Windows file paths. File paths on Unix-like systems are sequences of
bytes whose presentable-to-humans interpretation (these days) is
UTF-8, but there's no guarantee of UTF-8 validity. File paths on
Windows are sequences of unsigned 16-bit numbers whose
presentable-to-humans interpretation is UTF-16, but there's no
guarantee of UTF-16 validity. WTF-8 can represent all Windows file
paths as sequences of bytes such that the paths that are valid UTF-16
as sequences of 16-bit units are valid UTF-8 in the 8-bit-unit
representation. This allows application-visible file paths in the Rust
standard library to be sequences of bytes both on Windows and
non-Windows platforms and to be presentable to humans by decoding as
UTF-8 in both cases.
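
As a rough illustration of that property, here is a minimal Rust
sketch of the encoding direction only. It is not the standard
library's implementation (which is internal); encode_wtf8 is a
hypothetical helper name. It assumes the rule described in the WTF-8
document: properly paired surrogates become ordinary UTF-8, and an
unpaired surrogate is written as a three-byte sequence using the same
bit layout UTF-8 uses for other three-byte code points.

```rust
/// Hypothetical helper: encode potentially ill-formed UTF-16 as WTF-8.
fn encode_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    for result in std::char::decode_utf16(units.iter().copied()) {
        match result {
            // A scalar value (including one decoded from a surrogate
            // pair) is encoded exactly as in UTF-8.
            Ok(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            // An unpaired surrogate (0xD800..=0xDFFF) gets the
            // generalized three-byte form, which strict UTF-8 forbids.
            Err(e) => {
                let s = e.unpaired_surrogate() as u32;
                out.push(0xE0 | (s >> 12) as u8);
                out.push(0x80 | ((s >> 6) & 0x3F) as u8);
                out.push(0x80 | (s & 0x3F) as u8);
            }
        }
    }
    out
}
```

With this sketch, the well-formed input [0x0041, 0xD83D, 0xDE00]
produces the same bytes as the UTF-8 encoding of "A" followed by
U+1F600, while [0x0041, 0xD800, 0x0042] produces 41 ED A0 80 42, which
a strict UTF-8 validator such as std::str::from_utf8 rejects.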

To my knowledge, the latter isn't in use yet. The implementation is
tracked in https://github.com/rust-lang/rust/issues/49802

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/

