Unicode String Models

Tue Sep 11 09:19:58 CDT 2018

These are all interesting and useful comments. I'll be responding once I
get a bit of free time, probably Friday or Saturday.

Mark

On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode <
unicode at unicode.org> wrote:

> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode <unicode at unicode.org>
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust the ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
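
To make the difference between the three models concrete, here is a
minimal sketch of the ingestion step of each one, written in Rust for
all three so they can be compared side by side. The function names
(passthrough, ingest_lossy, ingest_checked) are made up for this
illustration; only the standard-library calls are real. It is not how
Go or Gecko actually implement models 1 and 2, just the same idea
expressed in one language.

```rust
use std::str::Utf8Error;

/// Model 1 (garbage in, garbage out): the bytes are stored untouched.
/// Nothing here gives them a UTF-8 interpretation, so malformed
/// sequences survive into whatever is written out later.
/// (Hypothetical helper name.)
fn passthrough(input: &[u8]) -> Vec<u8> {
    input.to_vec()
}

/// Model 2 (repair on ingestion): malformed sequences are replaced
/// with U+FFFD REPLACEMENT CHARACTER, so the result is valid UTF-8,
/// but the original bytes can no longer be recovered.
/// (Hypothetical helper name.)
fn ingest_lossy(input: &[u8]) -> String {
    String::from_utf8_lossy(input).into_owned()
}

/// Model 3 (type-system-tagged): validity is checked once, and the
/// `&str` type records that the check succeeded. Malformed input is
/// rejected instead of being repaired.
/// (Hypothetical helper name.)
fn ingest_checked(input: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(input)
}
```

Feeding the five bytes "ab", 0xFF, "cd" to each function shows the
split: passthrough returns the same five bytes, ingest_lossy returns
"ab", U+FFFD, "cd", and ingest_checked returns an error.
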
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all three you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).
>
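
The raw-byte-preserving idea described for Emacs can be sketched
roughly as follows. This is NOT Emacs's actual internal encoding
(which, per the description above, extends UTF-8 with special
multibyte sequences); it uses a hypothetical enum, Unit, to stand in
for "either a real character or a stray byte", and the function names
decode_lossless and encode are likewise made up. The point is only
that decoding followed by encoding returns the original bytes, unlike
the replacement-character approach of model 2.

```rust
/// One decoded unit: a real code point, or a single byte that was not
/// part of any valid UTF-8 sequence. (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq, Clone, Copy)]
enum Unit {
    Char(char),
    RawByte(u8),
}

/// Decode bytes, keeping every byte of a malformed sequence as RawByte.
/// Simple rather than fast: it re-validates the tail after each stray byte.
fn decode_lossless(mut input: &[u8]) -> Vec<Unit> {
    let mut out = Vec::new();
    while !input.is_empty() {
        match std::str::from_utf8(input) {
            Ok(s) => {
                out.extend(s.chars().map(Unit::Char));
                break;
            }
            Err(e) => {
                let (valid, rest) = input.split_at(e.valid_up_to());
                // `valid` was just reported as well-formed, so this cannot fail.
                let s = std::str::from_utf8(valid).unwrap();
                out.extend(s.chars().map(Unit::Char));
                // Keep the first offending byte verbatim and resume after it.
                out.push(Unit::RawByte(rest[0]));
                input = &rest[1..];
            }
        }
    }
    out
}

/// Turn the units back into bytes; raw bytes are written out unchanged.
fn encode(units: &[Unit]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    for unit in units {
        match *unit {
            Unit::Char(c) => out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes()),
            Unit::RawByte(b) => out.push(b),
        }
    }
    out
}
```

With this representation, valid text and stray bytes sit in the same
buffer, and encode(&decode_lossless(bytes)) gives back the input for
any byte string.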