Unicode String Models

Wed Sep 12 09:03:44 CDT 2018

> Date: Wed, 12 Sep 2018 01:41:03 +0200
> Cc: unicode Unicode Discussion <unicode at unicode.org>,
>         Richard Wordingham <richard.wordingham at ntlworld.com>,
>         Hans Aberg <haberg-1 at telia.com>
> From: Philippe Verdy via Unicode <unicode at unicode.org>
> 
> The only safe way to represent arbitrary bytes within strings when they are not valid UTF-8 is to use invalid
> UTF-8 sequences, i.e., by using a "UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!).
> 
> This is what Java does for representing U+0000 by (0xC0,0x80) in the compiled bytecode or via the C/C++
> interface for JNI when converting the Java string buffer into a C/C++ string terminated by a null byte (not part
> of the Java string content itself). That special sequence, however, is exposed in the Java API as a true
> unsigned 16-bit code unit (char) with value 0x0000, and a valid single code point.

That's more or less what Emacs does.
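
To make the (0xC0,0x80) convention quoted above concrete, here is a
minimal Python sketch of a "modified UTF-8" style encoder.  The
function name encode_modified_utf8 is hypothetical, and this is an
illustration of the byte layout only, not the actual code of Java or
Emacs.  It assumes the input is treated as UTF-16 code units, as Java
strings are.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text in the style of Java's modified UTF-8 (illustrative sketch)."""
    out = bytearray()
    # Java strings are sequences of 16-bit code units, so walk UTF-16 units.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if unit == 0:
            # U+0000 becomes the overlong pair C0 80, so no 0x00 byte appears.
            out += b"\xc0\x80"
        elif unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Includes surrogates: each half of a pair gets its own 3 bytes.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


encoded = encode_modified_utf8("a\x00b")
```

The result contains no zero byte, so it can be handed to code that
expects a null-terminated C string, but a strict UTF-8 decoder will
reject the C0 80 pair; that is the sense in which the extension "is
still not UTF-8".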

> But both schemes (a) or (b) would be useful in editors that allow editing arbitrary binary files as if they were
> plain text, even if they contain null bytes or invalid UTF-8 sequences (it's up to these editors to find a way to
> distinctly represent these bytes, and a way to enter/change them reliably).

The experience in Emacs is that no serious text editor can decide that
it doesn't support these use cases.  Even if editing binary files is
out of scope, there will always be text files whose encoding is
unknowable and/or guessed/decided wrong, files with mixed encodings,
etc.
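
As a rough illustration of why such support matters, the sketch below
loads bytes that are not valid UTF-8, edits the text, and writes the
untouched stray bytes back unchanged.  It uses Python's built-in
"surrogateescape" error handler, which maps each undecodable byte to a
lone surrogate in the range U+DC80..U+DCFF; this is Python's mechanism,
not Emacs's internal representation.  The names load_lossless and
save_lossless are hypothetical.

```python
def load_lossless(raw: bytes) -> str:
    """Decode as UTF-8, keeping undecodable bytes as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def save_lossless(text: str) -> bytes:
    """Encode as UTF-8, turning escaped surrogates back into their bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# A "text file" with a Latin-1 e-acute (0xE9), a stray 0xFF, and a null byte.
raw = b"caf\xe9 \xff\x00 ok"
text = load_lossless(raw)

# Unedited content round-trips byte for byte.
round_trip = save_lossless(text)

# An edit can fix the mis-encoded character and leave the other bytes alone.
fixed = save_lossless(text.replace("\udce9", "\u00e9"))
```

The design point is that decoding never fails and never discards
information, so a wrong encoding guess can be corrected later without
having damaged the file.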