Unicode String Models

Wed Sep 12 03:37:00 CDT 2018

> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode <unicode at unicode.org> wrote:
> 
>> Date: Wed, 12 Sep 2018 00:13:52 +0200
>> Cc: unicode at unicode.org
>> From: Hans Åberg via Unicode <unicode at unicode.org>
>> 
>> It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this.
> 
> You must use a codepoint that is not defined by Unicode, and never
> will.  That is what Emacs does: it extends the Unicode codepoint space
> beyond 0x10FFFF.

The idea is to extend Unicode itself, so that those bytes can be represented by legal codepoints. U+0080 has seen some use in other encodings, but apparently not in Unicode itself. One could equally pick some other value or values and reserve them for this special purpose.
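
To make the proposal concrete, here is a minimal Python sketch of the marker scheme: each byte that is not part of valid UTF-8 becomes a marker codepoint followed by the byte with its high bit cleared. MARKER, the "marker_escape" handler name, and both function names are made up for illustration, and U+0080 is only the candidate floated above, not a reserved value.

```python
import codecs

# Hypothetical marker: U+0080 is only the candidate floated in this thread.
MARKER = "\u0080"


def _marker_escape(err):
    """Error handler: turn each undecodable byte into MARKER + (byte & 0x7F)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(MARKER + chr(b & 0x7F) for b in bad), err.end


# "marker_escape" is an illustrative name, not a standard error handler.
codecs.register_error("marker_escape", _marker_escape)


def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, keeping invalid bytes as marker pairs."""
    return data.decode("utf-8", errors="marker_escape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless for strings it produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == MARKER and i + 1 < len(text) and ord(text[i + 1]) < 0x80:
            # Marker pair: restore the high bit of the original byte.
            out.append(ord(text[i + 1]) | 0x80)
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)


raw = b"caf\xe9 ok"            # 0xE9 is a stray Latin-1 byte, not valid UTF-8
text = decode_lossless(raw)    # "caf" + MARKER + "i" + " ok", since 0xE9 & 0x7F == 0x69
assert encode_lossless(text) == raw
```

The sketch also shows the weakness of using a codepoint that can occur in ordinary text: a genuine U+0080 followed by an ASCII character is indistinguishable from an escaped byte when encoding back. That is why the marker would have to be set aside for this purpose alone.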

A number of other byte sequences are in use, too, such as overlong UTF-8. The original UTF-8 could then also be extended to handle all 32-bit words, including those with the high bit set.
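
As a rough illustration of both points, the Python sketch below continues the original UTF-8 bit pattern (up to six bytes, 31 bits) with a seven-byte form whose lead byte is 0xFE, which is enough for any 32-bit word. The seven-byte form is an assumption about how such an extension might look, not something defined by Unicode, and encode_extended and min_len are hypothetical names. Forcing a longer form than necessary yields an overlong sequence.

```python
# Lead-byte prefixes for 2- to 7-byte sequences. The 7-byte form (0xFE) is an
# assumed continuation of the original pattern, not part of any standard.
_LEADS = ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC), (7, 0xFE))


def encode_extended(value: int, min_len: int = 1) -> bytes:
    """Encode a 32-bit word in UTF-8 style; min_len > needed gives an overlong form."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    if value < 0x80 and min_len <= 1:
        return bytes([value])
    for nbytes, lead in _LEADS:
        # An n-byte sequence carries 5*n + 1 payload bits (11, 16, 21, 26, 31, 36).
        if nbytes >= min_len and value < (1 << (5 * nbytes + 1)):
            tail = []
            v = value
            for _ in range(nbytes - 1):
                tail.append(0x80 | (v & 0x3F))  # continuation byte: 10xxxxxx
                v >>= 6
            return bytes([lead | v] + tail[::-1])
    raise ValueError("min_len must be at most 7")


# Within the Unicode range the result matches standard UTF-8.
assert encode_extended(0x20AC) == "\u20ac".encode("utf-8")

# Overlong two-byte form of "/" (0x2F), which a strict decoder must reject.
assert encode_extended(0x2F, min_len=2) == b"\xc0\xaf"

# A word with the high bit set needs the assumed seven-byte form.
assert encode_extended(0x80000000) == b"\xfe\x82\x80\x80\x80\x80\x80"
```

A decoder for this format would have to accept lead bytes 0xF8 through 0xFE, which standard UTF-8 never produces.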