Unicode String Models

Tue Sep 11 17:13:52 CDT 2018

> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Tue, 11 Sep 2018 21:10:03 +0200
> Hans Åberg via Unicode <unicode at unicode.org> wrote:
> 
>> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> LaTeX files with sections in different Cyrillic and Latin encodings,
>> changing the editor encoding while typing.
> 
> Rather like some of the old Unicode list archives, which are just
> concatenations of a month's emails, with all sorts of 8-bit encodings
> and stretches of base64.

It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a marker code point to indicate that the high bit was set, followed by the byte value with its high bit cleared, that is, mapped into the ASCII range. For example, U+0080 might serve as the marker: it is a C1 control code with no standardized function, though I could not verify that it is unused in practice.
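
A rough Python sketch of this idea, in case it helps. The function names and the MARKER constant are my own, not from any library, and the sketch leans on Python's existing "surrogateescape" error handler to locate the undecodable bytes. Note one weakness I have not tried to solve: a genuine U+0080 in the input that is followed by an ASCII character would be mistaken for an escaped byte on the way back.

```python
MARKER = "\u0080"  # assumed escape marker, as suggested above (hypothetical choice)


def decode_with_marker(data: bytes) -> str:
    """Decode UTF-8, turning each undecodable byte into MARKER + ASCII char."""
    # surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
    text = data.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Keep the low 7 bits of the original byte as an ASCII character.
            out.append(MARKER + chr(cp & 0x7F))
        else:
            out.append(ch)
    return "".join(out)


def encode_with_marker(text: str) -> bytes:
    """Inverse of decode_with_marker: restore the escaped bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == MARKER and i + 1 < len(text) and ord(text[i + 1]) < 0x80:
            # Set the high bit again to recover the original byte.
            out.append(ord(text[i + 1]) | 0x80)
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)


# A Latin-1 "å" (byte 0xE5) in the middle of otherwise valid UTF-8.
mixed = b"abc \xe5 \xc3\xa5"
escaped = decode_with_marker(mixed)
restored = encode_with_marker(escaped)
```

Here the stray byte 0xE5 comes out as U+0080 followed by "e" (0x65), and encoding the result again gives back the original bytes.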