Unicode String Models

Tue Sep 11 17:40:17 CDT 2018

On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode <unicode at unicode.org>
wrote:

>
> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
> >
> > On Tue, 11 Sep 2018 21:10:03 +0200
> > Hans Åberg via Unicode <unicode at unicode.org> wrote:
> >
> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> >> LaTeX files with sections in different Cyrillic and Latin encodings,
> >> changing the editor encoding while typing.
> >
> > Rather like some of the old Unicode list archives, which are just
> > concatenations of a month's emails, with all sorts of 8-bit encodings
> > and stretches of base64.
>
> It might be useful to represent non-UTF-8 bytes as Unicode code points.
> One way might be to use a codepoint to indicate high bit set followed by
> the byte value with its high bit set to 0, that is, truncated into the
> ASCII range. For example, U+0080 looks like it is not in use, though I
> could not verify this.
>
>
U+0080 is in use: it is a C1 control character, and it encodes in UTF-8 as
0xC2 0x80. The byte 0x80 is in use too, as a continuation byte: U+0400 is
0xD0 0x80, and U+8000 is 0xE8 0x80 0x80.
Four-byte UTF-8 can represent every value from 0 to 0x1FFFFF, which covers
all defined code points (Unicode stops at U+10FFFF). The original definition
allowed up to six bytes and so supported up to U+7FFFFFFF,
and there are enough bits to carry the pattern forward to support 36 bits or
42 bits. (The last one breaks the pattern a bit by allowing a lead byte
without a zero bit: 0xFF would be the lead-in.)
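
To make the byte patterns above concrete, here is a rough Python sketch. It
follows the original UTF-8 bit layout and carries it forward to 0xFE and 0xFF
lead bytes. The function names are my own, the 7- and 8-byte forms are
hypothetical (no standard defines them), and it does not reject surrogates.

```python
def capacity_bits(n: int) -> int:
    """Payload bits available in an n-byte sequence (n from 1 to 8)."""
    if n == 1:
        return 7
    # Each continuation byte carries 6 bits; the lead byte carries 7 - n
    # bits, which drops to zero for the 0xFE and 0xFF lead bytes.
    return 6 * (n - 1) + max(0, 7 - n)


def encode_generalized(cp: int) -> bytes:
    """Encode cp with the UTF-8 pattern, extended (hypothetically) to 8 bytes."""
    if cp < 0:
        raise ValueError("negative value")
    if cp < 0x80:
        return bytes([cp])
    for n in range(2, 9):
        if cp < (1 << capacity_bits(n)):
            break
    else:
        raise ValueError("value needs more than 42 bits")
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    lead = (0xFF << (8 - n)) & 0xFF      # n leading one bits
    return bytes([lead | cp] + tail[::-1])


for cp in (0x80, 0x400, 0x8000, 0x10FFFF):
    # Within the Unicode range this agrees with Python's own encoder.
    assert encode_generalized(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_generalized(cp).hex(' ')}")

print([capacity_bits(n) for n in range(1, 9)])
```

The last line prints the payload size for each sequence length, which is
where the 21, 31, 36 and 42 bit figures come from.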

0xF8-0xFF are unused byte values in current UTF-8, but the code points
U+00F8-U+00FF can all be encoded in UTF-8, as two bytes each.
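
A quick check of the difference between byte values and code points, using
only Python's built-in UTF-8 codec. The helper name is my own, and this only
shows what one decoder does with these bytes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The raw bytes 0xF8..0xFF are never accepted on their own.
stray = {b: is_valid_utf8(bytes([b])) for b in range(0xF8, 0x100)}

# The code points U+00F8..U+00FF encode without trouble.
encoded = {cp: chr(cp).encode("utf-8") for cp in range(0xF8, 0x100)}

for cp, data in encoded.items():
    print(f"byte 0x{cp:02X} valid: {stray[cp]}, U+{cp:04X} -> {data.hex(' ')}")
```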