Custom characters (was: Re: Private Use Area in Use)

Richard Wordingham richard.wordingham at ntlworld.com
Thu Jun 4 14:36:26 CDT 2015


On Thu, 04 Jun 2015 14:39:27 +0000
David Starner <prosfilaes at gmail.com> wrote:

> On Thu, Jun 4, 2015 at 6:09 AM John <idou747 at gmail.com> wrote:
> 
> >  Mostly just a matter of upgrading the character size.
> 
> 
> Which totally blows any concern with text size out of the water.
> Using 30 bytes to define certain very rare characters and 1 byte to
> define ASCII is way better than using 8 bytes to define all
> characters.

The character size can be increased to 64 bits in such a way that no
new surrogates are required, current UTF-8 text remains UTF-8, current
UTF-16 text remains UTF-16 and current UTF-32 remains UTF-32, the
extended UTF-8 still has 8-bit code units, the extended UTF-16 still has
16-bit units, and the extended UTF-32 still has 32-bit code units.  In
fact, the character size can be made unbounded.

The trick is to extend UTF-8 indefinitely, and then for UTF-16 and
UTF-32 repeat the idea of the UTF-8 scheme using sequences of two or
more low surrogates (or two or more high surrogates - one must choose)
much as UTF-8 uses bytes.  Tom Bishop publicised the idea.
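
As a rough illustration of the first half of that trick, here is a minimal
Python sketch that carries the UTF-8 bit pattern past four bytes.  It is not
Tom Bishop's actual proposal, whose details are not given here.  It assumes
only the familiar pattern: a lead byte with n leading 1 bits followed by
n - 1 continuation bytes of six payload bits each.  The name
encode_extended_utf8 is hypothetical, and the sketch stops at seven bytes
(lead byte 0xFE, 36 payload bits); anything truly unbounded would need a
rule for leads longer than one byte, which is left out.

```python
def encode_extended_utf8(value: int) -> bytes:
    """Encode a non-negative integer with the UTF-8 bit pattern, 1 to 7 bytes.

    Hypothetical sketch: values up to U+10FFFF come out identical to
    standard UTF-8; larger values simply continue the same pattern.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0x80:
        # One byte: plain ASCII, unchanged.
        return bytes([value])

    # An n-byte sequence (2 <= n <= 7) carries 5*n + 1 payload bits:
    # (7 - n) bits in the lead byte plus 6 bits per continuation byte.
    for n in range(2, 8):
        if value < 1 << (5 * n + 1):
            break
    else:
        raise ValueError("value needs more than 7 bytes in this sketch")

    # Peel off six bits at a time, least significant first.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6

    # Lead byte: n one-bits, a zero bit, then the remaining payload bits.
    lead = ((0xFF << (8 - n)) & 0xFF) | value
    return bytes([lead] + tail[::-1])
```

For example, encode_extended_utf8(0x20AC) gives the same three bytes as
standard UTF-8 for the euro sign, while encode_extended_utf8(1 << 31) needs
seven bytes beginning with 0xFE.  Existing UTF-8 text is untouched because
every value up to U+10FFFF still gets its usual shortest form.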

Richard.

