Question about Perl5 extended UTF-8 design

Karl Williamson public at khwilliamson.com
Thu Nov 5 09:57:16 CST 2015


Hi,

Several of us are wondering why bits were reserved in perl5's extended
UTF-8.  I'm asking you because you appear to be the author of the
commits that did this.

To refresh your memory, in perl5 UTF-8, a start byte of 0xFF makes the
sequence of bytes that comprises a single character 13 bytes long.
This allows code points up to 2**72 - 1 to be represented.  If the
length had instead been 12 bytes, code points up to 2**66 - 1 could be
represented, which is enough to represent any code point possible in a
64-bit word.
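
As a rough check of the figures above, here is a minimal sketch of the
arithmetic.  It assumes that the 0xFF start byte carries no payload bits
and that each continuation byte carries 6 bits, as in standard UTF-8;
the function name and parameters are only illustrative and are not
taken from the perl5 source.

```python
def max_code_point(sequence_length: int, bits_per_continuation: int = 6) -> int:
    """Largest value encodable in a sequence of the given total length.

    Assumption: the start byte (0xFF) contributes no payload bits, so all
    payload comes from the continuation bytes that follow it.
    """
    continuation_bytes = sequence_length - 1
    payload_bits = continuation_bytes * bits_per_continuation
    return 2 ** payload_bits - 1


# 13-byte sequence: 12 continuation bytes * 6 bits = 72 payload bits
thirteen = max_code_point(13)

# 12-byte sequence: 11 continuation bytes * 6 bits = 66 payload bits
twelve = max_code_point(12)

# Largest value that fits in a 64-bit word
max_64_bit = 2 ** 64 - 1

print(thirteen == 2 ** 72 - 1)   # 13 bytes reach 2**72 - 1
print(twelve == 2 ** 66 - 1)     # 12 bytes reach 2**66 - 1
print(twelve >= max_64_bit)      # 12 bytes already cover a 64-bit word
```

Under those assumptions, an 11-byte sequence would carry only 60 bits,
so 12 bytes is the shortest length that covers a full 64-bit word.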

The comments indicate that these extra bits are "reserved".  So we're 
wondering what potential use you had thought of for these bits.

Thanks

Karl Williamson

