Question about Perl5 extended UTF-8 design

Ilya Zakharevich nospam-abuse at
Thu Nov 5 16:11:37 CST 2015

On Thu, Nov 05, 2015 at 08:57:16AM -0700, Karl Williamson wrote:
> Several of us are wondering about the reason for reserving bits for
> the extended UTF-8 in perl5.  I'm asking you because you are the
> apparent author of the commits that did this.

To start, the INTERNAL REPRESENTATION of Perl’s strings is the «utf8»
format (not «UTF-8», «extended» or not).  [I see that this misprint
caused a lot of stir here!]

Outside of a few contexts, however, this internal representation
should not be visible.  (Some of these contexts are close to the
default, though, such as read/write in Unicode mode with the -C switch.)

A Perl string is just a sequence of Perl’s unsigned integers.
[Depending on the build, these are currently 32-bit or 64-bit.]
By convention, the “meaning” of small integers coincides with what
Unicode says.

> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes
> the length of the sequence of bytes that comprise a single character
> to be 13 bytes.  This allows code points up to 2**72 - 1 to be
> represented. If the length had been instead 12 bytes, code points up
> to 2**66 - 1 could be represented, which is enough to represent any
> code point possible in a 64-bit word.
> The comments indicate that these extra bits are "reserved".  So
> we're wondering what potential use you had thought of for these
> bits.
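
The arithmetic behind the quoted figures can be sketched in a few lines of C.  This is only an illustration: it assumes that the 0xFF start byte carries no payload and that every continuation byte carries 6 payload bits (which is what the 2**72 and 2**66 figures imply).  The names payload_bits and spare_bits are made up for this example.

```c
#include <assert.h>

/* Assumption: the start byte 0xFF carries no payload and every
   continuation byte carries 6 payload bits; this matches the
   2**72 and 2**66 figures quoted above. */
enum { BITS_PER_CONTINUATION = 6 };

/* Payload bits available in a sequence of seq_len bytes whose
   first byte is 0xFF. */
static unsigned payload_bits(unsigned seq_len)
{
    assert(seq_len >= 2);
    return (seq_len - 1) * BITS_PER_CONTINUATION;
}

/* Bits left over after storing a word of word_bits bits
   in a sequence of seq_len bytes. */
static unsigned spare_bits(unsigned seq_len, unsigned word_bits)
{
    assert(payload_bits(seq_len) >= word_bits);
    return payload_bits(seq_len) - word_bits;
}
```

Under these assumptions payload_bits(13) is 72 and payload_bits(12) is 66, so a 64-bit word leaves 8 spare bits in the 13-byte form and 2 in the 12-byte form.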

First of all, “reserved” means that they have no meaning.  Right?

Second, there are 2 ways in which one may need this INTERNAL format to
be extended:
  • 128-bit architectures may be at hand (sooner or later).
  • One may need to allow “objects” to be embedded into Perl strings.

With embedded objects, one must know how to kill them when the string
(or its part) is removed.  So, while a pointer can fit into a Perl
integer, one needs to specify what to do: call DESTROY, or free(), or
a user-defined function.

This gives 5 possibilities (3 extra bits) which may be needed with
“slots” in Perl strings.
  • Integer (≤64 bits)
  • Integer (≥65 bits) 
  • Pointer to a Perl object
  • Pointer to malloc()ed memory
  • Pointer to a struct which knows how to destroy itself.
      struct self_destroy { void *content; void (*destroy)(struct self_destroy *); };
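
A minimal sketch of how such tagged “slots” might be released, one case per kind listed above.  Everything beyond struct self_destroy is hypothetical: the enum, struct slot, slot_release, free_content and perl_destroy_hook are illustration names, not part of Perl, and storage for integers wider than 64 bits is not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* The five slot kinds listed above; five values fit in 3 tag bits. */
enum slot_kind {
    SLOT_INT_SMALL,     /* integer, <= 64 bits */
    SLOT_INT_BIG,       /* integer, >= 65 bits */
    SLOT_PERL_OBJECT,   /* pointer to a Perl object */
    SLOT_MALLOCED,      /* pointer to malloc()ed memory */
    SLOT_SELF_DESTROY,  /* pointer to a struct that destroys itself */
    SLOT_KIND_COUNT
};

struct self_destroy {
    void *content;
    void (*destroy)(struct self_destroy *);
};

/* Hypothetical slot layout: a tag plus either an integer or a pointer. */
struct slot {
    enum slot_kind kind;
    union {
        uint64_t value;
        void *ptr;
    } u;
};

/* Hypothetical hook standing in for "call DESTROY on a Perl object";
   a real interpreter would supply its own mechanism. */
static void (*perl_destroy_hook)(void *object) = NULL;

/* Example destroy callback: frees the content and clears the pointer. */
static void free_content(struct self_destroy *sd)
{
    free(sd->content);
    sd->content = NULL;
}

/* Release whatever the slot owns when its part of the string is removed,
   then turn the slot into an inert small integer. */
static void slot_release(struct slot *s)
{
    switch (s->kind) {
    case SLOT_INT_SMALL:
    case SLOT_INT_BIG:
        break;                      /* nothing owned in this sketch */
    case SLOT_PERL_OBJECT:
        if (perl_destroy_hook)
            perl_destroy_hook(s->u.ptr);
        break;
    case SLOT_MALLOCED:
        free(s->u.ptr);
        break;
    case SLOT_SELF_DESTROY: {
        struct self_destroy *sd = s->u.ptr;
        sd->destroy(sd);
        break;
    }
    default:
        assert(!"unknown slot kind");
    }
    s->kind = SLOT_INT_SMALL;
    s->u.value = 0;
}
```

The point of the tag is that the code removing a substring needs no knowledge of what a slot holds; it only calls slot_release on each slot it drops.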

Why might one need objects embedded in strings?  I explained it in
(look for «Emacs» near the middle).

Hope this helps,
