Corrigendum #9

Thu Jun 12 13:28:45 CDT 2014

On Thu, 12 Jun 2014 01:37:49 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson
> <public at khwilliamson.com> wrote:

> > The FAQ mentions using 0x7FFFFFFF as a possible sentinel.  I did not
> > realize that that was considered representable in any UTF.
> > Likewise -1.

> No, and that's the point of using those. Integer values that are not
> code points make for great sentinels in API functions, such as a
> next() iterator returning -1 when there is no next character.

They work fine as out-of-band alternatives to scalar values.  They do
not work so well inside 8-bit and 16-bit Unicode strings.  A
general-purpose routine that extracts scalar values from Unicode
strings is likely to treat them as errors, rather than simply returning
the scalar value as it would for a noncharacter.  The only way to use
them directly in 8- and 16-bit Unicode strings is to deliberately
create ill-formed Unicode strings.
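
To illustrate the distinction, here is a minimal Python sketch.  The
names next_scalar, can_encode and is_well_formed_utf8 are hypothetical
helpers made up for this example, not part of any library.  It assumes
Python's built-in codecs stand in for the "general purpose routine":
-1 works as a return value, but neither -1 nor 0x7FFFFFFF can be stored
in a UTF-8 or UTF-16 string, whereas the noncharacter U+FFFF can.

```python
SENTINEL = -1  # not a code point, so it cannot collide with real data


def next_scalar(text: str, index: int) -> int:
    """Return the scalar value at index, or SENTINEL when past the end."""
    if index >= len(text):
        return SENTINEL
    return ord(text[index])


def can_encode(value: int, encoding: str) -> bool:
    """True if value can be stored as a character in the given encoding form."""
    try:
        chr(value).encode(encoding)
        return True
    except (ValueError, OverflowError):
        # chr() rejects integers outside 0..0x10FFFF; the codecs reject
        # surrogates (UnicodeEncodeError is a subclass of ValueError).
        return False


def is_well_formed_utf8(data: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Out of band: the sentinel is fine as an API return value.
print(next_scalar("a", 1) == SENTINEL)

# In band: only the noncharacter can actually be stored.
print(can_encode(0xFFFF, "utf-8"), can_encode(0x7FFFFFFF, "utf-8"))

# FD BF BF BF BF BF is how the obsolete 6-byte UTF-8 scheme would have
# written 0x7FFFFFFF; a strict decoder treats it as ill-formed.
print(is_well_formed_utf8(b"\xfd\xbf\xbf\xbf\xbf\xbf"))
```

The last check is the point made above: to get such a value into an
8-bit string at all, one has to write bytes that a conformant decoder
rejects.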

Thus, these 'sentinels' are not full-blown sentinels in the way that
U+0000 is under the C conventions for 'strings' (as opposed to plain
arrays of char): U+0000 can be stored in the string itself.
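
For contrast, a small sketch of the in-band case, again in Python and
again with a made-up helper name (c_string_length), which only mimics
what strlen() does in C.  The terminator U+0000 is stored in the buffer
itself, and the buffer remains well-formed UTF-8.

```python
def c_string_length(buffer: bytes) -> int:
    """Count code units before the first U+0000 terminator, as strlen() would."""
    return buffer.index(0)


# "café" followed by the terminator, as a C string would hold it.
stored = "caf\u00e9".encode("utf-8") + b"\x00"

length = c_string_length(stored)

# The terminator is an ordinary encoded character, so decoding the
# whole buffer, sentinel included, succeeds.
decoded = stored.decode("utf-8")
print(length, decoded[:-1], decoded[-1] == "\x00")
```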

There is a get-out clause: just never accept that the Unicode string
purports to be in a Unicode character encoding form, in which case it
is not required to be well-formed.

Richard.