Corrigendum #9
Richard Wordingham
richard.wordingham at ntlworld.com
Thu Jun 12 13:28:45 CDT 2014
On Thu, 12 Jun 2014 01:37:49 -0700
Markus Scherer <markus.icu at gmail.com> wrote:
> On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson
> <public at khwilliamson.com> wrote:
> > The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not
> > realize that that was considered representable in any UTF.
> > Likewise -1.
> No, and that's the point of using those. Integer values that are not
> code points make for great sentinels in API functions, such as a
> next() iterator returning -1 when there is no next character.
They work fine as alternatives to scalar values. They don't work so
well in 8-bit and 16-bit Unicode strings. A general purpose routine
extracting scalar values from Unicode strings is likely to treat them
as errors rather than just returning the scalar value as it would for
a noncharacter. The only way to use them directly in 8- and
16-bit Unicode strings is to deliberately create ill-formed Unicode
strings.
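
To illustrate the distinction, here is a minimal Python sketch. The names
SENTINEL, next_scalar and can_encode_utf8 are hypothetical and come from no
particular library; Python's own str and UTF-8 codec stand in for a "general
purpose routine". The sentinel works as a return value, but it cannot be
stored in a well-formed UTF-8 string, whereas a noncharacter can.

```python
SENTINEL = -1  # hypothetical: an integer that is not a code point


def next_scalar(text, index):
    """Return (scalar value, next index), or (SENTINEL, index) at the end."""
    if index >= len(text):
        return SENTINEL, index
    return ord(text[index]), index + 1


def can_encode_utf8(value):
    """Report whether an integer can be stored in a well-formed UTF-8 string."""
    try:
        chr(value).encode("utf-8")
        return True
    except (ValueError, OverflowError):
        # chr() rejects integers outside 0..0x10FFFF, and the UTF-8 codec
        # rejects surrogates (UnicodeEncodeError is a subclass of ValueError).
        return False
```

With these definitions, can_encode_utf8(0xFFFF) is true for the noncharacter
U+FFFF, while can_encode_utf8(0x7FFFFFFF) and can_encode_utf8(-1) are both
false.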
Thus, these 'sentinels' are not full-blown sentinels like U+0000 in the
C conventions for 'strings', as opposed to arrays of char.
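
By contrast, U+0000 can sit inside the encoded data itself. A rough sketch of
the C convention, written in Python over a bytes buffer (c_string_length is a
hypothetical helper that mimics what strlen does, and it assumes the buffer
really does contain a terminating zero byte):

```python
def c_string_length(buffer):
    """Count the bytes before the first 0x00 byte, as C's strlen would."""
    length = 0
    while buffer[length] != 0:  # raises IndexError if no terminator exists
        length += 1
    return length


# U+0000 encodes to the single byte 0x00, so the terminated buffer is still
# well-formed UTF-8 even with the sentinel included.
data = "héllo".encode("utf-8") + "\x00".encode("utf-8")
text = data.decode("utf-8")
```

Here the whole of data, terminator included, decodes without error, which is
exactly what -1 and 0x7FFFFFFF cannot offer.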
There is a get-out clause: just never accept that a Unicode string
purports to be in a Unicode character encoding form.
Richard.