Counting Codepoints

David Starner prosfilaes at
Mon Oct 12 18:35:32 CDT 2015

Any system that exposes Unicode strings (as opposed to UTF-16 strings) cannot
have two surrogates merge into a single code point when two strings are
appended. Nothing in the Unicode Standard says that should happen for a string
in an arbitrary format, and it would be unreasonable behavior for a string.
Thus a Unicode string simply can't be held internally as UTF-16 with unpaired
surrogates; a Unicode string in a programmer-opaque format must do something
with broken data on input.
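
To make the merging concrete, here is a small sketch in Python. It assumes Python 3's str, which is a sequence of code points and happens to tolerate lone surrogates, as a stand-in for an opaque Unicode string, and a bytes buffer of UTF-16LE code units as a stand-in for a raw UTF-16 string. The helper name to_utf16_units is made up for the illustration.

```python
# Two broken inputs: an unpaired high surrogate and an unpaired low surrogate.
high = "\ud83d"
low = "\ude00"

# Code point view: appending keeps two separate (still broken) code points.
joined = high + low
codepoint_count = len(joined)

def to_utf16_units(s):
    """Hypothetical helper: dump a str as raw UTF-16LE code units,
    letting lone surrogates through unchanged."""
    return s.encode("utf-16-le", "surrogatepass")

# Code unit view: appending the raw buffers puts the two surrogates next
# to each other, and a UTF-16 decoder then reads them as one valid pair.
raw = to_utf16_units(high) + to_utf16_units(low)
merged = raw.decode("utf-16-le")
merged_count = len(merged)
```

Here codepoint_count is 2 while merged_count is 1: the same append gives a different code point count depending on whether the string is really UTF-16 underneath.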

On 1:27pm, Mon, Oct 12, 2015 Richard Wordingham <
richard.wordingham at> wrote:

> On Mon, 12 Oct 2015 17:29:13 +0200
> Philippe Verdy <verdy_p at> wrote:
> > But between two implementations
> > the result of the scanner could still be different because the
> > replacement character is not specified. If the resulting "sanitized"
> > string is then used to generate a URI, the URI is also unpredictable
> > and will vary between implementations, as well as its effective
> > length. If it is used to generate an identifier granting some new
> > access, such as a user name, several new user names could be
> > generated from the same input.
> TUS 8.0 Section 3 Requirement C10 has the following wise words in its
> final paragraph:
> "However, such repair of mangled data is a special case, and it must
> not be used in circumstances where it would cause security problems."
> Richard.
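
As a rough illustration of the point about unspecified repair, here is a Python sketch with two made-up sanitizers (sanitize_replace and sanitize_drop are hypothetical names, not taken from any library). Both are plausible ways of "doing something" with an unpaired surrogate, and they disagree on both the result and its length. The sketch assumes Python 3's str, where any surrogate code point in a string is by definition unpaired.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def is_surrogate(ch):
    """True for any code point in the surrogate range D800..DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF

def sanitize_replace(s):
    """Hypothetical policy A: swap each lone surrogate for U+FFFD."""
    return "".join(REPLACEMENT if is_surrogate(ch) else ch for ch in s)

def sanitize_drop(s):
    """Hypothetical policy B: silently delete each lone surrogate."""
    return "".join(ch for ch in s if not is_surrogate(ch))

# A requested user name with an unpaired high surrogate in the middle.
raw_name = "ad\ud83dmin"

name_a = sanitize_replace(raw_name)  # six code points, U+FFFD in the middle
name_b = sanitize_drop(raw_name)     # five code points: plain "admin"
```

Policy B turns the broken input into "admin", which is exactly the kind of repair that should not feed into identifiers or access decisions.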