Counting Codepoints
David Starner
prosfilaes at gmail.com
Mon Oct 12 18:35:32 CDT 2015
Any system that exposes Unicode strings (as opposed to UTF-16 strings)
cannot let two surrogates merge into a single code point when two strings
are appended. Nothing in the Unicode Standard says that should happen for
a string in an arbitrary format, and it is unreasonable behavior for a
string. Thus a Unicode string simply can't be stored internally as UTF-16
with unpaired surrogates; a Unicode string in a programmer-opaque format
must do something with broken data on input.
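
A rough sketch of what I mean, in Python purely for illustration. The
helper names (utf16_units, count_codepoints_utf16, sanitize) are made up
for this example, and replacing lone surrogates with U+FFFD is only one
possible input policy, not something the standard mandates:

```python
def utf16_units(s):
    """Return the UTF-16 code units of s, letting lone surrogates through."""
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def count_codepoints_utf16(units):
    """Count code points in UTF-16 code units; a high+low pair counts once."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


def sanitize(s):
    """Hypothetical input policy: replace each surrogate code point with U+FFFD."""
    return "".join("\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s)


high = "a\ud83d"   # ends with an unpaired high surrogate
low = "\ude00b"    # starts with an unpaired low surrogate

# Raw UTF-16 storage: counts are not additive, because the two lone
# surrogates become one supplementary code point after appending.
before = count_codepoints_utf16(utf16_units(high)) + \
    count_codepoints_utf16(utf16_units(low))
after = count_codepoints_utf16(utf16_units(high) + utf16_units(low))

# Sanitized on input: counts stay additive under concatenation.
clean_before = len(sanitize(high)) + len(sanitize(low))
clean_after = len(sanitize(high) + sanitize(low))
```

With raw UTF-16 code units, before is 4 and after is 3, so appending
changed the code point count. With the sanitized strings both values are 4.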
On 1:27pm, Mon, Oct 12, 2015 Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:
> On Mon, 12 Oct 2015 17:29:13 +0200
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > But between two implementations
> > the result of the scanner could still be different because the
> > replacement character is not specified. If the resulting "sanitized"
> > string is then used to generate a URI, the URI is also unpredictable
> > and will vary between implementations, as well as its effective
> > length. If it is used to generate an identifier granting some new
> > access, such as a user name, several new user names could be
> > generated from the same input.
>
> TUS 8.0 Section 3 Requirement C10 has the following, wise words in its
> final paragraph:
>
> "However, such repair of mangled data is a special case, and it must
> not be used in circumstances where it would cause security problems."
>
> Richard.
>