Counting Codepoints

Mon Oct 12 10:33:07 CDT 2015

Read U+FFFD wherever I wrote U+FFFE in my message. (Some applications do
prefer noncharacters for such replacements, so U+FFFE remains a possible
alternative: it also has a valid representation in UTF-16.) U+FFFD is not
the only possible replacement, even though it is recommended, and that
recommendation is a "best practice", not a "requirement" for conformance
purposes.
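
To illustrate why the choice of replacement matters, here is a minimal
Python sketch. The function name `sanitize` and its `replacement` parameter
are hypothetical, made up for this illustration; it works on a plain list of
UTF-16 code units and is not taken from any library or from the standard.

```python
def sanitize(units, replacement):
    """Decode UTF-16 code units, substituting `replacement` for each
    unpaired surrogate. `replacement` is an arbitrary policy choice."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed surrogate pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: what goes here is up to the application.
            out.append(replacement)
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)


# Same ill-formed input: 'a', a lone high surrogate, 'b'.
ill_formed = [0x0061, 0xD800, 0x0062]
with_fffd = sanitize(ill_formed, "\uFFFD")  # recommended practice
with_qmark = sanitize(ill_formed, "?")      # legacy-charset style
with_drop = sanitize(ill_formed, "")        # silently dropped
```

The three results differ from one another, and the last one even has a
different length, although all three are "sanitized" forms of the same
input.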

2015-10-12 17:29 GMT+02:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> 2015-10-12 14:42 GMT+02:00 Mark Davis ☕️ <mark at macchiato.com>:
>
>> If these are not all aligned, then all heck breaks loose: you are letting
>> yourself in for code breakage and/or security problems.
>>
>> So the corresponding code point count would just return a count of 1 for
>> an isolated surrogate.
>>
>
> But the behavior in this case is absolutely not defined, and applications
> are free to do what they want when they encounter isolated surrogates.
> There is not even any guarantee that any further (correctly encoded) code
> point will be returned: even if a replacement character like U+FFFE is
> returned, it could replace all the rest.
>
> So a count of 1 is possible for the first isolated surrogate, but all the
> rest could count as 0 as well, or all the further characters could be
> replaced by U+FFFE independently of what they initially represented. This
> would also be a "sanitized" result.
>
> TUS gives applications freedom of choice. There is absolutely no guarantee
> that all possible "sanitized" results will be the same for all
> applications, and TUS does not even mandate which replacement character to
> use (not necessarily U+FFFE; it could as well be an ASCII '?' character or
> a C0 <SUB> or <DEL> control, when the result is passed on to an application
> converting it to some legacy 7-bit or 8-bit charset).
>
> My opinion is that the only really safe result is to not return any count
> of code points but instead throw an error. Counting code points with a
> function returning an integer is only valid if the UTF-16 input is actually
> a valid representation of code points. You cannot return a single integer,
> because the application using that integer could allocate a processing
> buffer of that size and expect to get exactly this number of code points
> when reading the data into it, and could then leave some positions in that
> buffer uninitialized; or the application could assume that the input was
> left untouched and then get an unexpected digital signature mismatch.
>
> If your function counting code points and returning an integer counts those
> lone surrogates as 1, it assumes that exactly one code point will be
> returned for each lone surrogate, and it should document that clearly,
> meaning that the result is only valid if this matches the results of the
> actual input scanner. In that case the function will never fail or throw
> an exception. But between two implementations the result of the scanner
> could still be different, because the replacement character is not
> specified. If the resulting "sanitized" string is then used to generate a
> URI, the URI is also unpredictable and will vary between implementations,
> as will its effective length. If it is used to generate an identifier
> granting some new access, such as a user name, several new user names
> could be generated from the same input.
>
> So in all cases using replacements will also create security problems.
> This will not happen if you don't return any result but throw an exception.
> (The counting function should document this exception so that it is not
> unexpectedly thrown and left unhandled, causing the program to abort
> prematurely in an unsafe state, including losing other data or leaving a
> transaction elsewhere in an incoherent state.)
>
> For all programs taking some standard UTF input, the input scanner or
> processing functions MUST be prepared to handle the encoding error
> exception, which is a result to be expected just as much as the return of a
> value or the execution of some code! Sanitization is possible, but it is
> not described in the standard, and there are several conflicting ways of
> doing it; it should be a separate subprocess, documented separately.
>
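
The two counting policies discussed in the quoted message can be contrasted
in a short Python sketch. The names `count_code_points`, `strict` and
`LoneSurrogateError` are hypothetical, chosen only for this illustration; the
input is assumed to be a sequence of 16-bit code units, and neither policy is
presented as what TUS or any particular library requires.

```python
from typing import Sequence


class LoneSurrogateError(ValueError):
    """Raised when the UTF-16 input contains an unpaired surrogate."""


def _is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def _is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def count_code_points(units: Sequence[int], strict: bool = True) -> int:
    """Count code points in a sequence of UTF-16 code units.

    strict=True  -> raise LoneSurrogateError on any unpaired surrogate.
    strict=False -> count each unpaired surrogate as one code point.
    """
    count = 0
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if _is_high(unit) and i + 1 < n and _is_low(units[i + 1]):
            i += 2  # well-formed pair: one supplementary code point
        elif _is_high(unit) or _is_low(unit):
            if strict:
                raise LoneSurrogateError(
                    f"unpaired surrogate 0x{unit:04X} at index {i}"
                )
            i += 1  # lenient policy: the lone surrogate counts as one
        else:
            i += 1  # ordinary BMP code unit
        count += 1
    return count
```

With `strict=False` the function never fails, but its result is only
meaningful if the scanner that later reads the data also yields exactly one
code point per lone surrogate. With `strict=True` the caller has to handle
`LoneSurrogateError`, which is why the exception should be documented.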