Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

Philippe Verdy via Unicode unicode at unicode.org
Mon Jul 24 15:50:05 CDT 2017

2017-07-24 21:12 GMT+02:00 J Decker via Unicode <unicode at unicode.org>:

> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
> unicode at unicode.org> wrote:
>> Hi Folks,
>> 2. (Bug) The sending application performs the folding process - inserts
>> CRLF plus white space characters - and the receiving application does the
>> unfolding process but doesn't properly delete all of them.
>> The RFC doesn't say 'characters' but either a space or a tab character
> (singular)
>  back scanning is simple enough
> while( ( from[0] & 0xC0 ) == 0x80 )
> from--;

Certainly not like this! Backscanning should just use a single
assignment to the last known start position, with no loop at all! UTF-8
security is based on the fact that its sequences are strictly limited in
length, so a valid sequence never has more than 3 trailing bytes.
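To illustrate the length limit, here is a minimal sketch that maps a leading
byte to the length of the sequence it introduces. The function name
utf8_seq_len is hypothetical (it is not from the thread), and the ranges
assume UTF-8 as restricted by RFC 3629, where the longest sequence is 4 bytes
(1 leading byte plus 3 trailing bytes).

```c
/* Hypothetical helper, not from the original thread.
 * Returns the total sequence length implied by a leading byte,
 * or 0 if the byte cannot start a UTF-8 sequence. */
int utf8_seq_len(unsigned char lead)
{
    if (lead <= 0x7F) return 1;                  /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 1 trailing byte */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* 2 trailing bytes */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 3 trailing bytes */
    /* 0x80..0xBF are trailing bytes; 0xC0, 0xC1 and 0xF5..0xFF never
     * appear in valid UTF-8. */
    return 0;
}
```

Since no return value exceeds 4, a backward scan from any byte of a valid
sequence needs at most 3 steps to reach the leading byte.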

If you don't have that last position in a variable, just use 3 tests but NO
loop at all: if you are still on a trailing byte after stepping back 3
times, you know the input was not valid, and the way to handle this error
is not to use a very insecure unbounded loop like the one above but to exit
and return an error immediately, or to throw an exception.

Better code would be:

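    // step back over at most 3 trailing bytes (10xxxxxx), one test each;
    // note the parentheses: == binds tighter than & in C
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    if ((from[0] & 0xC0) == 0x80) from--;
    // still on a trailing byte: the input is not valid UTF-8
    if ((from[0] & 0xC0) == 0x80) throw (some exception);
    // continue here with the character encoded as UTF-8 starting at "from"
    // (an ASCII byte or a UTF-8 leading byte)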

And it should be secured with a guard byte at the start of the buffer that
the "from" pointer points into, so that the backscan never reads anything
outside the buffer and can generate an error instead.
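
As a minimal sketch of a bounded backscan that cannot run off the front of
the buffer: instead of a guard byte, this variant takes an explicit pointer
to the start of the buffer, which is a swapped-in technique and not the one
described above. The names utf8_seq_start, IS_TRAIL, start and from are
assumptions made for illustration.

```c
#include <stddef.h>

/* True for UTF-8 trailing bytes (bit pattern 10xxxxxx). */
#define IS_TRAIL(b) (((b) & 0xC0) == 0x80)

/* Hypothetical helper, not from the original thread.
 * "start" is the first byte of the buffer and "from" points at some byte
 * inside it. Returns the nearest non-trailing byte at or before "from",
 * or NULL if that would take more than 3 steps back or would require
 * reading before "start". */
const unsigned char *utf8_seq_start(const unsigned char *start,
                                    const unsigned char *from)
{
    /* Three explicit tests, no loop: one per possible trailing byte. */
    if (IS_TRAIL(*from)) { if (from == start) return NULL; from--; }
    if (IS_TRAIL(*from)) { if (from == start) return NULL; from--; }
    if (IS_TRAIL(*from)) { if (from == start) return NULL; from--; }
    /* Still on a trailing byte: the input is not valid UTF-8. */
    if (IS_TRAIL(*from)) return NULL;
    return from;
}
```

This only locates a candidate start; it does not check that the byte found
is a valid leading byte or that its declared length covers the original
position, so a caller would still need to validate the sequence.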