Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Philippe Verdy via Unicode
unicode at unicode.org
Mon Jul 24 15:50:05 CDT 2017
2017-07-24 21:12 GMT+02:00 J Decker via Unicode <unicode at unicode.org>:
> On Mon, Jul 24, 2017 at 10:57 AM, Costello, Roger L. via Unicode <
> unicode at unicode.org> wrote:
>> Hi Folks,
>> 2. (Bug) The sending application performs the folding process - inserts
>> CRLF plus white space characters - and the receiving application does the
>> unfolding process but doesn't properly delete all of them.
>> The RFC doesn't say 'characters' but either a space or a tab character
> back scanning is simple enough
> while( ( from & 0xC0 ) == 0x80 )
Certainly not like this! Back-scanning should need only a single
assignment to the last known start position, with no loop at all. UTF-8
security rests on the fact that its sequences are strictly limited in
length, so a character never has more than 3 trailing bytes.
If you don't have that last position in a variable, just use 3 tests but NO
loop at all: if all 3 tests fail to reach a start byte, you know the input
was not valid. That error is not solved by an insecure, unbounded loop like
the one above, but by exiting and returning an error immediately, or by
throwing an exception.
The code would be better written as:
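if ((*from & 0xC0) == 0x80) from--;  // 1st trailing byte: step back
if ((*from & 0xC0) == 0x80) from--;  // 2nd trailing byte: step back
if ((*from & 0xC0) == 0x80) from--;  // 3rd trailing byte: step back
if ((*from & 0xC0) == 0x80) throw (some exception);  // more than 3: invalid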
// continue here with the character encoded as UTF-8 starting at "from"
// (an ASCII byte or a UTF-8 leading byte)
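
For illustration, here is a minimal sketch of the three-test back-scan as a
self-contained C function. It assumes the caller also knows where the buffer
begins. The name utf8_char_start and the "begin" parameter are hypothetical
and do not come from this thread, and the function returns NULL where the
snippet above would throw.

```c
#include <stddef.h>

/* Hypothetical helper: return the first byte of the UTF-8 character that
 * contains *from, or NULL if more than 3 trailing bytes precede it or the
 * scan would have to step before `begin`. No loop: at most three steps. */
static const unsigned char *utf8_char_start(const unsigned char *begin,
                                            const unsigned char *from)
{
    /* 0x80..0xBF (top bits 10) marks a trailing byte. */
    if ((*from & 0xC0) == 0x80 && from > begin) from--;
    if ((*from & 0xC0) == 0x80 && from > begin) from--;
    if ((*from & 0xC0) == 0x80 && from > begin) from--;
    if ((*from & 0xC0) == 0x80) return NULL;  /* still trailing: invalid */
    return from;  /* an ASCII byte or a leading byte */
}
```

Note the parentheses around the mask: in C, == binds more tightly than &, so
an unparenthesized "x & 0xC0 == 0x80" would be parsed as "x & (0xC0 == 0x80)".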
It should also be secured with a guard byte at the start of the buffer
that the "from" pointer points into, so that the scan never reads anything
outside the buffer and can generate an error instead.
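
One possible reading of the guard-byte idea, sketched in C under stated
assumptions: the guard value 0xFF (a byte that never occurs in valid UTF-8
and is not a trailing byte) and the helper names utf8_guarded_copy and
utf8_back_guarded are my own choices for illustration, not something fixed
by this thread.

```c
#include <stddef.h>
#include <string.h>

/* Assumed guard value: never valid in UTF-8 and not a trailing byte. */
#define UTF8_GUARD 0xFF

/* Hypothetical helper: copy `len` bytes of `text` into `buf` behind a guard
 * byte. `buf` must hold at least len + 1 bytes. Returns the text start. */
static unsigned char *utf8_guarded_copy(unsigned char *buf,
                                        const unsigned char *text, size_t len)
{
    buf[0] = UTF8_GUARD;
    memcpy(buf + 1, text, len);
    return buf + 1;
}

/* Three tests, no loop, no bounds pointer: each step happens only while the
 * current byte is a trailing byte, so the scan stops at the guard at the
 * latest. Returns NULL on invalid input. */
static const unsigned char *utf8_back_guarded(const unsigned char *from)
{
    if ((*from & 0xC0) == 0x80) from--;
    if ((*from & 0xC0) == 0x80) from--;
    if ((*from & 0xC0) == 0x80) from--;
    if ((*from & 0xC0) == 0x80 || *from == UTF8_GUARD) return NULL;
    return from;
}
```

Because the guard is not a trailing byte, a scan that starts inside the
text never steps past buf[0], and landing on the guard is reported as an
error rather than treated as a character start.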