Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Philippe Verdy via Unicode unicode at
Wed May 17 19:04:55 CDT 2017

I find it intriguing that the update intends to enforce the decoding of the
**shortest** sequences, but now wants to treat **maximal sequences** as a
single unit with arbitrary length. UTF-8 was designed to work only with
state machines that would NEVER need to parse more than 4 bytes.

For me, as soon as the first invalid byte is encountered, the current
sequence should be stopped there and treated as an error (replaced by U+FFFD
if replacement is enabled, instead of returning an error or throwing an
exception), and then any further trailing byte should be treated in
isolation as an error. The number of returned U+FFFD replacements would then
be the same whether you scan the input forward or backward, without **ever**
reading more than 4 bytes in either direction.

This matters when the parser reaches the end of a buffer, where you would
block on I/O to read the previous or next block. Managing a cache of
multiple blocks (more than 2) is a problem with this unexpected change: it
will create new performance problems and add new memory constraints, in
addition to new possible attacks if the parser needs to keep multiple
buffers in memory instead of treating them individually with a single
overhead buffer and throwing away the individual buffers on the fly as soon
as they are individually fully parsed.
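
To make the behaviour I am describing concrete, here is a minimal Python
sketch. It is only one possible reading, and a simplified one: every byte
that is not part of a well-formed sequence yields exactly one U+FFFD. The
names decode_per_byte, _expected_length and _second_byte_range are made up
for this illustration and come from no library or proposal text.

```python
REPLACEMENT = "\ufffd"

def _expected_length(lead):
    """Length announced by a lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray trail byte, or C0, C1, F5..FF

def _second_byte_range(lead):
    """Allowed range for the second byte (excludes overlongs, surrogates, > U+10FFFF)."""
    if lead == 0xE0:
        return (0xA0, 0xBF)
    if lead == 0xED:
        return (0x80, 0x9F)
    if lead == 0xF0:
        return (0x90, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)
    return (0x80, 0xBF)

def decode_per_byte(data):
    """Decode bytes, emitting one U+FFFD per byte outside a well-formed sequence.

    Assumption of this sketch: errors are reported per byte. The decoder
    never inspects more than 4 bytes starting at the current position.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        length = _expected_length(lead)
        ok = length > 0 and i + length <= n
        if ok and length > 1:
            lo, hi = _second_byte_range(lead)
            ok = (lo <= data[i + 1] <= hi
                  and all(0x80 <= b <= 0xBF for b in data[i + 2:i + length]))
        if ok:
            out.append(data[i:i + length].decode("utf-8"))
            i += length
        else:
            out.append(REPLACEMENT)  # only this one byte is consumed
            i += 1
    return "".join(out)
```

With this reading, decode_per_byte(b"A\xe0\x80\x80B") returns "A" followed by
three U+FFFD and then "B". Since well-formed sequences cannot overlap, the
number of replacements does not depend on the scan direction, and a backward
scan would need to back up at most 3 bytes to find a lead byte.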

2017-05-18 1:41 GMT+02:00 Asmus Freytag via Unicode <unicode at>:

> On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote:
> There's some sort of rule that proposals should be made seven days in
> advance of the meeting.  I can't find it now, so I'm not sure whether
> the actual rule was followed, let alone what authority it has.
> Ideally, proposals that update algorithms or properties of some
> significance should be required to be reviewed in more than one pass. The
> procedures of the UTC are a bit weak in that respect, at least compared to
> other standards organizations. The PRI process addresses that issue to some
> extent.
> A./