Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Alastair Houghton via Unicode unicode at
Tue May 23 04:17:06 CDT 2017

On 23 May 2017, at 07:10, Jonathan Coxhead via Unicode <unicode at> wrote:
> On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote:
>> On 18 May 2017, at 07:18, Henri Sivonen via Unicode <unicode at> wrote:
>>> the decision complicates U+FFFD generation when validating UTF-8 by state machine.
>> It *really* doesn’t.  Even if you’re hell bent on using a pure state machine approach, you need to add maybe two additional error states (two-trailing-bytes-to-eat-then-fffd and one-trailing-byte-to-eat-then-fffd) on top of the states you already have.  The implementation complexity argument is a *total* red herring.
> Heh. A state machine with N+2 states is, a fortiori, more complex than one with N states. So I think your argument is self-contradictory.

You’re being overly pedantic (and in this case, actually, the cyclomatic complexity of the state machine wouldn’t increase).  In any case, Henri is complaining that it’s too difficult to implement; it isn’t.  You need two extra states, both of which are trivial.
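
As a rough illustration of those two extra states, here is a minimal Python sketch of a UTF-8 decoder. It rests on assumptions that are not spelled out in this thread: that the proposed behaviour is to emit a single U+FFFD for a lead byte followed by a continuation byte outside its valid range, after swallowing the remaining trailing bytes, and that C0, C1, F5..FF and stray continuation bytes each yield their own U+FFFD. The names decode, _lead_info and eat are made up for the example; eat == 2 and eat == 1 play the role of the "two-trailing-bytes-to-eat-then-fffd" and "one-trailing-byte-to-eat-then-fffd" states.

```python
REPLACEMENT = "\ufffd"


def _lead_info(b):
    """Return (trailing bytes, min 2nd byte, max 2nd byte, initial bits), or None."""
    if 0xC2 <= b <= 0xDF:
        return 1, 0x80, 0xBF, b & 0x1F
    if 0xE0 <= b <= 0xEF:
        lo = 0xA0 if b == 0xE0 else 0x80   # reject overlong forms
        hi = 0x9F if b == 0xED else 0xBF   # reject surrogates
        return 2, lo, hi, b & 0x0F
    if 0xF0 <= b <= 0xF4:
        lo = 0x90 if b == 0xF0 else 0x80   # reject overlong forms
        hi = 0x8F if b == 0xF4 else 0xBF   # reject values above U+10FFFF
        return 3, lo, hi, b & 0x07
    return None


def decode(data):
    """Decode bytes to str under the ASSUMED one-U+FFFD-per-sequence policy."""
    out = []
    need = 0             # trailing bytes still expected (normal states)
    eat = 0              # trailing bytes still to swallow (the extra error states)
    cp = 0
    lo, hi = 0x80, 0xBF
    i = 0
    while i < len(data):
        b = data[i]
        is_cont = 0x80 <= b <= 0xBF
        if eat:
            if is_cont:
                eat -= 1
                i += 1
                if eat == 0:
                    out.append(REPLACEMENT)
            else:
                # Sequence ended early: report it, then reprocess this byte.
                out.append(REPLACEMENT)
                eat = 0
            continue
        if need:
            if lo <= b <= hi:
                cp = (cp << 6) | (b & 0x3F)
                need -= 1
                lo, hi = 0x80, 0xBF
                i += 1
                if need == 0:
                    out.append(chr(cp))
            elif is_cont:
                # Bad second byte that is still a continuation byte;
                # this only happens while need is 2 or 3.
                eat = need - 1
                need = 0
                i += 1
            else:
                # Truncated sequence: report it, then reprocess this byte.
                out.append(REPLACEMENT)
                need = 0
            continue
        info = _lead_info(b)
        if b < 0x80:
            out.append(chr(b))
        elif info is not None:
            need, lo, hi, cp = info
        else:
            out.append(REPLACEMENT)   # stray continuation or invalid lead byte
        i += 1
    if need or eat:
        out.append(REPLACEMENT)       # input ended in the middle of a sequence
    return "".join(out)
```

Under those assumptions, decode(b"\xe0\x80\x80") and decode(b"\xf0\x80\x80\x80") each produce one U+FFFD, and the only additions to an ordinary validating decoder are the branches guarded by eat.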

The point I was making was that this is not a strong argument against the proposed change, *even if* we were treating it as a requirement, which it isn’t.

Kind regards,


