Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Richard Wordingham via Unicode unicode at
Wed May 31 15:06:29 CDT 2017

On Wed, 31 May 2017 17:43:08 +0000
Shawn Steele via Unicode <unicode at> wrote:

> There also appears to be a special weight given to
> non-minimally-encoded sequences.  It would seem to me that none of
> these illegal sequences should appear in practice, so we have either:


> I do not understand the energy being invested in a case that
> shouldn't happen, especially in a case that is a subset of all the
> other bad cases that could happen.

That's not the motivation for my using a structurally based approach.
I want to expend as little energy as possible, both in thought (Keep
It Simple, Stupid) and in machine cycles, in catering for these
overlong/non-scalar value cases. I have to cater for indisputably
illegal truncated sequences, but for the rest of it I optimise for the
conformant case. If I'm extracting scalar values, I calculate the
scalar value and then check that it's legal. If I'm advancing through a
string, I just advance by the requisite number of trailing bytes.
UTF-8 is simple in concept, and I try to follow that simplicity.  A
state machine overcomplicates it.
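
For illustration only, here is a minimal sketch of what such a structurally
based approach might look like in Python. The function names and the choice
of where to resume after an error are assumptions made for this example,
not the code described in this message. The length comes from the lead byte
alone, the value is computed without looking at range rules, and legality
is checked once at the end; only truncation is handled inside the loop.

```python
# Sketch only: hypothetical helper names, not the poster's actual code.

def sequence_length(lead: int) -> int:
    """Length implied by the lead byte's bit pattern; 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # stray trail byte (0x80..0xBF) or 0xF8..0xFF

# Smallest value that legitimately needs a sequence of 1, 2, 3, 4 bytes.
MIN_FOR_LENGTH = (0x00, 0x80, 0x800, 0x10000)

def extract_scalar(buf: bytes, i: int):
    """Return (scalar value or None, index at which to resume)."""
    n = sequence_length(buf[i])
    if n == 0:
        return None, i + 1
    value = buf[i] if n == 1 else buf[i] & (0x7F >> n)
    for k in range(1, n):
        # Truncated sequence: the one case that must be caught while reading.
        if i + k >= len(buf) or (buf[i + k] & 0xC0) != 0x80:
            return None, i + k
        value = (value << 6) | (buf[i + k] & 0x3F)
    # Compute first, then check: overlong, surrogate, or beyond U+10FFFF.
    if (value < MIN_FOR_LENGTH[n - 1]
            or 0xD800 <= value <= 0xDFFF
            or value > 0x10FFFF):
        return None, i + n
    return value, i + n

def advance(buf: bytes, i: int) -> int:
    """Step past one sequence by skipping its trail bytes, without decoding."""
    n = max(sequence_length(buf[i]), 1)
    j = i + 1
    while j < i + n and j < len(buf) and (buf[j] & 0xC0) == 0x80:
        j += 1
    return j
```

In this sketch an overlong or non-scalar sequence is consumed as a single
unit, while a truncated sequence ends at the first byte that is not a
trail byte.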

Moreover, if I want to handle CESU-8 or U+0000 as opposed to a sentinel
null, it is easy to add special case logic to a scalar value extractor.
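
As a rough sketch of what that special-case logic could look like, assume
an extractor that has already computed a value and knows the byte length
of the sequence it came from. The function names and flags below are
hypothetical, chosen for this example. The two-byte form C0 80 standing
for U+0000 becomes one extra test before the overlong check, and CESU-8
becomes a matter of letting surrogate values through and pairing them up
afterwards.

```python
# Sketch only: hypothetical names and flags, written for illustration.

# Smallest value that legitimately needs a sequence of 1, 2, 3, 4 bytes.
MIN_FOR_LENGTH = (0x00, 0x80, 0x800, 0x10000)

def is_legal(value: int, n: int, *, modified_nul: bool = False,
             cesu8: bool = False) -> bool:
    """Legality check run after the value of an n-byte sequence is computed."""
    if modified_nul and n == 2 and value == 0:
        return True  # C0 80 accepted as U+0000, distinct from a sentinel null
    if value < MIN_FOR_LENGTH[n - 1]:
        return False  # overlong
    if 0xD800 <= value <= 0xDFFF:
        return cesu8  # surrogate code units are only accepted for CESU-8
    return value <= 0x10FFFF

def combine_surrogates(high: int, low: int):
    """Join a CESU-8 surrogate pair into one scalar value, or return None."""
    if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return None
```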

> -Shawn 
