Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Richard Wordingham via Unicode
unicode at unicode.org
Wed May 31 15:06:29 CDT 2017
On Wed, 31 May 2017 17:43:08 +0000
Shawn Steele via Unicode <unicode at unicode.org> wrote:
> There also appears to be a special weight given to
> non-minimally-encoded sequences. It would seem to me that none of
> these illegal sequences should appear in practice, so we have either:
> I do not understand the energy being invested in a case that
> shouldn't happen, especially in a case that is a subset of all the
> other bad cases that could happen.
That's not the motivation for my using a structurally based approach.
I want to expend as little energy as possible, both in thought (Keep
It Simple, Stupid) and in machine cycles, in catering for these
overlong/non-scalar value cases. I have to cater for indisputably
illegal truncated sequences, but for the rest of it I optimise for the
conformant case. If I'm extracting scalar values, I calculate the
scalar value and then check that it's legal. If I'm advancing through a
string, I just advance by the requisite number of trailing bytes.
UTF-8 is simple in concept, and I try to follow that simplicity. A
state machine overcomplicates it.
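
A minimal sketch of the structural approach described above, not the
author's actual code. The function names, the choice of returning None
for a rejected sequence, and the rule for where to resume after a
truncated sequence are all assumptions made for illustration. The lead
byte alone fixes the number of trailing bytes, only truncation is
detected while reading, and the value is checked for legality (overlong,
surrogate, above U+10FFFF) once it has been calculated.

```python
# Illustrative sketch; names and error policy are assumptions, not from the post.

def trailing_count(lead: int) -> int:
    """Trailing bytes implied by a lead byte, or -1 if the byte cannot lead."""
    if lead < 0x80:
        return 0
    if 0xC0 <= lead < 0xE0:
        return 1
    if 0xE0 <= lead < 0xF0:
        return 2
    if 0xF0 <= lead < 0xF8:
        return 3
    return -1  # stray trailing byte (0x80..0xBF) or 0xF8..0xFF

# Smallest value that legitimately needs 0, 1, 2 or 3 trailing bytes.
MIN_FOR_TRAILING = (0x0, 0x80, 0x800, 0x10000)

def extract_scalar(data: bytes, i: int):
    """Return (scalar_or_None, next_index) for the sequence starting at data[i]."""
    lead = data[i]
    n = trailing_count(lead)
    if n < 0:
        return None, i + 1
    if n == 0:
        return lead, i + 1
    value = lead & (0x3F >> n)  # payload bits of the lead byte
    j = i + 1
    for _ in range(n):
        # Truncation is the only error detected while reading.
        if j >= len(data) or (data[j] & 0xC0) != 0x80:
            return None, j
        value = (value << 6) | (data[j] & 0x3F)
        j += 1
    # Structurally complete: now check that the calculated value is legal.
    if value < MIN_FOR_TRAILING[n] or 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        return None, j
    return value, j

def advance(data: bytes, i: int) -> int:
    """Skip one sequence, trusting the lead byte (optimised for conformant input)."""
    n = trailing_count(data[i])
    return min(i + 1 + max(n, 0), len(data))
```

With this sketch an overlong or surrogate sequence is consumed as a
single unit, because the check on the value happens only after all the
trailing bytes have been read.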
Moreover, if I want to handle CESU-8 or U+0000 as opposed to a sentinel
null, it is easy to add special case logic to a scalar value extractor.
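
As a hedged illustration of that kind of special-case logic, the sketch
below layers two optional rules over a structural extractor. It reads
"U+0000 as opposed to a sentinel null" as the C0 80 encoding of U+0000
used by Modified UTF-8; that reading, the function names and the two
flag names are assumptions, not something stated in the post. The
CESU-8 handling is deliberately lenient: it joins a surrogate pair
encoded as two 3-byte sequences but does not reject 4-byte forms.

```python
# Illustrative sketch; names, flags and error policy are assumptions.

def raw_value(data: bytes, i: int):
    """Structural decode only: (value_or_None, next_index), no legality check."""
    lead = data[i]
    if lead < 0x80:
        return lead, i + 1
    if lead < 0xC0 or lead > 0xF7:
        return None, i + 1  # cannot start a sequence
    n = 1 if lead < 0xE0 else 2 if lead < 0xF0 else 3
    value = lead & (0x3F >> n)
    j = i + 1
    for _ in range(n):
        if j >= len(data) or (data[j] & 0xC0) != 0x80:
            return None, j  # truncated
        value = (value << 6) | (data[j] & 0x3F)
        j += 1
    return value, j

# Smallest legal value for a sequence of 1, 2, 3 or 4 bytes.
MIN_FOR_SIZE = (0x0, 0x80, 0x800, 0x10000)

def extract(data: bytes, i: int, *, null_as_c0_80: bool = False, cesu8: bool = False):
    """Return (scalar_or_None, next_index), with optional special cases."""
    value, j = raw_value(data, i)
    if value is None:
        return None, j
    size = j - i
    # Special case 1: accept C0 80 as U+0000 (hypothetical flag).
    if null_as_c0_80 and value == 0 and size == 2:
        return 0, j
    # Special case 2: join a CESU-8 surrogate pair (hypothetical flag).
    if cesu8 and size == 3 and 0xD800 <= value <= 0xDBFF and j < len(data):
        low, k = raw_value(data, j)
        if low is not None and k - j == 3 and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((value - 0xD800) << 10) + (low - 0xDC00), k
    # Ordinary legality check on the calculated value.
    if value < MIN_FOR_SIZE[size - 1] or 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        return None, j
    return value, j
```

Both special cases sit between the structural decode and the legality
check, so the conformant path is unchanged when the flags are off.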