Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Shawn Steele via Unicode
unicode at unicode.org
Wed May 31 14:28:03 CDT 2017
> it’s more meaningful for whoever sees the output to see a single U+FFFD representing
> the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid lead byte and
> then another for an “unexpected” trailing byte.
I disagree. For some applications, a single U+FFFD representing an illegally encoded 2-byte NUL may be more meaningful than two U+FFFDs. But then you can't tell whether the input was an illegally encoded 2-byte NUL, an illegally encoded 3-byte NUL, or something else, so information that other applications may care about is lost.
Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the byte, and try again" approach.
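As a rough illustration of that approach, here is a minimal Python sketch. The function name `decode_drop_one_byte` is hypothetical and not taken from any proposal text, and it leans on Python's strict UTF-8 decoder to decide what counts as a well-formed sequence. It is only meant to show the shape of the idea, not a reference implementation.

```python
REPLACEMENT = "\uFFFD"

def decode_drop_one_byte(data: bytes) -> str:
    """Decode UTF-8; on any ill-formed sequence emit one U+FFFD,
    skip exactly one byte, and try again from the next byte.
    (Hypothetical helper, for illustration only.)"""
    out = []
    i = 0
    while i < len(data):
        # A well-formed UTF-8 sequence is 1 to 4 bytes long. Try each
        # length in turn; the strict decoder rejects overlong forms,
        # surrogates, and truncated sequences.
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                out.append(chunk.decode("utf-8", errors="strict"))
            except UnicodeDecodeError:
                continue
            i += len(chunk)
            break
        else:
            # No prefix starting at i is well-formed: emit U+FFFD,
            # drop this one byte, and retry at the next byte.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)

# Overlong 2-byte and 3-byte encodings of NUL stay distinguishable
# by the number of replacement characters they produce.
two_byte_nul = decode_drop_one_byte(b"\xc0\x80")        # 2 x U+FFFD
three_byte_nul = decode_drop_one_byte(b"\xe0\x80\x80")  # 3 x U+FFFD
```

With this policy every rejected byte yields its own U+FFFD, so the output length still reflects how many bytes were bad, at the cost of noisier output than a one-replacement-per-sequence policy.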
-Shawn