Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 08:10:05 CDT 2017

On Tue, 16 May 2017 20:08:52 +0900
"Martin J. Dürst via Unicode" <unicode at unicode.org> wrote:

> I agree with others that ICU should not be considered to have a
> special status, it should be just one implementation among others.

> [The next point is a side issue, please don't spend too much time on 
> it.] I find it particularly strange that at a time when UTF-8 is
> firmly defined as up to 4 bytes, never including any bytes above
> 0xF4, the Unicode consortium would want to consider recommending that
> <FD 81 82 83 84 85> be converted to a single U+FFFD. I note with
> agreement that Markus seems to have thoughts in the same direction,
> because the proposal (17168-utf-8-recommend.pdf) says "(I suppose
> that lead bytes above F4 could be somewhat debatable.)".

The undesirable sidetrack, I suppose, is worrying about how many planes
will be required for emoji.

However, it does make the point that, while some practices may be
better than others, there isn't necessarily a best practice.

The English of the proposal is unclear; the text would benefit from
showing some examples of maximal subsequences (poor terminology, as some
of us are used to subsequences being non-contiguous).  When he writes,
"For UTF-8, recommend evaluating maximal subsequences based on the
original structural definition of UTF-8, without ever restricting trail
bytes to less than 80..BF", I am pretty sure he means "For UTF-8,
recommend evaluating maximal subsequences based on the original
structural definition of UTF-8, with the only restriction on trail
bytes, beyond the number of them, being that they must be in the range
80..BF".

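Thus Philippe's example of "E0 E0 C3 89" would be converted, with an
error flagged, to the sequence of scalar values FFFD FFFD C9.  Each E0
is a three-byte lead with no trail byte after it, so each forms a
maximal subsequence on its own, and C3 89 is well-formed.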
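
As a rough sketch of this reading (not the proposal's own wording, and
not ICU's code), the Python below groups bytes by the original
structural definition: a lead byte plus as many bytes in 80..BF as the
lead calls for, up to six bytes in total. The names structural_length
and decode_structural are made up for illustration. Turning a complete
but invalid sequence (overlong, surrogate, above 10FFFF, or a 5- or
6-byte form) into a single FFFD is an assumption about what the
proposal intends.

```python
REPLACEMENT = 0xFFFD


def structural_length(lead):
    """Length implied by a lead byte under the original UTF-8 structure.

    Returns 0 for bytes that cannot start a sequence.
    """
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # 80..BF (stray trail byte), FE, FF


def decode_structural(data):
    """Decode bytes to a list of scalar values, one FFFD per bad subsequence."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        need = structural_length(lead)
        if need <= 1:
            # ASCII passes through; a byte that cannot lead gets its own FFFD.
            out.append(lead if need == 1 else REPLACEMENT)
            i += 1
            continue
        # Collect up to need-1 trail bytes; the only test is the range 80..BF.
        j = i + 1
        while j < len(data) and j - i < need and 0x80 <= data[j] <= 0xBF:
            j += 1
        seq = data[i:j]
        i = j
        if len(seq) == need and need <= 4:
            cp = lead & (0x7F >> need)  # payload bits of the lead byte
            for b in seq[1:]:
                cp = (cp << 6) | (b & 0x3F)
            smallest = (0, 0, 0x80, 0x800, 0x10000)[need]  # rejects overlongs
            if smallest <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF:
                out.append(cp)
                continue
        # Truncated, overlong, surrogate, out of range, or 5/6-byte form.
        out.append(REPLACEMENT)
    return out


print([format(cp, "X") for cp in decode_structural(bytes.fromhex("E0E0C389"))])
print([format(cp, "X") for cp in decode_structural(bytes.fromhex("FD8182838485"))])
```

With this sketch, "E0 E0 C3 89" gives FFFD FFFD C9, and Martin's
<FD 81 82 83 84 85> collapses to a single FFFD. For comparison,
Python's built-in bytes.decode("utf-8", "replace") does not treat FD as
a lead byte and produces six FFFD for the latter.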

This may make a UTF-8 system usable if it tries to use something like
non-characters as understood before CLDR was caught publishing them
as an essential part of text files.

Richard.