Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Mon May 15 13:02:34 CDT 2017
On 15 May 2017, at 18:52, Asmus Freytag <asmusf at ix.netcom.com> wrote:
> On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
>> On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>>> In reference to:
>>> I think Unicode should not adopt the proposed change.
>> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense.
> Changing a specification as fundamental as this is something that should not be undertaken lightly.
> Apparently we have a situation where implementations disagree, and have done so for a while. This normally means not only that the implementations differ, but that data exists in both formats.
> Even if it were true that all data is only stored in UTF-8, any data converted from UTF-8 back to UTF-8 going through an interim stage that requires UTF-8 conversion would then be different based on which converter is used.
> Implementations working in UTF-8 natively would potentially see three formats:
> 1) the original ill-formed data
> 2) data converted with single FFFD
> 3) data converted with multiple FFFD
> These forms cannot be compared for equality by binary matching.
But that was always true, even if you were under the impression that only one of (2) and (3) existed. Indeed, claiming equality between two instances of U+FFFD might itself be problematic in some circumstances (you don’t know why the U+FFFDs were inserted - they may not replace the same original data).
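To make forms (2) and (3) concrete, here is a rough Python sketch. It assumes CPython 3's built-in decoder with errors="replace" as an example of the "multiple FFFD" behaviour, and uses a hypothetical helper, decode_single_fffd, as a stand-in for the "single FFFD" behaviour (a lead byte plus the continuation bytes it claims collapse into one U+FFFD). The helper is an illustration only, not the proposal's wording or any particular library's algorithm.

```python
def decode_single_fffd(data: bytes) -> str:
    """Hypothetical 'single FFFD' policy: an ill-formed lead byte together
    with the continuation bytes it claims becomes one U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII passes straight through
            out.append(chr(b))
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need = 1                      # lead byte of a 2-byte form
        elif 0xE0 <= b <= 0xEF:
            need = 2                      # lead byte of a 3-byte form
        elif 0xF0 <= b <= 0xF7:
            need = 3                      # lead byte of a 4-byte form
        else:
            need = 0                      # stray continuation or invalid byte
        j = i + 1
        # Swallow up to `need` continuation bytes (0x80..0xBF).
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            out.append(data[i:j].decode("utf-8"))   # well-formed: keep it
        except UnicodeDecodeError:
            out.append("\ufffd")                    # ill-formed: one U+FFFD
        i = j
    return "".join(out)


ill_formed = b"a\xe0\x80\x80b"            # form (1): over-long encoding of U+0000

single = decode_single_fffd(ill_formed)                   # form (2)
multiple = ill_formed.decode("utf-8", errors="replace")   # form (3)

print(ascii(single))      # 'a\ufffdb'
print(ascii(multiple))    # 'a\ufffd\ufffd\ufffdb'
print(single == multiple) # False
```

The same input bytes yield two different strings, so neither can be matched against the other (or against the original bytes) by simple binary comparison.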
> The best that can be done is to convert (1) into one of the other forms and then compare treating any run of FFFD code points as equal to any other run, irrespective of length.
It’s probably safer, actually, to refuse to compare U+FFFD as equal to anything (even itself) unless a special flag is passed. For “general purpose” applications, you could set that flag and then a single U+FFFD would compare equal to another single U+FFFD; no need for the complicated “any string of U+FFFD” logic (which in any case makes little sense - it could just as easily generate erroneous comparisons as fix the case we’re worrying about here).
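A minimal sketch of the two comparison policies discussed above, in Python. The function names and the allow_fffd flag are made up for illustration; nothing here comes from an existing library. equal_strict treats U+FFFD a bit like NaN unless the flag is set, while equal_collapsing_runs implements the "any run equals any other run" idea.

```python
import re

REPLACEMENT = "\ufffd"


def equal_strict(a: str, b: str, *, allow_fffd: bool = False) -> bool:
    """Hypothetical comparison: by default a string containing U+FFFD is
    equal to nothing, not even itself. With allow_fffd=True this falls
    back to ordinary code point equality."""
    if not allow_fffd and (REPLACEMENT in a or REPLACEMENT in b):
        return False
    return a == b


def equal_collapsing_runs(a: str, b: str) -> bool:
    """The alternative policy: collapse every run of U+FFFD to a single
    U+FFFD before comparing."""
    def collapse(s: str) -> str:
        return re.sub("\ufffd+", "\ufffd", s)
    return collapse(a) == collapse(b)


one = "a\ufffdb"
three = "a\ufffd\ufffd\ufffdb"

print(equal_strict(one, one))                      # False
print(equal_strict(one, one, allow_fffd=True))     # True
print(equal_strict(one, three, allow_fffd=True))   # False
print(equal_collapsing_runs(one, three))           # True

# Two different ill-formed inputs that the run-collapsing policy
# nevertheless reports as equal.
x = b"a\xffb".decode("utf-8", errors="replace")
y = b"a\xfe\xfeb".decode("utf-8", errors="replace")
print(equal_collapsing_runs(x, y))                 # True
```

The last case is the kind of erroneous match the run-collapsing rule can produce: the original byte sequences differ, yet the comparison says they are equal.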
> Because we've had years of multiple implementations, it would be expected that copious data exists in all three formats, and that data will not go away. Changing the specification to pick one of these formats as solely conformant is IMHO too late.
I don’t think so. Even if we acknowledge the possibility of data in the other form, I think it’s useful guidance to implementers, both now and in the future. One might even imagine that the other, non-favoured form, would eventually fall out of use.