Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Mon May 15 17:53:13 CDT 2017

2017-05-15 19:54 GMT+02:00 Asmus Freytag via Unicode <unicode at unicode.org>:

> I think this political reason should be taken very seriously. There are
> already too many instances where ICU can be seen "driving" the development
> of property and algorithms.
>
> Those involved in the ICU project may not see the problem, but I agree
> with Henri that it requires a bit more sensitivity from the UTC.
>
I don't think that the fact that ICU originally used UTF-16
internally has ANY effect on the decision to represent ill-formed sequences
as single or multiple U+FFFD.
The internal encoding has nothing in common with the external encoding used
when processing input data (which may be UTF-8, UTF-16, or UTF-32, and could
in all cases contain ill-formed sequences). The internal encoding plays
no role in how the ill-formed input is converted, or whether it is
converted at all.
So yes, independently of the internal encoding, we'll still have to choose
between:
- not converting the input, and returning an error or throwing an exception
- converting the input using a single U+FFFD (its internal
representation does not matter) to replace the complete sequence of
ill-formed code units in the input data, and preferably returning an error
status
- converting the input using as many U+FFFD (again, the internal
representation does not matter) as needed to replace every occurrence of
an ill-formed code unit in the input data, and preferably returning an error
status.
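
To make the three choices concrete, here is a minimal Python sketch for
UTF-8 input. The function names (decode_strict, decode_replacing) and the
per_code_unit flag are hypothetical and chosen only for illustration. The
sketch leans on Python's strict UTF-8 decoder and on the start/end offsets
carried by UnicodeDecodeError to locate the ill-formed bytes; it is not
meant to describe what ICU or any other library actually does.

```python
REPLACEMENT = "\uFFFD"


def decode_strict(data: bytes) -> str:
    """Choice 1: do not convert; raise UnicodeDecodeError on ill-formed input."""
    return data.decode("utf-8", errors="strict")


def decode_replacing(data: bytes, per_code_unit: bool) -> tuple[str, bool]:
    """Choices 2 and 3: convert anyway and return (text, had_error).

    per_code_unit=False: one U+FFFD for a whole run of adjacent ill-formed bytes.
    per_code_unit=True:  one U+FFFD for every ill-formed byte.
    """
    out = []
    pos = 0
    had_error = False
    prev_was_bad = False  # did the previous chunk end in ill-formed bytes?
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8", errors="strict"))
            break
        except UnicodeDecodeError as err:
            had_error = True
            # err.start and err.end are offsets into the slice data[pos:].
            if err.start > 0:
                # Bytes before the error are well-formed; keep them as-is.
                out.append(data[pos:pos + err.start].decode("utf-8"))
                prev_was_bad = False
            if per_code_unit:
                out.append(REPLACEMENT * (err.end - err.start))
            elif not prev_was_bad:
                out.append(REPLACEMENT)
            prev_was_bad = True
            pos += err.end
    return "".join(out), had_error


if __name__ == "__main__":
    sample = b"A\xf0\x80\x80B"  # F0 80 80 is ill-formed UTF-8
    print(decode_replacing(sample, per_code_unit=False))
    print(decode_replacing(sample, per_code_unit=True))
```

In both replacing variants the caller still receives an error status next
to the converted text, so the choice between one or several U+FFFD only
changes the output string, not whether the problem is reported.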