Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Philippe Verdy via Unicode unicode at unicode.org
Tue May 16 06:15:33 CDT 2017


2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode <unicode at unicode.org>:

> > One additional note: the standard codifies this behaviour as a
> *recommendation*, not a requirement.
>
> This is an odd argument in favor of changing it. If the argument is
> that it's just a recommendation that you don't need to adhere to,
> surely then the people who don't like the current recommendation
> should choose not to adhere to it instead of advocating changing it.


I also agree. The internet is full of RFC specifications that are likewise
"best practices" rather than requirements, and even in that case changing
them must be extensively documented, including a discussion of the new
compatibility/interoperability problems and new security risks.

The case of random access into substrings is significant because what was
once valid UTF-8 could become invalid if the best-practice recommendation
is not followed, and could then cause unexpected failures: uncaught
exceptions making software suddenly fail and become subject to possible
attacks exploiting this new failure mode (this is mostly a problem for
implementations that do not use "safe" U+FFFD replacements but instead
throw exceptions on ill-formed input: we should not change the cases where
these exceptions may occur by adding new cases caused by a change of
implementation based on a change of best practice).

The considerations about trying to reduce the number of U+FFFDs are not
relevant; they are purely aesthetic, because some people would like to
compact the decoded result in memory. What is really important is not to
silently ignore these ill-formed sequences, and to properly track the fact
that there was some data loss. The number of U+FFFDs inserted (only one,
or as many as there are invalid code units in the input before the first
resynchronization point) is not so important.
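
As a concrete illustration (my own example): take the truncated three-byte
sequence <E1 80> followed by the ASCII byte 41. Both replacement policies
record that data was lost; they differ only in the count:

    input bytes:                    E1 80 41
    one U+FFFD per maximal subpart: U+FFFD U+0041         (1 replacement)
    one U+FFFD per invalid byte:    U+FFFD U+FFFD U+0041  (2 replacements)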

As well, whether implementations use an accumulator or just a single state
(where each state knows how many code units have been parsed without
emitting an output code point, so that those code units can be decoded by
relative indexed accesses) is not relevant; it is just a very minor
optimization case (in my opinion, using an accumulator that can live in a
CPU register is faster than using relative indexed accesses).
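
As a sketch of the accumulator approach (a hypothetical C fragment;
decode3 is an illustrative name, and it assumes the lead byte was already
identified as a three-byte lead and both continuation bytes already
validated):

    #include <stdint.h>

    /* The accumulator lives in a local variable, so an optimizing
       compiler can keep it in a CPU register instead of re-reading
       earlier input bytes through relative indexed accesses. */
    static uint32_t decode3(const uint8_t *p) {
        uint32_t acc = p[0] & 0x0F;        /* payload bits of the lead */
        acc = (acc << 6) | (p[1] & 0x3F);  /* fold in 1st continuation */
        acc = (acc << 6) | (p[2] & 0x3F);  /* fold in 2nd continuation */
        return acc;                        /* the decoded code point   */
    }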

All modern CPUs have enough registers to store that accumulator along with
the input and output pointers, and a finite state number is not needed
when the state can be tracked by the executable instruction position: you
don't necessarily need to loop for each code unit, but can easily write
your decoder so that each loop iteration will process a full code point or
emit a single U+FFFD before adjusting the input pointer. UTF-8 and UTF-16
are simple enough that unwinding such loops to process full code points
instead of single code units is easy to implement.

That code will still remain very small (fitting fully in the instruction
cache), and it will be faster because it avoids several conditional
branches and saves one register (for the finite state number) that would
otherwise need to be slowly saved on the stack: two pointer registers (or
two access function/method addresses), two data registers, and the PC
instruction counter are enough.
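
Here is a minimal C sketch of that unrolled structure (my own
illustration, untested against any conformance suite; the function and
buffer names are assumptions). No state variable survives across
iterations: the "state" is just which branch the instruction pointer is
executing. For simplicity it resynchronizes one byte at a time, so it may
emit several U+FFFDs per ill-formed run, which is one of the two counting
policies discussed above:

    #include <stddef.h>
    #include <stdint.h>

    #define REPLACEMENT 0xFFFDu

    /* Decodes len bytes of UTF-8 from in[] into out[] (assumed large
       enough to hold len code points) and returns the number of code
       points written.  Each loop iteration emits exactly one code
       point: either a fully decoded one or U+FFFD. */
    static size_t decode_utf8(const uint8_t *in, size_t len, uint32_t *out) {
        size_t i = 0, n = 0;
        while (i < len) {
            uint8_t b = in[i];
            if (b < 0x80) {                               /* ASCII */
                out[n++] = b; i += 1;
            } else if ((b & 0xE0) == 0xC0 && b >= 0xC2    /* 2 bytes, no overlongs */
                       && i + 1 < len && (in[i+1] & 0xC0) == 0x80) {
                out[n++] = ((uint32_t)(b & 0x1F) << 6) | (in[i+1] & 0x3F);
                i += 2;
            } else if ((b & 0xF0) == 0xE0 && i + 2 < len  /* 3 bytes */
                       && (in[i+1] & 0xC0) == 0x80 && (in[i+2] & 0xC0) == 0x80) {
                uint32_t c = ((uint32_t)(b & 0x0F) << 12)
                           | ((uint32_t)(in[i+1] & 0x3F) << 6)
                           |  (uint32_t)(in[i+2] & 0x3F);
                if (c >= 0x0800 && (c < 0xD800 || c > 0xDFFF)) {
                    out[n++] = c; i += 3;                 /* no overlongs/surrogates */
                } else {
                    out[n++] = REPLACEMENT; i += 1;       /* resync at next byte */
                }
            } else if ((b & 0xF8) == 0xF0 && i + 3 < len  /* 4 bytes */
                       && (in[i+1] & 0xC0) == 0x80 && (in[i+2] & 0xC0) == 0x80
                       && (in[i+3] & 0xC0) == 0x80) {
                uint32_t c = ((uint32_t)(b & 0x07) << 18)
                           | ((uint32_t)(in[i+1] & 0x3F) << 12)
                           | ((uint32_t)(in[i+2] & 0x3F) << 6)
                           |  (uint32_t)(in[i+3] & 0x3F);
                if (c >= 0x10000 && c <= 0x10FFFF) {
                    out[n++] = c; i += 4;                 /* in Unicode range */
                } else {
                    out[n++] = REPLACEMENT; i += 1;
                }
            } else {                                      /* ill-formed byte */
                out[n++] = REPLACEMENT; i += 1;
            }
        }
        return n;
    }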