Best practices for replacing UTF-8 overlongs

Tue Dec 20 12:33:50 CST 2016

On Tue, Dec 20, 2016 at 8:59 AM, Ken Whistler <kenwhistler at att.net> wrote:

> You found the resulting text in TUS 9.0, pp. 126-129. The origin of the
> text there about best practices for using U+FFFD was the discussion and
> resolution of PRI #121 in August, 2008:
>
> http://www.unicode.org/review/pr-121.html
>

Yes. However, some of the discussion in this thread is due to details that
were not spelled out in the PRI. Option 2 there basically comes in two
variants, a 2a and a 2b, and the examples in PRI #121 work the same in both.

2a. As Richard said, "The natural logic is to read the requisite number of
continuation bytes, converting the whole to a codepoint value, and then
check that the codepoint value is allowed in UTF-8. Obviously one also has
to check that the requisite continuation bytes are present."

This naturally treats overlong sequences, surrogate-code-point sequences,
and 5/6-byte sequences (and prefixes thereof) as single errors.
(I suppose that the treatment of lead bytes above F4 could be somewhat
debatable.)

(This is what ICU does for UTF-8.)
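
To make 2a concrete, here is a minimal Python sketch of the "read the
continuation bytes, then check the code point" logic. It is only an
illustration of the approach described above, not ICU's actual code. The
function name decode_2a is made up, and treating F5..FD as lead bytes of
longer sequences (with FE/FF as single-byte errors) is an assumption, since
that part is debatable.

```python
def decode_2a(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per bad sequence (variant 2a sketch)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        i += 1
        if b < 0x80:
            out.append(chr(b))
            continue
        # Trail-byte count, initial bits, and smallest non-overlong value,
        # all derived from the lead byte alone (original 1..6-byte scheme).
        if 0xC0 <= b <= 0xDF:
            need, cp, minimum = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, minimum = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            need, cp, minimum = 3, b & 0x07, 0x10000
        elif 0xF8 <= b <= 0xFB:
            need, cp, minimum = 4, b & 0x03, 0x200000
        elif 0xFC <= b <= 0xFD:
            need, cp, minimum = 5, b & 0x01, 0x4000000
        else:
            # Stray continuation byte (80..BF) or FE/FF: one error.
            out.append("\ufffd")
            continue
        # Read as many of the requisite continuation bytes as are present.
        got = 0
        while got < need and i < n and 0x80 <= data[i] <= 0xBF:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
        # Only now check that the assembled value is allowed in UTF-8.
        ok = (
            got == need
            and minimum <= cp <= 0x10FFFF
            and not (0xD800 <= cp <= 0xDFFF)
        )
        out.append(chr(cp) if ok else "\ufffd")
    return "".join(out)
```

With this sketch, the overlong C0 80, the surrogate ED A0 80, and the
5-byte F8 88 80 80 80 each come out as a single U+FFFD.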

2b. The text in the standard represents the workings of a state machine
that walks strictly valid sequences. Overlong/surrogate/etc. sequences
become multiple errors.

(This is what ICU converters do for multi-byte charsets like Shift-JIS and
GB 18030.)
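
For comparison, a minimal Python sketch of 2b: a small state machine that
only ever advances along well-formed sequences, so the first byte that does
not fit ends the current error and is looked at again on its own. The
function name decode_2b is made up, and the byte ranges are the usual
well-formed UTF-8 ranges as I understand them from the standard; this is an
illustration, not any converter's actual code.

```python
def decode_2b(data: bytes) -> str:
    """Decode UTF-8, one U+FFFD per maximal valid prefix (variant 2b sketch)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        i += 1
        if b < 0x80:
            out.append(chr(b))
            continue
        # Trail-byte count and the allowed range for the FIRST trail byte.
        # The narrowed ranges exclude overlongs, surrogates, and > U+10FFFF.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F
        else:
            # 80..BF, C0, C1, F5..FF can never start a valid sequence.
            out.append("\ufffd")
            continue
        cp = b & (0x1F if need == 1 else 0x0F if need == 2 else 0x07)
        got = 0
        while got < need and i < n and lo <= data[i] <= hi:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
            lo, hi = 0x80, 0xBF  # later trail bytes use the full range
        # A byte that did not fit is not consumed; it is examined next.
        out.append(chr(cp) if got == need else "\ufffd")
    return "".join(out)
```

Here the overlong C0 80 becomes two U+FFFD and the surrogate ED A0 80
becomes three, while the PRI #121 example 61 F1 80 80 E1 80 C2 62 still
yields three U+FFFD between "a" and "b".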

In my opinion, 2a. "feels right" for UTF-8, because of the history and
mechanics of the encoding, and 2b. is a good fit for MBCS where concepts
like overlong sequences don't exist. (And for GB 18030 you do have to walk
a validity state machine; you can't just look at the lead byte.)

markus