Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 16:08:13 CDT 2017

2017-05-16 20:50 GMT+02:00 Shawn Steele <Shawn.Steele at microsoft.com>:

> But why change a recommendation just because it “feels like”.  As you
> said, it’s just a recommendation, so if that really annoyed someone, they
> could do something else (eg: they could use a single FFFD).
>
>
>
> If the recommendation is truly that meaningless or arbitrary, then we just
> get into silly discussions of “better” that nobody can really answer.
>
>
>
> Alternatively, how about “one or more FFFDs?” for the recommendation?
>
>
>
> To me it feels very odd to perhaps require writing extra code to detect an
> illegal case.  The “best practice” here should maybe be “one or more FFFDs,
> whatever makes your code faster”.
>

Faster is OK, provided this does not break other uses, notably random
access within strings, where UTF-8 is designed to allow searching backward
over a limited number of bytes (maximum 3) in order to find the leading
byte, and then checking its value:
- if it's not found, return to the initial position and make the next
access return U+FFFD to signal the error at that position: this trailing
byte is part of an ill-formed sequence, and for coherence, any further
trailing bytes found after it will **also** return U+FFFD (because these
other trailing bytes may also be reached by random access).
- if the leading byte is found backward but does not match the expected
number of trailing bytes after it, return to the initial random
position, where you'll also return U+FFFD. This means that the initial
leading byte (part of the ill-formed sequence) must also return a separate
U+FFFD, given that each following trailing byte will return U+FFFD
individually when accessed.
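
The two cases above can be sketched as follows. This is only an
illustration in Python, not code from any existing library: the names
code_point_at, _expected_length, _is_trail and _decode_at are hypothetical,
and well-formedness is delegated to Python's strict UTF-8 decoder.

```python
REPLACEMENT = 0xFFFD


def _expected_length(lead):
    """Sequence length announced by a leading byte, or 0 if it cannot lead."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _is_trail(byte):
    return 0x80 <= byte <= 0xBF


def _decode_at(buf, start):
    """Return (code_point, length) if a well-formed multi-byte sequence
    starts at `start`, else None."""
    n = _expected_length(buf[start])
    if n == 0 or start + n > len(buf):
        return None
    try:
        return ord(buf[start:start + n].decode("utf-8")), n
    except UnicodeDecodeError:
        return None


def code_point_at(buf, i):
    """Random access: the code point covering byte offset i, or U+FFFD."""
    if buf[i] < 0x80:
        return buf[i]
    # Step backward over at most 3 trailing bytes to look for a leading byte.
    start = i
    while _is_trail(buf[start]) and start > 0 and i - start < 3:
        start -= 1
    decoded = _decode_at(buf, start)
    # Accept only a well-formed sequence that really covers offset i.
    if decoded is not None and start + decoded[1] > i:
        return decoded[0]
    return REPLACEMENT
```

With this sketch, every offset inside a well-formed sequence returns the
same code point, while every byte of an ill-formed sequence (leading or
trailing) returns its own U+FFFD, whichever offset is accessed first.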

If we want coherent decoding with text handling primitives allowing random
access within encoded sequences, there's no other choice than treating EACH
byte of the ill-formed sequence as an individual error mapped to the
same replacement code point (U+FFFD if that is what is chosen, but these
APIs could as well specify another replacement character, or could
possibly return a non-code point if the API return value is not restricted
to valid code points: for example, the replacement could be a negative
value whose absolute value matches the invalid code unit, or some other
invalid code unit outside the valid range for code points with scalar
values; isolated surrogates in UTF-16, for example, could be returned as is,
or made negative either by returning their opposite or by setting (or'ing)
the most significant bit of the return value).
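
As an illustration of the "negative value" option, here is a minimal
sequential decoder in Python. The function names are hypothetical and the
choice of returning the negated byte is just one of the conventions
mentioned above; validity is again delegated to Python's strict decoder.

```python
def _announced_length(lead):
    """Length announced by the first byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_with_markers(data):
    """Return a list of ints: valid code points are >= 0, and each byte of
    an ill-formed sequence is reported individually as its negated value."""
    result = []
    i = 0
    while i < len(data):
        n = _announced_length(data[i])
        chunk = data[i:i + n]
        code_point = None
        if n > 0 and len(chunk) == n:
            try:
                code_point = ord(chunk.decode("utf-8"))
            except UnicodeDecodeError:
                code_point = None
        if code_point is None:
            # One marker per offending byte, so random access stays coherent.
            result.append(-data[i])
            i += 1
        else:
            result.append(code_point)
            i += n
    return result
```

For example, decode_with_markers(b"A\xe2\x82Z") gives
[0x41, -0xE2, -0x82, 0x5A]: the caller can still recover the original
bytes, which a plain U+FFFD would lose.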

The problem will arise when you need to store the replacement values if the
internal backing store is limited to 16-bit code units or 8-bit code units:
this internal backing store may use its own internal extension of standard
UTFs, including the possibility of encoding NULLs as C0,80 (like what Java
does with its "modified UTF-8" internal encoding used in its compiled binary
classes and serializations), or internally using isolated trailing
surrogates to store ill-formed UTF-8 input by or'ing these bytes with 0xDC00;
each will be returned as a code point with no valid scalar value. For
internally representing ill-formed UTF-16 sequences, there's no need to
change anything. For internally representing ill-formed UTF-32 sequences
(in fact limited to one 32-bit code unit), with a 16-bit internal backing
store you may need to store three 16-bit values (three isolated trailing
surrogates). For internally representing ill-formed UTF-32 in an 8-bit
backing store, you could use 0xC1 followed by five trailing bytes (each
one storing 7 bits of the initial ill-formed code unit from the UTF-32
input).
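
The "or with 0xDC00" scheme for ill-formed UTF-8 is what Python's
"surrogateescape" error handler (PEP 383) does, so it can be tried directly.
The sketch below uses it to fill a list of 16-bit code units and to get the
original bytes back; the helper names to_units and from_units are
hypothetical.

```python
import struct


def to_units(raw):
    """Decode UTF-8 into 16-bit code units; each ill-formed byte b is kept
    as the isolated trailing surrogate 0xDC00 | b."""
    text = raw.decode("utf-8", errors="surrogateescape")
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    return list(struct.unpack("<%dH" % (len(utf16) // 2), utf16))


def from_units(units):
    """Inverse of to_units: rebuild the original byte string."""
    utf16 = struct.pack("<%dH" % len(units), *units)
    text = utf16.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")
```

Here to_units(b"A\xe2\x82Z") yields [0x41, 0xDCE2, 0xDC82, 0x5A], and
from_units() of that list returns the original four bytes unchanged.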

What you do in the internal backing store will not be exposed by your
API, which will just return either valid code points with valid scalar
values, or values outside the two valid subranges (so these could possibly
be negative values, or isolated trailing surrogates). That backing store can
also substitute some valid input that causes problems (such as NULLs) using
0xC0 plus another byte, that sequence being unexposed by your API, which
will still be able to return the expected code points (with the minor
caveat that the total number of returned code points will not match the
actual size allocated for the internal backing store; applications
using that API won't even need to know how it is internally represented).
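
A minimal sketch of such a hidden substitution, in Python, assuming only
the NUL-as-C0,80 convention (Java's modified UTF-8 also changes how
supplementary characters are stored, which is not reproduced here). The
class name NulFreeText is hypothetical.

```python
class NulFreeText:
    """Stores text as UTF-8 in which U+0000 is written as C0 80, so the
    backing store never contains a zero byte; callers only see code points."""

    def __init__(self, text):
        # A 0x00 byte in well-formed UTF-8 can only be U+0000.
        self._store = text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

    def code_points(self):
        # C0 80 never occurs in well-formed UTF-8, so undoing it is safe.
        raw = self._store.replace(b"\xc0\x80", b"\x00")
        return [ord(ch) for ch in raw.decode("utf-8")]

    def stored_size(self):
        return len(self._store)
```

For NulFreeText("a\x00b"), code_points() returns [0x61, 0x00, 0x62] while
stored_size() is 4: the count of code points and the size of the backing
store differ, and the caller never sees the C0 80 pair.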

In other words: any private extensions are possible internally, but it is
possible to isolate them within a black-box API, which will still be able to
choose how to represent the input text (it may as well use a zlib-compressed
backing store, or some stateless Huffman compression based on a static
statistics table configured and stored elsewhere, initialized when you first
instantiate your API).