Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Philippe Verdy via Unicode unicode at unicode.org
Mon May 15 18:20:40 CDT 2017


Software designed for UCS-2 only, without real UTF-16 support, is still in
use today.

For example, MySQL with its broken "UTF-8" encoding, which in fact encodes
each supplementary character as two separate 16-bit surrogate code units,
each of them blindly encoded as a 3-byte sequence that is ill-formed in
standard UTF-8. That encoding also fails to distinguish invalid pairs of
surrogates, and offers no collation support for supplementary characters.
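
To make the difference concrete, here is a minimal Python sketch (my
illustration, not MySQL code) contrasting well-formed UTF-8 with the
surrogate-pair byte sequence described above, for one supplementary
character:

    ch = "\U0001F600"              # U+1F600, a supplementary character

    print(ch.encode("utf-8").hex(" "))
    # -> f0 9f 98 80   (one well-formed 4-byte sequence)

    # Encode each UTF-16 surrogate of the pair as its own 3-byte
    # sequence, as the broken "UTF-8" storage does:
    cesu8 = b"".join(chr(u).encode("utf-8", "surrogatepass")
                     for u in (0xD83D, 0xDE00))
    print(cesu8.hex(" "))
    # -> ed a0 bd ed b8 80   (two 3-byte sequences, ill-formed UTF-8)

    # A conformant UTF-8 decoder must reject those six bytes:
    try:
        cesu8.decode("utf-8")
    except UnicodeDecodeError as e:
        print("ill-formed in standard UTF-8:", e.reason)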

In this case some other software will break silently on these sequences.
For example, MediaWiki, when installed with a MySQL backend server whose
datastore was created with this broken "UTF-8", will silently discard all
text starting at the first supplementary character found in the wikitext.
This is not a defect in MediaWiki itself: MediaWiki simply does NOT support
a MySQL server installed with its "UTF-8" datastore, and supports MySQL
only if the storage encoding declared for the database is "binary". In that
case there is no collation support in MySQL; texts are just arbitrary
sequences of bytes, and internationalization is handled in the client
software, here MediaWiki with its PHP, ICU, or Lua libraries, and other
tools written in Perl and other languages.

Note that this does not affect the Wikimedia wikis, because they were
initially installed correctly with the binary encoding in MySQL, and
Wikimedia wikis now use another database engine with native UTF-8 support
and full coverage of the UCS. Other wikis using MediaWiki will need to
upgrade their MySQL version if they want to keep it for administrative
reasons (rather than convert their datastore back to the binary encoding).

Software running with only UCS-2 is exposed to risks similar to the one
seen in MediaWiki on incorrect MySQL installations: any user may edit a
page to insert a supplementary character (a supplementary sinogram, an
emoji, a Gothic letter, a supplementary symbol...) which looks correct in
the preview, is parsed correctly, and is accepted silently by MySQL, but is
then silently truncated because of the encoding error: when the data is
reloaded from MySQL, part of it has unexpectedly been discarded.

How should we react to these risks of data loss or truncation? Throwing an
exception, or just returning an error, is in fact more dangerous than
replacing the ill-formed sequences by one or more U+FFFD: that way we
preserve as much as possible. In any case, software should be able to run
some tests against its datastore to check that the encoding is handled
correctly. This could be done when starting the software, emitting log
messages when the backend does not support the encoding: all that is needed
is to send a single supplementary character to the remote datastore, into a
junk table or field, and then retrieve it immediately in another
transaction to make sure it is preserved (see the sketch below).

Similar tests can check whether the remote datastore preserves the encoding
form, or "normalizes" it, or alters it otherwise (such alteration could
happen with a leading BOM, and other silent alterations could be made to
NULs and trailing spaces if the datastore uses fixed-length rather than
variable-length text fields). Similar tests could check the maximum
accepted length: a VARCHAR(256) on a binary-encoded database will not
always store 256 Unicode characters, but in a database encoded with
non-broken UTF-8 it should store 256 code points independently of their
values, even if their UTF-8 encoding takes up to 1024 bytes.
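
Here is a minimal sketch of such a startup self-test, written in Python
against any PEP 249 (DB-API 2.0) connection. The table name
"encoding_selftest" is an illustrative assumption, and the "?" placeholder
assumes a qmark-paramstyle driver such as sqlite3 (MySQL drivers typically
use %s instead):

    import logging

    SUPPLEMENTARY = "\U0001F600"   # any supplementary character will do

    def datastore_handles_supplementary(conn):
        """Round-trip one supplementary character through the datastore."""
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS encoding_selftest (v TEXT)")
        cur.execute("DELETE FROM encoding_selftest")
        cur.execute("INSERT INTO encoding_selftest (v) VALUES (?)",
                    (SUPPLEMENTARY,))
        conn.commit()              # read back only committed data
        cur.execute("SELECT v FROM encoding_selftest")
        (value,) = cur.fetchone()
        if value != SUPPLEMENTARY:
            logging.error("backend mangled U+%04X: got %r",
                          ord(SUPPLEMENTARY), value)
            return False
        return True

The same harness can exercise the maximum-length question: round-trip a
string of 256 supplementary characters (1024 bytes in UTF-8) through a
VARCHAR(256) column and verify that nothing was truncated. With sqlite3 the
test above should pass; on a MySQL datastore declared with the broken
"UTF-8" encoding, the value read back would be expected to come back
mangled or truncated, and the test logs the failure.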


2017-05-16 0:43 GMT+02:00 Richard Wordingham via Unicode <
unicode at unicode.org>:

> On Mon, 15 May 2017 21:38:26 +0000
> David Starner via Unicode <unicode at unicode.org> wrote:
>
> > > and the fact is that handling surrogates (which is what proponents
> > > of UTF-8 or UCS-4 usually focus on) is no more complicated than
> > > handling combining characters, which you have to do anyway.
>
> > Not necessarily; you can legally process Unicode text without worrying
> > about combining characters, whereas you cannot process UTF-16 without
> > handling surrogates.
>
> The problem with surrogates is inadequate testing.  They're sufficiently
> rare for many users that it may be a long time before an error is
> discovered.  It's not always obvious that code is designed for UCS-2
> rather than UTF-16.
>
> Richard.
>