Why Work at Encoding Level?

Mon Oct 19 13:53:03 CDT 2015

On Mon, 19 Oct 2015 10:07:31 -0700
"Doug Ewell" <doug at ewellic.org> wrote:

> This discussion was originally about how to handle unpaired
> surrogates, as if that were a normal use case.

And the subject line was changed when the topic changed to traversing
strings.

> Regardless of what encoding model is used to handle characters under
> the hood, and regardless of how the Delete key should work with actual
> characters or clusters, there is never any excuse for software to
> create unpaired surrogates, or any other sort of invalid code unit
> sequences.

How about, 'The specification says that one must pass the number of
_characters_ in the string.'?  Even worse, some specifications talk of
'Unicode characters' when they mean UTF-16 code units.  The word
'codepoint' is even worse, as in UTF-16 a supplementary plane codepoint
is represented by two BMP codepoints (a surrogate pair).
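
To make the ambiguity concrete, here is a minimal sketch of how the
'length' of one string depends on what is being counted.  Python is
used purely for illustration, and the function name lengths is
hypothetical rather than taken from any specification.

```python
def lengths(s):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for a str."""
    code_points = len(s)                           # Python str is indexed by code point
    utf16_units = len(s.encode("utf-16-le")) // 2  # two bytes per 16-bit code unit
    utf8_bytes = len(s.encode("utf-8"))
    return code_points, utf16_units, utf8_bytes


# 'a' followed by U+10400, a supplementary plane character.
print(lengths("a\U00010400"))  # prints (2, 3, 5)
```

A specification that asks for 'the number of characters' could mean any
of these three numbers, and they only agree for ASCII text.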

ICU (but perhaps it's actually Java) seems to have a culture of
tolerating lone surrogates, and rules for handling lone surrogates are
strewn across the Unicode standards and annexes.  It was once the
case that basic Unicode support in regular expressions required a
regular expression engine to be able to search for specified lone
surrogates - a real show-stopper for an engine working in UTF-8.
The Unicode collation algorithm conformance test once tested that
implementations of collation collated lone surrogates correctly.
Raising an exception was an automatic test failure!  By contrast,
no-one's proposed collation rules for broken bits of UTF-8 characters
or non-minimal length forms.
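
As a rough illustration of the asymmetry, the sketch below (Python
again; both function names are hypothetical) locates lone surrogates in
a sequence of UTF-16 code units, and shows that a strict UTF-8 encoder
has no way to represent one at all.

```python
def find_lone_surrogates(units):
    """Return the indices of unpaired surrogates in UTF-16 code units."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (lead) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2             # well-formed pair, skip both units
                continue
            lone.append(i)         # lead surrogate with no trail
        elif 0xDC00 <= u <= 0xDFFF:
            lone.append(i)         # trail surrogate with no lead
        i += 1
    return lone


def strict_utf8(s):
    """Encode s as UTF-8, or return None if it contains a surrogate."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


print(find_lone_surrogates([0xD801, 0x0061, 0xDC00]))  # prints [0, 2]
print(strict_utf8("\ud801"))                           # prints None
```

An engine whose internal form is strict UTF-8 therefore cannot even hold
the pattern it was once required to search for.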

> That is like having an image editor that deletes every
> 128th byte from a JPEG file, and then worrying about how to display
> the file.

1. Of course, telemetry streams may very well contain damaged JPEG
images! 

2. The problem with the bad handling of supplementary characters that
seems to be associated with UTF-16 is that the damage is rarely as
obvious as every 128th code unit.  By contrast, bad UTF-8 handling
usually comes to light as soon as the text processing moves beyond
ASCII.
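
The difference in visibility can be sketched with naive truncation at a
code unit boundary.  This is only an illustration in Python with
hypothetical function names; it checks whether cutting a string after n
code units leaves well-formed text.

```python
def cut_is_clean_utf16(s, n):
    """True if keeping the first n UTF-16 code units is still well-formed."""
    data = s.encode("utf-16-le")[:2 * n]
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


def cut_is_clean_utf8(s, n):
    """True if keeping the first n UTF-8 bytes is still well-formed."""
    data = s.encode("utf-8")[:n]
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Accented Latin text: the UTF-8 bug shows at once, the UTF-16 bug never does.
print(cut_is_clean_utf8("na\u00efve", 3))    # prints False
print(cut_is_clean_utf16("na\u00efve", 3))   # prints True
# Only a supplementary character exposes the UTF-16 bug.
print(cut_is_clean_utf16("a\U00010400", 2))  # prints False
```

Test data drawn from the BMP alone will never exercise the failing
UTF-16 case, which is why such bugs tend to survive for so long.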

Richard.