Unpaired surrogates (was: Re: Why Work at Encoding Level?)

Mon Oct 19 16:29:29 CDT 2015

On Mon, Oct 19, 2015 at 1:32 PM, Doug Ewell <doug at ewellic.org> wrote:

> > ICU (but perhaps it's actually Java) seems to have a culture of
> > tolerating lone surrogates, and rules for handling lone surrogates are
> > strewn across the Unicode standards and annexes.
>
> I suspect you have an example.

I have examples from ICU processing of 16-bit Unicode strings (which are
not usually required to be well-formed UTF-16 strings):

- "Count code points" counts an unpaired surrogate as 1.
- "Move forward/backward by n code points" counts an unpaired surrogate as
1.
- "Lower-/title-/upper-case the string" passes through an unpaired
surrogate as-is like any code point that does not have case mappings.
- "Get property x of code point y" returns the property value according to
the UCD; for example, gc(surrogate)=Cs.
- Collating a string that contains an unpaired surrogate: ICU currently
uses the second approach from UCA section 7.1.1
<http://www.unicode.org/reports/tr10/#Handling_Illformed>.

See http://userguide.icu-project.org/strings#TOC-ICU:-16-bit-Unicode-strings
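
To make the first few items concrete, here is a minimal sketch in Python rather than ICU itself. The helper name count_code_points is hypothetical and only illustrates the counting rule on a list of 16-bit code units; the last two lines use Python's own str and unicodedata behavior, which happens to match what is described above, and are not ICU calls.

```python
import unicodedata

def count_code_points(units):
    """Count code points in a sequence of 16-bit code units.

    A lead surrogate followed by a trail surrogate counts as one
    supplementary code point; any other unit, including an unpaired
    surrogate, counts as one code point by itself.
    """
    count = 0
    i = 0
    while i < len(units):
        is_lead = 0xD800 <= units[i] <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_lead and has_trail) else 1
        count += 1
    return count

paired = [0x0061, 0xD83D, 0xDE00, 0x0062]  # a, U+1F600 as a pair, b
lone = [0x0061, 0xD83D, 0x0062]            # a, unpaired lead surrogate, b

count_paired = count_code_points(paired)   # the pair is one code point
count_lone = count_code_points(lone)       # the lone surrogate is one code point

# Python analogues of "get property" and "upper-case the string".
lone_category = unicodedata.category("\ud800")  # general category of a surrogate
lone_upper = "a\ud800z".upper()                 # surrogate passes through unchanged
```

Both inputs count as three code points, and the surrogate keeps its UCD general category and survives case mapping untouched.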

However, conversions such as "convert from UTF-16 to UTF-8" treat an
unpaired surrogate as an error.
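
The contrast with conversion can be sketched the same way. This again uses Python's built-in codecs as a stand-in, not ICU's converter API, and utf16_units_to_utf8 is a hypothetical helper name.

```python
def utf16_units_to_utf8(units):
    """Convert 16-bit code units to UTF-8, rejecting unpaired surrogates."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    # With the default strict error handler, the UTF-16 decoder raises
    # UnicodeDecodeError on an unpaired surrogate rather than passing it on.
    return raw.decode("utf-16-le").encode("utf-8")

# Well-formed input: a, then U+1F600 as a surrogate pair.
well_formed = utf16_units_to_utf8([0x0061, 0xD83D, 0xDE00])

# Ill-formed input: the lead surrogate has no trail surrogate after it.
try:
    utf16_units_to_utf8([0x0061, 0xD83D, 0x0062])
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The same 16-bit sequence that string operations tolerate is refused as soon as it has to become well-formed UTF-8.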

> > The Unicode collation algorithm conformance test once tested that
> > implementations of collation collated lone surrogates correctly.
> > Raising an exception was an automatic test failure! By contrast,
> > no-one's proposed collation rules for broken bits of UTF-8 characters
> > or non-minimal length forms.
>
> Are these tests still included, or did someone notice that they were in
> conflict with the standard and remove them?
>

We updated http://www.unicode.org/Public/UCA/latest/CollationTest.html to
say:

"These files contain test cases that include ill-formed strings, with
surrogate code points. Implementations that do not weight surrogate code
points the same way as reserved code points may filter out such lines
in the test cases, before testing for conformance."

Best regards,
markus