Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

Philippe Verdy verdy_p at wanadoo.fr
Sat May 9 00:56:52 CDT 2015


Note: in the first part of my sentence I wrote "16-bit string", NOT
"Unicode 16-bit string". I used the latter only in the later part of the
sentence (where the same restrictions also apply to 8-bit and 32-bit
"Unicode strings"). So there is no contradiction.
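
To make the distinction at issue concrete, here is a minimal Python sketch
of the check being argued about: whether a sequence of arbitrary 16-bit code
units has every surrogate paired. The function name and the sample values
are hypothetical, chosen only for illustration. The sketch does not settle
which definition in the standard carries this requirement; it only shows
what the requirement tests.

```python
from typing import Sequence


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """Return True if every surrogate code unit belongs to a lead/trail pair.

    `units` is an arbitrary sequence of 16-bit code units (hypothetical
    helper, not an API from any library).
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError(f"not a 16-bit code unit: {u!r}")
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: the next unit must be a trail surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no preceding lead surrogate.
            return False
        i += 1
    return True


# 'A' followed by a properly paired lead and trail surrogate.
print(is_well_formed_utf16([0x0041, 0xD83D, 0xDE00]))  # True
# A lone lead surrogate between two ordinary code units.
print(is_well_formed_utf16([0x0041, 0xD800, 0x0042]))  # False
# Noncharacters (U+FFFF, U+FDD0) are not surrogates, so they pass.
print(is_well_formed_utf16([0xFFFF, 0xFDD0]))          # True
```

Note that the noncharacters pass the check: pairing of surrogates and
exclusion of noncharacters are separate questions.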


2015-05-09 7:55 GMT+02:00 Philippe Verdy <verdy_p at wanadoo.fr>:

>
>
> 2015-05-09 6:37 GMT+02:00 Markus Scherer <markus.icu at gmail.com>:
>
>> On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy <verdy_p at wanadoo.fr>
>> wrote:
>>
>>> 2015-05-09 5:13 GMT+02:00 Richard Wordingham <
>>> richard.wordingham at ntlworld.com>:
>>>
>>>> I can't think of a practical use for the specific concepts of Unicode
>>>> 8-bit, 16-bit and 32-bit strings.  Unicode 16-bit strings are
>>>> essentially the same as 16-bit strings, and Unicode 32-bit strings are
>>>> UTF-32 strings.   'Unicode 8-bit string' strikes me as an exercise in
>>>> pedantry; there are more useful categories of 8-bit strings that are
>>>> not UTF-8 strings.
>>>>
>>>
>>> And here you're wrong: a 16-bit string is just a sequence of arbitrary
>>> 16-bit code units, but a Unicode string (whatever the size of its code
>>> units) adds restrictions for validity. In fact the only restriction is
>>> that surrogates, when present in 16-bit strings (i.e. UTF-16), must be
>>> paired; in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are
>>> forbidden.
>>>
>>
>> No, Richard had it right. See for example definition D82 "Unicode 16-bit
>> string" in the standard. (Section 3.9 Unicode Encoding Forms,
>> http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf)
>>
>
> I was right: D82 refers to "UTF-16", which implies the restriction of
> validity, i.e. NO isolated/unpaired surrogates (but no exclusion of
> non-characters).
>
> I was right; you and Richard were wrong.
>
>