Unicode String Models

Wed Oct 3 09:15:55 CDT 2018

On 3 October 2018 at 15:41:42, Mark Davis ☕️ via Unicode (unicode at unicode.org) wrote:

> Let me clear that up; I meant that "the underlying storage never contains
> something that would need to be represented as a surrogate code point." Of
> course, UTF-16 does need surrogate code units. What #1 would be excluding
> in the case of UTF-16 would be unpaired surrogates. That is, suppose the
> underlying storage is UTF-16 code units that don't satisfy #1.
>  
> 0061 D83D DC7D 0061 D83D
>  
> A code point API would return for those a sequence of 4 values, the last of
> which would be a surrogate code point.
>  
> 00000061, 0001F47D, 00000061, 0000D83D
>  
> A scalar value API would return for those also 4 values, but since we
> aren't in #1, it would need to remap.
>  
> 00000061, 0001F47D, 00000061, 0000FFFD
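
To make the two views concrete, here is a minimal Python sketch of the example above. The function names `code_points` and `scalar_values` are made up for illustration and do not come from any particular library; the storage is assumed to be a plain list of UTF-16 code units.

```python
def code_points(units):
    """Code point view: an unpaired surrogate comes back as a surrogate code point."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Well-formed pair: combine into one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit or unpaired surrogate: passed through unchanged.
            out.append(u)
            i += 1
    return out


def scalar_values(units):
    """Scalar value view: surrogate code points are remapped to U+FFFD."""
    return [0xFFFD if 0xD800 <= cp <= 0xDFFF else cp for cp in code_points(units)]


units = [0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D]
print(["%08X" % cp for cp in code_points(units)])
print(["%08X" % sv for sv in scalar_values(units)])
```

Both calls return four values; they differ only in the last one, which is where the storage fails to satisfy #1.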

OK, understood. But I think that if you go to the length of providing a scalar-value API, you would also prevent the construction of strings that have such anomalies in the first place (e.g. by raising an error in the constructor if it is given malformed UTF-X data), i.e. you would maintain #1. From a programmer's perspective I really don't get anything from #2 except confusion.
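
As a rough sketch of what I mean by erroring in the constructor: the `Utf16String` class below is hypothetical, and it simply leans on the strict mode of Python's built-in `utf-16-le` codec, which rejects unpaired surrogates. It is one possible way to maintain #1, not a reference implementation.

```python
import struct


class Utf16String:
    """Hypothetical string type that refuses malformed UTF-16 at construction."""

    def __init__(self, units):
        # Serialize the 16-bit code units as little-endian bytes.
        raw = struct.pack("<%dH" % len(units), *units)
        # Strict decoding raises UnicodeDecodeError on unpaired surrogates,
        # so an instance can only ever hold well-formed text.
        self._text = raw.decode("utf-16-le")

    def scalar_values(self):
        # No remapping to U+FFFD is needed here: the constructor already
        # guaranteed that every element is a scalar value.
        return [ord(c) for c in self._text]


ok = Utf16String([0x0061, 0xD83D, 0xDC7D, 0x0061])
print(["%08X" % sv for sv in ok.scalar_values()])

try:
    Utf16String([0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D])
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

With the check done once at construction, the scalar-value accessor has no error case left to handle.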

> If it is a real datatype, with strong guarantees that it *never* contains
> values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
> from number will require checking. And in my experience, without a strong
> guarantee the datatype is in practice pretty useless.

Sure. My point was that the places where you need to perform this check are few in practice: mainly at the IO boundary of your program, where you actually have to deal with encodings, and additionally wherever you define scalar value constants (a check that your compiler could perform if your language provides a literal notation for values of this type).
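
A small sketch of where those checks end up, again with made-up names (`ScalarValue`, `decode_utf8`, `SNOWMAN`). Python has no compile-time literal checking, so the constant is validated when the module is loaded rather than by a compiler; a language with a literal notation could do better.

```python
class ScalarValue:
    """Hypothetical type that only ever holds a Unicode scalar value."""

    __slots__ = ("_value",)

    def __init__(self, value):
        # The one range check: [0x0000..0xD7FF] or [0xE000..0x10FFFF].
        if not isinstance(value, int) or not (
            0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF
        ):
            raise ValueError("not a Unicode scalar value: %r" % (value,))
        self._value = value

    def __int__(self):
        return self._value


def decode_utf8(data):
    """IO boundary: strict decoding rejects malformed input up front."""
    # bytes.decode raises UnicodeDecodeError (a ValueError) on bad UTF-8,
    # including byte sequences that would encode surrogates.
    return [ScalarValue(ord(c)) for c in data.decode("utf-8")]


# A scalar value constant: checked once, when the module is loaded.
SNOWMAN = ScalarValue(0x2603)

values = decode_utf8(b"a\xf0\x9f\x91\xbd")
print(["%08X" % int(v) for v in values])
```

Code between the boundary and the constants only passes `ScalarValue` instances around and never has to repeat the check.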

Best, 

Daniel