Unicode String Models

Daniel Bünzli via Unicode unicode at unicode.org
Wed Oct 3 08:01:15 CDT 2018


On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (unicode at unicode.org) wrote:

> There are two main choices for a scalar-value API:
>  
> 1. Guarantee that the storage never contains surrogates. This is the
> simplest model.
> 2. Substitute U+FFFD for surrogates when the API returns code
> points. This can be done where #1 is not feasible, such as where the API is
> a shim on top of a (perhaps large) IO buffer of 16-bit code units
> that are not guaranteed to be UTF-16. The cost is extra tests on every code
> point access.

I'm not sure 2. really makes sense in practice: it would mean you can't access scalar values that need surrogates to be encoded.

Also, regarding 1., you can always define an API that has this property regardless of the actual storage; it's only that your indexing operations might be costly, as they do not map directly to the underlying storage array.
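
To make the indexing cost concrete, here is a minimal sketch of a scalar-value view over 16-bit storage. The names is_high, is_low, decode_at and nth_scalar are hypothetical, the storage is assumed to be an int array holding values in 0..0xFFFF, and replacing a lone surrogate by U+FFFD is just one policy chosen for the example. Uchar.rep requires OCaml 4.06 or later.

```ocaml
(* Hypothetical storage: an array of 16-bit code units, each assumed to be
   in the range 0..0xFFFF and not guaranteed to be well-formed UTF-16. *)
let is_high u = 0xD800 <= u && u <= 0xDBFF
let is_low u = 0xDC00 <= u && u <= 0xDFFF

(* Decode the scalar value starting at code unit index [i]. Returns the
   scalar value and the number of code units consumed. A well-formed
   surrogate pair yields a supplementary scalar value; a lone surrogate
   is replaced by U+FFFD (an assumption of this sketch). *)
let decode_at (units : int array) (i : int) : Uchar.t * int =
  let u = units.(i) in
  if is_high u && i + 1 < Array.length units && is_low units.(i + 1) then
    let cp = 0x10000 + ((u - 0xD800) lsl 10) + (units.(i + 1) - 0xDC00) in
    (Uchar.of_int cp, 2)
  else if is_high u || is_low u then (Uchar.rep, 1)
  else (Uchar.of_int u, 1)

(* [nth_scalar units n] is the [n]th scalar value (0-based), if any.
   The scan is linear: scalar indexes do not map to storage indexes. *)
let nth_scalar (units : int array) (n : int) : Uchar.t option =
  let len = Array.length units in
  let rec go i k =
    if i >= len then None
    else
      let (c, used) = decode_at units i in
      if k = 0 then Some c else go (i + used) (k - 1)
  in
  if n < 0 then None else go 0 n
```

Callers of nth_scalar never see a surrogate, whatever the buffer contains, but each lookup walks the buffer from the start.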

That being said, I don't think direct indexing/iterating for Unicode text is such an interesting operation, due of course to the normalization/segmentation issues. Basically, if your API provides them, I only see these indexes as useful ways to define substrings. APIs that identify/iterate boundaries (and thus substrings) are more interesting due to the nature of Unicode text.

> If the programming language provides for such a primitive datatype, that is
> possible. That would mean at a minimum that casting/converting to that
> datatype from other numerical datatypes would require bounds-checking and
> throwing an exception for values outside of [0x0000..0xD7FF
> 0xE000..0x10FFFF]. 

Yes. But note that in practice, if you are in 1. above, you usually perform this check only at the point of decoding, where you are already performing a lot of other checks. Once done, you no longer need to check anything as long as the operations you perform on the values preserve the invariant. Also, converting back to an integer if you need one is a no-op: it's the identity function.
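
As a rough illustration of checking once at the boundary, the sketch below uses the standard Uchar.of_int (which raises Invalid_argument on a non-scalar integer) and Uchar.to_int. The function names scalars_of_ints and count_non_ascii are made up for the example, and the int array input stands in for whatever a real decoder would produce.

```ocaml
(* Validate once, at the decoding boundary: either every integer is a
   scalar value and we get a Uchar.t array, or we get None. *)
let scalars_of_ints (raw : int array) : Uchar.t array option =
  try Some (Array.map Uchar.of_int raw) with Invalid_argument _ -> None

(* Downstream code takes Uchar.t values and performs no range checks;
   going back to an integer with Uchar.to_int costs nothing at runtime. *)
let count_non_ascii (us : Uchar.t array) : int =
  Array.fold_left
    (fun n u -> if Uchar.to_int u > 0x7F then n + 1 else n)
    0 us
```

Here count_non_ascii cannot be handed a surrogate, since the only route to a Uchar.t array in this sketch goes through the checked conversion.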

The OCaml Uchar module does this. This is the interface: 

  https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli

which defines the type t as abstract and here is the implementation: 

  https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml

which defines type t = int, meaning that values of this type are *unboxed* OCaml integers (and will be stored as such in, say, an OCaml array). However, since the module system enforces type abstraction, the only way of creating such values is to use the constants or the constructors (e.g. of_int), which all maintain the scalar value invariant (if you disregard the unsafe_* functions).
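
For readers who do not want to follow the links, the pattern can be sketched in a few lines. This is a simplified illustration under a hypothetical module name Scalar, not the actual stdlib source; the real Uchar interface has more functions.

```ocaml
(* Hypothetical module illustrating the pattern; not the stdlib code. *)
module Scalar : sig
  type t                       (* abstract: clients cannot see it is an int *)
  val is_valid : int -> bool   (* true iff the int is a Unicode scalar value *)
  val of_int : int -> t        (* raises Invalid_argument otherwise *)
  val to_int : t -> int        (* the identity function at runtime *)
end = struct
  type t = int                 (* unboxed representation *)

  let is_valid i =
    (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)

  let of_int i =
    if is_valid i then i
    else invalid_arg (Printf.sprintf "U+%04X is not a Unicode scalar value" i)

  let to_int u = u
end
```

Outside the module, an expression such as (0xD800 : Scalar.t) is rejected by the type checker, so every value of type Scalar.t has passed through of_int.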

Note that it would be perfectly possible to adopt a similar approach in C via a typedef, though given C's rather loose type system a little more discipline would be required from the programmer (always go through the constructor functions to create values of the type).

Best, 

Daniel




