Unicode String Models

Mark Davis ☕️ via Unicode unicode at unicode.org
Wed Oct 3 08:41:42 CDT 2018


On Wed, Oct 3, 2018 at 3:01 PM Daniel Bünzli <daniel.buenzli at erratique.ch>
wrote:

> On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (
> unicode at unicode.org) wrote:
> > There are two main choices for a scalar-value API:
> >
> > 1. Guarantee that the storage never contains surrogates. This is the
> > simplest model.
> > 2. Substitute U+FFFD for surrogates when the API returns code
> > points. This can be done where #1 is not feasible, such as where the API
> > is a shim on top of a (perhaps large) IO buffer of 16-bit code units
> > that are not guaranteed to be UTF-16. The cost is extra tests on every
> > code point access.
> I'm not sure 2. really makes sense in practice: it would mean you can't
> access scalar values that need surrogates to be encoded.

Let me clear that up; I meant that "the underlying storage never contains
something that would need to be represented as a surrogate code point." Of
course, UTF-16 does need surrogate code units. What #1 would be excluding
in the case of UTF-16 would be unpaired surrogates. That is, suppose the
underlying storage is UTF-16 code units that don't satisfy #1.

0061 D83D DC7D 0061 D83D

A code point API would return for those a sequence of 4 values, the last of
which would be a surrogate code point.

00000061, 0001F47D, 00000061, 0000D83D

A scalar value API would return for those also 4 values, but since we
aren't in #1, it would need to remap.

00000061, 0001F47D, 00000061, 0000FFFD
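
To make the difference concrete, here is a minimal Python sketch of the two
APIs over the same buffer. The function names code_points and scalar_values
are made up for illustration and are not taken from any particular library;
the storage is assumed to be a plain list of 16-bit code units.

```python
def code_points(units):
    """Yield code points from 16-bit code units.

    Well-formed surrogate pairs are combined; an unpaired surrogate is
    passed through unchanged as a surrogate code point.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        if is_lead and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine lead and trail surrogate into a supplementary code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


def scalar_values(units):
    """Yield scalar values: same as code_points, but remap surrogates to U+FFFD.

    This is the extra test on every code point access mentioned in choice #2.
    """
    for cp in code_points(units):
        yield 0xFFFD if 0xD800 <= cp <= 0xDFFF else cp


units = [0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D]
print(", ".join(f"{cp:08X}" for cp in code_points(units)))
# 00000061, 0001F47D, 00000061, 0000D83D
print(", ".join(f"{sv:08X}" for sv in scalar_values(units)))
# 00000061, 0001F47D, 00000061, 0000FFFD
```

The paired D83D DC7D is returned as 0001F47D by both functions; only the
trailing unpaired D83D is treated differently.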

> Also regarding 1. you can always define an API that has this property
> regardless of the actual storage, it's only that your indexing operations
> might be costly as they do not directly map to the underlying storage array.

> That being said I don't think direct indexing/iterating for Unicode text
> is such an interesting operation due of course to the
> normalization/segmentation issues. Basically if your API provides them I
> only see these indexes as useful ways to define substrings. APIs that
> identify/iterate boundaries (and thus substrings) are more interesting due
> to the nature of Unicode text.

I agree that iteration is a very common case. But quite often
implementations need to have at least opaque indexes (as discussed).

> > If the programming language provides for such a primitive datatype, that
> > is possible. That would mean at a minimum that casting/converting to that
> > datatype from other numerical datatypes would require bounds-checking and
> > throwing an exception for values outside of [0x0000..0xD7FF
> > 0xE000..0x10FFFF].
> Yes. But note that in practice if you are in 1. above you usually perform
> this only at the point of decoding where you are already performing a lot
> of other checks. Once done you no longer need to check anything as long as
> the operations you perform on the values preserve the invariant. Also
> converting back to an integer if you need one is a no-op: it's the identity
> function.

If it is a real datatype, with strong guarantees that it *never* contains
values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
from a number will require checking. And in my experience, without a strong
guarantee the datatype is in practice pretty useless.
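
As a rough illustration of what such a checked conversion might look like,
here is a small Python sketch. The class name ScalarValue and the helper
is_scalar_value are hypothetical and exist only for this example. Python
cannot truly hide the private field, so the guarantee here is by convention
rather than enforced by the type system.

```python
def is_scalar_value(n):
    """True if n lies in [0x0000..0xD7FF] or [0xE000..0x10FFFF]."""
    return 0x0000 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


class ScalarValue:
    """Hypothetical wrapper around an int that is checked on construction."""

    __slots__ = ("_value",)

    def __init__(self, n):
        # Every conversion from a plain number goes through this bounds check.
        if not is_scalar_value(n):
            raise ValueError(f"not a Unicode scalar value: {n:#x}")
        self._value = n

    def __int__(self):
        # Converting back to an integer needs no check.
        return self._value


print(hex(int(ScalarValue(0x1F47D))))  # 0x1f47d

try:
    ScalarValue(0xD83D)  # a surrogate code point is rejected
except ValueError as err:
    print(err)  # not a Unicode scalar value: 0xd83d
```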

> The OCaml Uchar module does this. This is the interface:
>   https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli
> which defines the type t as abstract and here is the implementation:
>   https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml
> which defines the implementation of type t = int which means values of
> this type are an *unboxed* OCaml integer (and will be stored as such in say
> an OCaml array). However since the module system enforces type abstraction
> the only way of creating such values is to use the constants or the
> constructors (e.g. of_int) which all maintain the scalar value invariant
> (if you disregard the unsafe_* functions).
> Note that it would perfectly be possible to adopt a similar approach in C
> via a typedef though given C's rather loose type system a little bit more
> discipline would be required from the programmer (always go through the
> constructor functions to create values of the type).

That's the C motto: "requiring a 'bit more' discipline from programmers"


> Best,
> Daniel