Unicode String Models
Mark Davis ☕️ via Unicode
unicode at unicode.org
Wed Oct 3 08:41:42 CDT 2018
On Wed, Oct 3, 2018 at 3:01 PM Daniel Bünzli <daniel.buenzli at erratique.ch> wrote:
> On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (
> unicode at unicode.org) wrote:
> > There are two main choices for a scalar-value API:
> > 1. Guarantee that the storage never contains surrogates. This is the
> > simplest model.
> > 2. Substitute U+FFFD for surrogates when the API returns code
> > points. This can be done where #1 is not feasible, such as where the API is
> > a shim on top of a (perhaps large) IO buffer of 16-bit code units
> > that are not guaranteed to be UTF-16. The cost is extra tests on every
> > code point access.
> I'm not sure 2. really makes sense in practice: it would mean you can't
> access scalar values
> which need surrogates to be encoded.
Let me clear that up; I meant that "the underlying storage never contains
something that would need to be represented as a surrogate code point." Of
course, UTF-16 does need surrogate code units. What #1 would be excluding
in the case of UTF-16 would be unpaired surrogates. That is, suppose the
underlying storage is UTF-16 code units that don't satisfy #1.
0061 D83D DC7D 0061 D83D
A code point API would return for those a sequence of 4 values, the last of
which would be a surrogate code point.
00000061, 0001F47D, 00000061, 0000D83D
A scalar value API would return for those also 4 values, but since we
aren't in #1, it would need to remap.
00000061, 0001F47D, 00000061, 0000FFFD
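The two result sequences above can be reproduced with a short Python sketch. The function names (code_points, scalar_values) are hypothetical and chosen only for illustration; the storage is modeled as a plain list of 16-bit integers, which is an assumption rather than any particular library's representation.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF


def code_points(units):
    """Yield code points from 16-bit code units.

    Well-formed surrogate pairs are combined; an unpaired surrogate is
    passed through unchanged as a surrogate code point.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if is_high_surrogate(u) and i + 1 < n and is_low_surrogate(units[i + 1]):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


def scalar_values(units):
    """Yield scalar values: same as code_points, but remap lone surrogates."""
    for cp in code_points(units):
        yield REPLACEMENT if 0xD800 <= cp <= 0xDFFF else cp


units = [0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D]
as_code_points = list(code_points(units))
as_scalar_values = list(scalar_values(units))
```

The extra range test inside scalar_values is the per-access cost mentioned for option 2; under option 1 the storage invariant would make that test unnecessary.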
> Also regarding 1. you can always define an API that has this property
> regardless of the actual storage, it's only that your indexing operations
> might be costly as they do not directly map to the underlying storage array.
> That being said I don't think direct indexing/iterating for Unicode text
> is such an interesting operation due of course to the
> normalization/segmentation issues. Basically if your API provides them I
> only see these indexes as useful ways to define substrings. APIs that
> identify/iterate boundaries (and thus substrings) are more interesting due
> to the nature of Unicode text.
I agree that iteration is a very common case. But quite often
implementations need to have at least opaque indexes (as discussed).
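One possible reading of "opaque indexes" is sketched below in Python: positions are handed out only by boundary-finding functions and are useful mainly for delimiting substrings. The Index class and the functions start, next_index, and substring are hypothetical names, and the storage is again assumed to be a list of 16-bit code units.

```python
class Index:
    """Opaque position in a buffer of 16-bit code units.

    Callers are expected to obtain values only from start() and
    next_index(), and never to inspect or compute the offset themselves.
    """
    __slots__ = ("_offset",)

    def __init__(self, offset):
        self._offset = offset


def start(units):
    return Index(0)


def at_end(units, idx):
    return idx._offset >= len(units)


def next_index(units, idx):
    """Return the index just past the code point that begins at idx."""
    i = idx._offset
    if i >= len(units):
        raise IndexError("already at end of buffer")
    u = units[i]
    paired = (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
              and 0xDC00 <= units[i + 1] <= 0xDFFF)
    return Index(i + 2 if paired else i + 1)


def substring(units, a, b):
    """Code units between two indexes obtained from the functions above."""
    return units[a._offset:b._offset]


units = [0x0061, 0xD83D, 0xDC7D, 0x0061]
i0 = start(units)
i1 = next_index(units, i0)
i2 = next_index(units, i1)
second = substring(units, i1, i2)  # the surrogate pair, kept intact
```

Because an index can only be produced at a code point boundary, a substring built from two of them never splits a surrogate pair. The same shape would extend to grapheme or word boundaries by swapping the stepping function.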
> > If the programming language provides for such a primitive datatype, that
> > is possible. That would mean at a minimum that casting/converting to that
> > datatype from other numerical datatypes would require bounds-checking and
> > throwing an exception for values outside of [0x0000..0xD7FF
> > 0xE000..0x10FFFF].
> Yes. But note that in practice if you are in 1. above you usually perform
> this only at the point of decoding where you are already performing a lot
> of other checks. Once done you no longer need to check anything as long as
> the operations you perform on the values preserve the invariant. Also
> converting back to an integer if you need one is a no-op: it's the identity
> function.
If it is a real datatype, with strong guarantees that it *never* contains
values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
from number will require checking. And in my experience, without a strong
guarantee the datatype is in practice pretty useless.
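A minimal Python sketch of such a datatype follows, assuming a hypothetical ScalarValue class whose only constructor checks the range. Python cannot seal the private field the way an abstract type in a module system can, so the guarantee here holds only by convention.

```python
def is_scalar_value(n):
    """True for integers in [0x0000..0xD7FF] or [0xE000..0x10FFFF]."""
    return 0x0000 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


class ScalarValue:
    """Wrapper that refuses surrogates and out-of-range integers."""
    __slots__ = ("_n",)

    def __init__(self, n):
        # Every conversion from a number goes through this check.
        if not is_scalar_value(n):
            raise ValueError(f"not a Unicode scalar value: {n:#x}")
        self._n = n

    def to_int(self):
        # Converting back needs no check: it just returns the stored integer.
        return self._n

    def __eq__(self, other):
        return isinstance(other, ScalarValue) and self._n == other._n

    def __hash__(self):
        return hash(self._n)

    def __repr__(self):
        return f"ScalarValue(U+{self._n:04X})"


alien = ScalarValue(0x1F47D)

try:
    ScalarValue(0xD83D)  # a surrogate code point is rejected
    rejected = False
except ValueError:
    rejected = True
```

The asymmetry is the point of both positions in the thread: construction pays for a bounds check, while everything downstream of a successful construction can rely on the invariant without re-testing it.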
> The OCaml Uchar module does this. This is the interface:
> which defines the type t as abstract and here is the implementation:
> which defines the implementation of type t = int which means values of
> this type are an *unboxed* OCaml integer (and will be stored as such in say
> an OCaml array). However since the module system enforces type abstraction
> the only way of creating such values is to use the constants or the
> constructors (e.g. of_int) which all maintain the scalar value invariant
> (if you disregard the unsafe_* functions).
> Note that it would perfectly be possible to adopt a similar approach in C
> via a typedef though given C's rather loose type system a little bit more
> discipline would be required from the programmer (always go through the
> constructor functions to create values of the type).
That's the C motto: "requiring a 'bit more' discipline from programmers"