Unicode String Models

Wed Oct 3 02:17:10 CDT 2018

Mark

On Tue, Oct 2, 2018 at 8:31 PM Daniel Bünzli <daniel.buenzli at erratique.ch>
wrote:

> On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (
> unicode at unicode.org) wrote:
>
> > Because of performance and storage consideration, you need to consider
> the
> > possible internal data structures when you are looking at something as
> > low-level as strings. But most of the 'model's in the document are only
> > really distinguished by API, only the "Code Point model" discussions are
> > segmented by internal storage, as with "Code Point Model: UTF-32"
>
> I guess my gripe with the presentation of that document is that it
> perpetuates the problem of confusing "unicode characters" (or integers, or
> scalar values) and their *encoding* (how to represent these integers as
> byte sequences) which a source of endless confusion among programmers.
>
> This confusion is easy lifted once you explain that there exists certain
> integers, the scalar values, which are your actual characters and then you
> have different ways of encoding your characters; one can then explain that
> a surrogate is not a character per se, it's a hack and there's no point in
> indexing them except if you want trouble.
>
> This may also suggest another taxonomy of classification for the APIs,
> those in which you work directly with the character data (the scalar
> values) and those in which you work with an encoding of the actual
> character data (e.g. a JavaScript string).
>

Thanks for the feedback. It is worth adding a discussion of the issues,
perhaps something like:

A code-point-based API takes and returns int32's, although only a small
subset of the values are valid code points, namely 0x0..0x10FFFF. (In
practice some APIs may support returning -1 to signal an error or
termination, such as before or after the end of a string.) A surrogate code
point is one in U+D800..U+DFFF; these reflect a range of special code units
used in pairs in UTF-16 for representing code points above U+FFFF. A scalar
value is a code point that is not a surrogate.

A scalar-value API for immutable strings requires that no surrogate code
points are ever returned. In practice, the main advantage of that API is
that round-tripping to UTF-8/16 is guaranteed. Otherwise, a leaked
surrogate code point is relatively harmless: Unicode properties are devised
so that clients can essentially treat them as (permanently) unassigned
characters. Warning: an iterator should *never* avoid returning surrogate
code points by skipping them; that can cause security problems; see
https://www.unicode.org/reports/tr36/tr36-7.html#Substituting_for_Ill_Formed_Subsequences
and
https://www.unicode.org/reports/tr36/tr36-7.html#Deletion_of_Noncharacters.

There are two main choices for a scalar-value API:

   1. Guarantee that the storage never contains surrogates. This is the
   simplest model.
   2. Substitute U+FFFD for surrogates when the API returns code
   points. This can be done where #1 is not feasible, such as where the API is
   a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code units
   that are not guaranteed to be UTF-16. The cost is extra tests on every code
   point access.

> > In reality, most APIs are not even going to be in terms of code points:
> > they will return int32's.
>
> That reality depends on your programming language. If the latter supports
> type abstraction you can define an abstract type for scalar values (whose
> implementation may simply be an integer). If you always go through the
> constructor to create these "integers" you can maintain the invariant that
> a value of this type is an integer in the ranges [0x0000;0xD7FF] and
> [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you
> feed your "character" data to other processes like UTF-X encoders: it
> guarantees the correctness of their outputs regardless of what the
> programmer does.
>

If the programming language provides for such a primitive datatype, that is
possible. That would mean at a minimum that casting/converting to that
datatype from other numerical datatypes would require bounds-checking and
throwing an exception for values outside of [0x0000..0xD7FF
0xE000..0x10FFFF]. Most common-use programming languages that I know of
don't support that for primitives; the API would have to use a class, which
would be so very painful for performance/storage. If you (or others) know
of languages that do have such a cheap primitive datatype, that would be
worth mentioning!

> Best,
>
> Daniel
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181003/478c2365/attachment.html>