Unicode String Models

Daniel Bünzli via Unicode unicode at unicode.org
Tue Oct 2 13:31:02 CDT 2018

On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode at unicode.org) wrote:

> Because of performance and storage consideration, you need to consider the
> possible internal data structures when you are looking at something as
> low-level as strings. But most of the 'model's in the document are only
> really distinguished by API, only the "Code Point model" discussions are
> segmented by internal storage, as with "Code Point Model: UTF-32"

I guess my gripe with the presentation of that document is that it perpetuates the problem of confusing "Unicode characters" (or integers, or scalar values) and their *encoding* (how to represent these integers as byte sequences), which is a source of endless confusion among programmers.

This confusion is easily lifted once you explain that there exist certain integers, the scalar values, which are your actual characters, and that you then have different ways of encoding your characters; one can then explain that a surrogate is not a character per se, it's a hack, and there's no point in indexing them except if you want trouble.
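To make the distinction concrete, here is a small sketch in Python (my choice of language for illustration, nothing in the document under discussion prescribes it). It takes a single scalar value, U+1F600, and shows three different encodings of that one character; it also shows that a lone surrogate is rejected by the encoder since it is not a scalar value.

```python
# One scalar value, i.e. one character in the sense used above.
s = "\U0001F600"
scalar = ord(s)                  # the integer 0x1F600

# Three encodings of that same single character.
utf8 = s.encode("utf-8")         # 4 bytes
utf16 = s.encode("utf-16-be")    # 2 code units: a surrogate pair
utf32 = s.encode("utf-32-be")    # 1 code unit

print(hex(scalar))   # 0x1f600
print(utf8.hex())    # f09f9880
print(utf16.hex())   # d83dde00
print(utf32.hex())   # 0001f600

# A surrogate on its own is an artifact of UTF-16, not a character:
# Python's strict UTF-8 encoder refuses it.
try:
    "\ud83d".encode("utf-8")
    lone_surrogate_encodes = True
except UnicodeEncodeError:
    lone_surrogate_encodes = False

print(lone_surrogate_encodes)  # False
```

The character is the same in all three cases; only the byte sequences differ. The two 16-bit units d83d and de00 only mean something as a pair.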

This may also suggest another taxonomy for the APIs: those in which you work directly with the character data (the scalar values) and those in which you work with an encoding of the actual character data (e.g. a JavaScript string).

> In reality, most APIs are not even going to be in terms of code points:
> they will return int32's. 

That reality depends on your programming language. If the latter supports type abstraction you can define an abstract type for scalar values (whose implementation may simply be an integer). If you always go through the constructor to create these "integers" you can maintain the invariant that a value of this type is an integer in the ranges [0x0000;0xD7FF] and [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you feed your "character" data to other processes like UTF-X encoders: it guarantees the correctness of their outputs regardless of what the programmer does.
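A minimal sketch of that idea follows, in Python for illustration only. The names `ScalarValue`, `to_int` and `encode_utf8` are hypothetical, made up for this example. Note that Python only enforces the abstraction by convention (the underscore-prefixed field); in a language with real abstract types the compiler would prevent clients from bypassing the constructor.

```python
class ScalarValue:
    """An integer known to lie in [0x0000;0xD7FF] or [0xE000;0x10FFFF]."""

    __slots__ = ("_n",)

    def __init__(self, n: int) -> None:
        # The constructor is the only place where the range is checked.
        if not isinstance(n, int):
            raise TypeError("a scalar value must be built from an int")
        if not (0x0000 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF):
            raise ValueError(f"not a Unicode scalar value: {n:#x}")
        self._n = n

    def to_int(self) -> int:
        return self._n


def encode_utf8(u: ScalarValue) -> bytes:
    """Encode one scalar value as UTF-8.

    No range or surrogate check is needed here: the invariant of
    ScalarValue already rules out every input with no UTF-8 encoding.
    """
    n = u.to_int()
    if n <= 0x7F:
        return bytes([n])
    if n <= 0x7FF:
        return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
    if n <= 0xFFFF:
        return bytes([0xE0 | (n >> 12),
                      0x80 | ((n >> 6) & 0x3F),
                      0x80 | (n & 0x3F)])
    return bytes([0xF0 | (n >> 18),
                  0x80 | ((n >> 12) & 0x3F),
                  0x80 | ((n >> 6) & 0x3F),
                  0x80 | (n & 0x3F)])
```

With this, `encode_utf8(ScalarValue(0x1F600))` yields the bytes f0 9f 98 80, while `ScalarValue(0xD800)` raises `ValueError` before any encoder gets to see the surrogate.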


