Unicode String Models
Henri Sivonen via Unicode
unicode at unicode.org
Thu Nov 22 04:27:31 CST 2018
On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ <mark at macchiato.com> wrote:
> * The Python 3.3 model mentions the disadvantages of memory usage
>> cliffs but doesn't mention the associated performance cliffs. It would
>> be good to also mention that when a string manipulation causes the
>> storage to expand or contract, there's a performance impact that's not
>> apparent from the nature of the operation if the programmer's
>> intuition works on the assumption that the programmer is dealing with
> The focus was on immutable string models, but I didn't make that clear.
> Added some text.
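
To make the cliff concrete, here is a rough sketch in Rust of how a Python 3.3-style string picks its storage width from the widest code point it contains. The function names bytes_per_code_point and payload_bytes are hypothetical, and the sketch only counts payload bytes (it ignores object headers and any cached representations), so treat it as an illustration rather than a description of CPython internals.

```rust
/// Storage width a Python 3.3-style string would pick for `text`:
/// 1 byte if every code point is <= U+00FF, 2 bytes if every code
/// point is in the BMP, otherwise 4 bytes.
/// (Hypothetical helper for illustration only.)
fn bytes_per_code_point(text: &str) -> usize {
    match text.chars().map(|c| c as u32).max() {
        None => 1, // an empty string needs no wide storage
        Some(max) if max <= 0xFF => 1,
        Some(max) if max <= 0xFFFF => 2,
        Some(_) => 4,
    }
}

/// Payload size in bytes, ignoring any per-object header.
fn payload_bytes(text: &str) -> usize {
    bytes_per_code_point(text) * text.chars().count()
}
```

Under these assumptions, appending a single non-BMP character to a 1000-character ASCII string takes the payload from 1000 bytes to 4004 bytes, and every existing character has to be rewritten at the new width. That rewrite is the performance cost that isn't apparent from the nature of the operation.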
> * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
>> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
>> optionally, HotSpot
>> That is, text has UTF-16 semantics, but if the high half of every code
>> unit in a string is zero, only the lower half is stored. This has
>> properties analogous to the Python 3.3 model, except non-BMP doesn't
>> expand to UTF-32 but uses UTF-16 surrogate pairs.
> Thanks, will add.
V8 source code shows it has a OneByteString storage option:
. From hearsay, I'm convinced that it means Latin1, but I've failed to find
a clear quotable statement from a V8 developer to that effect.
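
For reference, a minimal sketch of the UTF-16/Latin1 idea in Rust. The type CompactString and its methods are hypothetical names made up for this example; this is not how SpiderMonkey, Gecko, V8 or HotSpot actually lay out their strings, only an illustration of "UTF-16 semantics, one byte per code unit when every high half is zero".

```rust
/// A string with UTF-16 semantics that stores one byte per code unit
/// when the high half of every code unit is zero.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
enum CompactString {
    Latin1(Vec<u8>),
    Utf16(Vec<u16>),
}

impl CompactString {
    fn from_text(text: &str) -> Self {
        let units: Vec<u16> = text.encode_utf16().collect();
        if units.iter().all(|&u| u <= 0xFF) {
            // Every high half is zero: keep only the low halves.
            CompactString::Latin1(units.iter().map(|&u| u as u8).collect())
        } else {
            // Non-BMP characters stay as surrogate pairs; no UTF-32 step.
            CompactString::Utf16(units)
        }
    }

    /// Length in UTF-16 code units, whichever storage is in use.
    fn len(&self) -> usize {
        match self {
            CompactString::Latin1(bytes) => bytes.len(),
            CompactString::Utf16(units) => units.len(),
        }
    }

    /// The code unit at `index`, zero-extended if stored as Latin1.
    fn code_unit_at(&self, index: usize) -> Option<u16> {
        match self {
            CompactString::Latin1(bytes) => bytes.get(index).map(|&b| u16::from(b)),
            CompactString::Utf16(units) => units.get(index).copied(),
        }
    }

    /// Bytes used by the code unit storage.
    fn storage_bytes(&self) -> usize {
        match self {
            CompactString::Latin1(bytes) => bytes.len(),
            CompactString::Utf16(units) => units.len() * 2,
        }
    }
}
```

Callers only ever see UTF-16 code units through len and code_unit_at, so the storage choice is invisible to the semantics. The widest storage is two bytes per code unit, which is the difference from the Python 3.3 model noted above.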
> 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
>> have a different type in the type system than byte buffers. To go from
>> a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
>> has been tagged as valid UTF-8, the validity is trusted completely so
>> that iteration by code point does not have "else" branches for
>> malformed sequences. If data that the type system indicates to be
>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
>> language has a default "safe" side and an opt-in "unsafe" side. The
>> unsafe side is for performing low-level operations in a way where the
>> responsibility of upholding invariants is moved from the compiler to
>> the programmer. It's impossible to violate the UTF-8 validity
>> invariant using the safe part of the language.
> Added a quote based on this; please check if it is ok.
Looks accurate. Thanks.
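
In case a concrete illustration helps, here is a small sketch using the Rust standard library. The wrapper function names are hypothetical and exist only for this example; std::str::from_utf8, std::str::from_utf8_unchecked and str::chars are the real standard library items.

```rust
/// Safe side: going from a byte buffer to a string slice checks
/// UTF-8 validity once, at the type boundary.
fn tag_as_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Once the data has type `&str`, validity is trusted: the caller
/// iterating by code point has no error case to handle.
fn count_scalar_values(text: &str) -> usize {
    text.chars().count()
}

/// Unsafe side: the validity check is skipped and upholding the
/// invariant becomes the caller's responsibility.
///
/// # Safety
/// `bytes` must be valid UTF-8; otherwise later use of the returned
/// slice is undefined behavior.
unsafe fn tag_without_checking(bytes: &[u8]) -> &str {
    unsafe { std::str::from_utf8_unchecked(bytes) }
}
```

Safe code can only obtain a &str through checked paths like the first function, so the invariant can't be violated without writing the word unsafe somewhere.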
hsivonen at hsivonen.fi