Unicode String Models

Thu Nov 22 04:27:31 CST 2018

On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ <mark at macchiato.com> wrote:

>
>>   * The Python 3.3 model mentions the disadvantages of memory usage
>> cliffs but doesn't mention the associated performance cliffs. It would
>> be good to also mention that when a string manipulation causes the
>> storage to expand or contract, there's a performance impact that's not
>> apparent from the nature of the operation if the programmer's
>> intuition works on the assumption that the programmer is dealing with
>> UTF-32.
>>
>
> The focus was on immutable string models, but I didn't make that clear.
> Added some text.
>

Thanks.
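
To make the performance cliff concrete, here is a minimal sketch in Rust. It
is not CPython's implementation: `FlexString` and its methods are hypothetical
names, and the type is mutable only to keep the example short. It stores 1, 2
or 4 bytes per code point, and appending a single character that doesn't fit
the current width forces the whole buffer to be re-encoded.

```rust
/// Hypothetical flexible-width storage, loosely modeled on the Python 3.3 idea.
#[derive(Debug, PartialEq)]
enum FlexString {
    Latin1(Vec<u8>), // every code point <= U+00FF
    Ucs2(Vec<u16>),  // every code point <= U+FFFF
    Ucs4(Vec<u32>),  // anything else
}

impl FlexString {
    fn new() -> Self {
        FlexString::Latin1(Vec::new())
    }

    /// Bytes of storage used per code point.
    fn width(&self) -> usize {
        match self {
            FlexString::Latin1(_) => 1,
            FlexString::Ucs2(_) => 2,
            FlexString::Ucs4(_) => 4,
        }
    }

    /// Appends one char. Returns true if the whole buffer had to be
    /// re-encoded at a wider width, which is the hidden O(n) cost.
    fn push(&mut self, c: char) -> bool {
        let cp = c as u32;
        let needed = if cp <= 0xFF {
            1
        } else if cp <= 0xFFFF {
            2
        } else {
            4
        };
        let widened = needed > self.width();
        if widened {
            // Copy every existing code point into a wider buffer.
            let old: Vec<u32> = match self {
                FlexString::Latin1(v) => v.iter().map(|&b| b as u32).collect(),
                FlexString::Ucs2(v) => v.iter().map(|&u| u as u32).collect(),
                FlexString::Ucs4(v) => v.clone(),
            };
            *self = if needed == 2 {
                FlexString::Ucs2(old.iter().map(|&u| u as u16).collect())
            } else {
                FlexString::Ucs4(old)
            };
        }
        // The new code point now fits the current width.
        match self {
            FlexString::Latin1(v) => v.push(cp as u8),
            FlexString::Ucs2(v) => v.push(cp as u16),
            FlexString::Ucs4(v) => v.push(cp),
        }
        widened
    }
}
```

A programmer who thinks of the string as UTF-32 would expect every `push` to
cost the same, but here the cost depends on what the string already contains.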

>>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
>> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
>> optionally, HotSpot
>> (
>> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
>> ).
>> That is, text has UTF-16 semantics, but if the high half of every code
>> unit in a string is zero, only the lower half is stored. This has
>> properties analogous to the Python 3.3 model, except non-BMP doesn't
>> expand to UTF-32 but uses UTF-16 surrogate pairs.
>>
>
> Thanks, will add.
>

V8 source code shows it has a OneByteString storage option:
https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium&g=0&l=494
. From hearsay, I'm convinced that it means Latin1, but I've failed to find
a clear quotable statement from a V8 developer to that effect.
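
For readers unfamiliar with the UTF-16/Latin1 model, the following is a
minimal sketch in Rust of the storage decision described above. It is not
taken from V8, SpiderMonkey, Gecko or HotSpot; `CompactUtf16`, `from_text`
and `code_unit_at` are hypothetical names. The semantics are always UTF-16,
and only the backing storage changes.

```rust
/// Hypothetical UTF-16 string with an optional one-byte-per-unit storage.
#[derive(Debug, PartialEq)]
enum CompactUtf16 {
    OneByte(Vec<u8>),  // high half of every code unit was zero
    TwoByte(Vec<u16>), // full UTF-16, non-BMP stays as surrogate pairs
}

impl CompactUtf16 {
    /// Picks the narrow storage only if every UTF-16 code unit fits in 8 bits.
    fn from_text(s: &str) -> Self {
        let units: Vec<u16> = s.encode_utf16().collect();
        if units.iter().all(|&u| u <= 0xFF) {
            CompactUtf16::OneByte(units.iter().map(|&u| u as u8).collect())
        } else {
            CompactUtf16::TwoByte(units)
        }
    }

    /// Length in UTF-16 code units, independent of the storage used.
    fn len(&self) -> usize {
        match self {
            CompactUtf16::OneByte(v) => v.len(),
            CompactUtf16::TwoByte(v) => v.len(),
        }
    }

    /// Reads a code unit with UTF-16 semantics from either storage.
    fn code_unit_at(&self, i: usize) -> Option<u16> {
        match self {
            CompactUtf16::OneByte(v) => v.get(i).map(|&b| b as u16),
            CompactUtf16::TwoByte(v) => v.get(i).copied(),
        }
    }
}
```

Unlike the Python 3.3 model, there is no third, 32-bit tier: a non-BMP
character moves the string to the two-byte storage and occupies two code
units there.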

>>   3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
>> have a different type in the type system than byte buffers. To go from
>> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
>> has been tagged as valid UTF-8, the validity is trusted completely so
>> that iteration by code point does not have "else" branches for
>> malformed sequences. If data that the type system indicates to be
>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
>> language has a default "safe" side and an opt-in "unsafe" side. The
>> unsafe side is for performing low-level operations in a way where the
>> responsibility of upholding invariants is moved from the compiler to
>> the programmer. It's impossible to violate the UTF-8 validity
>> invariant using the safe part of the language.
>>
>
> Added a quote based on this; please check if it is ok.
>

Looks accurate. Thanks.
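
As an illustration of the Rust model, here is a minimal sketch. The standard
library functions `std::str::from_utf8` and `std::str::from_utf8_unchecked`
are real; the wrapper names `tag_as_utf8`, `count_non_ascii` and
`tag_without_checking` are hypothetical and exist only for this example.

```rust
use std::str;

/// Checked conversion: in safe code, bytes become `&str` only after
/// validation, so invalid input is rejected here, at the boundary.
fn tag_as_utf8(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Iteration by code point over text the type system already vouches for.
/// `chars()` yields `char` directly; there is no branch for malformed input.
fn count_non_ascii(text: &str) -> usize {
    text.chars().filter(|c| !c.is_ascii()).count()
}

/// The opt-in unsafe side: validation is skipped and the caller takes over
/// the responsibility for the invariant.
///
/// # Safety
/// `bytes` must be valid UTF-8; otherwise behavior is undefined.
unsafe fn tag_without_checking(bytes: &[u8]) -> &str {
    unsafe { str::from_utf8_unchecked(bytes) }
}
```

Only the last function can produce a `&str` that isn't valid UTF-8, and it
can't be called without writing `unsafe` at the call site.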

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/