Unicode String Models

Thu Nov 22 05:24:49 CST 2018

Thanks for the review! In case you're interested, I'd also welcome feedback
on Locale Identifiers <https://goo.gl/kizkrm>

Mark

On Thu, Nov 22, 2018 at 11:27 AM Henri Sivonen <hsivonen at hsivonen.fi> wrote:

> On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ <mark at macchiato.com> wrote:
>
>>
>>   * The Python 3.3 model mentions the disadvantages of memory usage
>>> cliffs but doesn't mention the associated performance cliffs. It would
>>> be good to also mention that when a string manipulation causes the
>>> storage to expand or contract, there's a performance impact that's not
>>> apparent from the nature of the operation if the programmer's
>>> intuition works on the assumption that the programmer is dealing with
>>> UTF-32.
>>>
>>
>> The focus was on immutable string models, but I didn't make that clear.
>> Added some text.
>>
>
> Thanks.
>
>
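
To make the storage cliff discussed above concrete, here is a minimal sketch, assuming CPython 3.3 or later (other Python implementations may store strings differently). The helper name bytes_per_char is made up for this illustration. It estimates the per-character storage width from sys.getsizeof and shows that appending a single non-BMP character to an ASCII string yields a result stored at four bytes per character.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage for a string consisting only of `ch`.

    Hypothetical helper: it subtracts the size of a 1-character string from
    the size of an (n + 1)-character string, so the fixed object header
    cancels out. Assumes CPython's PEP 393 flexible string representation.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


# One sample character per storage class: ASCII, Latin-1, BMP, non-BMP.
widths = {
    "ascii": bytes_per_char("a"),
    "latin1": bytes_per_char("\u00e9"),
    "bmp": bytes_per_char("\u20ac"),
    "non_bmp": bytes_per_char("\U0001F600"),
}

# Appending one non-BMP character forces the whole result to be stored wide.
ascii_text = "a" * 1000
widened = ascii_text + "\U0001F600"

if __name__ == "__main__":
    print(widths)
    print(sys.getsizeof(ascii_text), sys.getsizeof(widened))
```

Because Python strings are immutable, the widened result is a new object and every existing character is copied into the wider layout. That copy is the cost that is not apparent from the nature of the operation if one assumes a fixed-width UTF-32 model.
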
>>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
>>> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
>>> optionally, HotSpot
>>> (
>>> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
>>> ).
>>> That is, text has UTF-16 semantics, but if the high half of every code
>>> unit in a string is zero, only the lower half is stored. This has
>>> properties analogous to the Python 3.3 model, except non-BMP doesn't
>>> expand to UTF-32 but uses UTF-16 surrogate pairs.
>>>
>>
>> Thanks, will add.
>>
>
> V8 source code shows it has a OneByteString storage option:
> https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium&g=0&l=494
> . From hearsay, I'm convinced that it means Latin1, but I've failed to find
> a clear quotable statement from a V8 developer to that effect.
>
>
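
The UTF-16/Latin1 model described above can be sketched roughly as follows. This is only an illustration of the idea, not how SpiderMonkey, V8, or HotSpot implement it; the names compact_store and code_unit_length are invented here. Text keeps UTF-16 semantics, but if every code unit fits in one byte, only the low byte of each unit is stored.

```python
from typing import Tuple


def compact_store(text: str) -> Tuple[str, bytes]:
    """Pick a storage form for `text` (hypothetical sketch).

    If every code point is <= U+00FF, each UTF-16 code unit has a zero
    high half, so one byte per unit is enough. Otherwise store full
    UTF-16, where non-BMP characters become surrogate pairs.
    """
    if all(ord(c) <= 0xFF for c in text):
        return "latin1", text.encode("latin-1")
    # "surrogatepass" lets lone surrogates through, as UTF-16-semantics
    # strings in these engines can contain them.
    return "utf16", text.encode("utf-16-le", errors="surrogatepass")


def code_unit_length(tag: str, data: bytes) -> int:
    """Length in UTF-16 code units, independent of the storage form."""
    return len(data) if tag == "latin1" else len(data) // 2


if __name__ == "__main__":
    for sample in ["caf\u00e9", "a\U0001F600"]:
        tag, data = compact_store(sample)
        print(tag, len(data), code_unit_length(tag, data))
```

The observable length stays in UTF-16 code units in both forms, so a non-BMP character counts as two units and the string never widens to four bytes per character.
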
>>   3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
>>> have a different type in the type system than byte buffers. To go from
>>> a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
>>> has been tagged as valid UTF-8, the validity is trusted completely so
>>> that iteration by code point does not have "else" branches for
>>> malformed sequences. If data that the type system indicates to be
>>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
>>> language has a default "safe" side and an opt-in "unsafe" side. The
>>> unsafe side is for performing low-level operations in a way where the
>>> responsibility of upholding invariants is moved from the compiler to
>>> the programmer. It's impossible to violate the UTF-8 validity
>>> invariant using the safe part of the language.
>>>
>>
>> Added a quote based on this; please check if it is ok.
>>
>
> Looks accurate. Thanks.
>
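
As a rough illustration of the "validate once, then trust" idea in the type-system-tagged UTF-8 model, here is a sketch in Python. Python has no safe/unsafe split and cannot enforce the invariant the way Rust's type system does, so the class ValidUtf8 and its methods are hypothetical and only mimic the shape: validity is checked when bytes are wrapped, and the code point iterator has no branches for malformed sequences.

```python
from typing import Iterator


class ValidUtf8:
    """Bytes that were checked to be valid UTF-8 (hypothetical sketch)."""

    def __init__(self, data: bytes) -> None:
        # By convention, callers go through from_bytes(); nothing in
        # Python stops someone from bypassing the check.
        self._data = data

    @classmethod
    def from_bytes(cls, data: bytes) -> "ValidUtf8":
        # The single validation point: a strict decode raises
        # UnicodeDecodeError on any malformed sequence.
        data.decode("utf-8")
        return cls(bytes(data))

    def code_points(self) -> Iterator[int]:
        # Trusts the invariant: no "else" branch for malformed input.
        b = self._data
        i = 0
        while i < len(b):
            lead = b[i]
            if lead < 0x80:
                cp, n = lead, 1
            elif lead < 0xE0:
                cp, n = lead & 0x1F, 2
            elif lead < 0xF0:
                cp, n = lead & 0x0F, 3
            else:
                cp, n = lead & 0x07, 4
            for k in range(1, n):
                cp = (cp << 6) | (b[i + k] & 0x3F)
            yield cp
            i += n


if __name__ == "__main__":
    text = ValidUtf8.from_bytes("a\u00e9\u20ac\U0001F600".encode("utf-8"))
    print([hex(cp) for cp in text.code_points()])
```

If the constructor is called directly with invalid bytes, code_points() silently yields garbage or raises IndexError, which is a mild analogue of what happens when the invariant is broken on the unsafe side.
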
> --
> Henri Sivonen
> hsivonen at hsivonen.fi
> https://hsivonen.fi/
>