Unicode String Models

Tue Oct 2 07:04:09 CDT 2018

Mark

On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode <
unicode at unicode.org> wrote:

> On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
> <unicode at unicode.org> wrote:
> >
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> * The Grapheme Cluster Model seems to have a couple of disadvantages
> that are not mentioned:
>   1) The subunit of string is also a string (a short string conforming
> to particular constraints). There's a need for *another* more atomic
> mechanism for examining the internals of the grapheme cluster string.
>

I did mention this.
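
To make the first disadvantage concrete, here is a minimal sketch in Rust. It assumes the cluster has already been isolated by some segmentation step (not shown) and is handed in as a string slice. The function names cluster_internals and cluster_code_points are hypothetical, chosen only for illustration.

```rust
/// Returns (number of code points, number of UTF-8 code units) for a
/// string that the caller already knows to be one grapheme cluster.
/// Hypothetical helper; no segmentation is performed here.
fn cluster_internals(cluster: &str) -> (usize, usize) {
    (cluster.chars().count(), cluster.len())
}

/// Lists the code points inside the cluster as U+XXXX labels, which is
/// the "more atomic mechanism" needed to look inside the unit.
fn cluster_code_points(cluster: &str) -> Vec<String> {
    cluster
        .chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect()
}
```

Passing "e\u{301}" (e followed by a combining acute accent) yields two code points and three UTF-8 code units, so the "unit" of a grapheme-cluster string still has to be examined with code point or code unit iteration.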

>   2) The way an arbitrary string is divided into units when iterating
> over it changes when the program is executed on a newer version of the
> language runtime that is aware of newly-assigned codepoints from a
> newer version of Unicode.
>

Good point. I did mention the EGC definitions changing, but I should also
point out that a string containing currently unassigned code points may be
clustered differently under future versions. Will add.

>  * The Python 3.3 model mentions the disadvantages of memory usage
> cliffs but doesn't mention the associated performance cliffs. It would
> be good to also mention that when a string manipulation causes the
> storage to expand or contract, there's a performance impact that's not
> apparent from the nature of the operation if the programmer's
> intuition works on the assumption that the programmer is dealing with
> UTF-32.
>

The focus was on immutable string models, but I didn't make that clear.
Added some text.
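
The cliff can be sketched as follows, assuming a simplified flexible-width store that uses 1, 2, or 4 bytes per code point depending on the largest code point in the string. This is only an illustration of the idea, not CPython's actual layout; width_for and storage_bytes are hypothetical names, and object headers are ignored.

```rust
/// Bytes per code point that a flexible-width store would need for `s`:
/// 1 if everything fits in Latin-1, 2 if everything fits in the BMP,
/// 4 otherwise. An empty string is treated as width 1.
fn width_for(s: &str) -> usize {
    match s.chars().map(|c| c as u32).max().unwrap_or(0) {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        _ => 4,
    }
}

/// Payload size in bytes under this simplified model (no header).
fn storage_bytes(s: &str) -> usize {
    s.chars().count() * width_for(s)
}
```

Under these assumptions "hello" needs 5 bytes, while "hello" plus a single U+1F600 needs 24: producing the longer string means rewriting every existing code point at the wider width, which is the cost that is not apparent from the operation itself.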

>
>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
> optionally, HotSpot
> (
> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
> ).
> That is, text has UTF-16 semantics, but if the high half of every code
> unit in a string is zero, only the lower half is stored. This has
> properties analogous to the Python 3.3 model, except non-BMP doesn't
> expand to UTF-32 but uses UTF-16 surrogate pairs.
>

Thanks, will add.
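
A rough sketch of the storage decision described above, written in Rust for illustration only. The type CompactString and the functions compact_from_str and utf16_len are hypothetical and are not taken from SpiderMonkey, Gecko, V8, or HotSpot.

```rust
/// Two storage forms behind one UTF-16-semantics string.
#[derive(Debug, PartialEq)]
enum CompactString {
    /// Every UTF-16 code unit had a zero high half; only low halves kept.
    Latin1(Vec<u8>),
    /// At least one code unit needed its high half; stored as UTF-16.
    Utf16(Vec<u16>),
}

/// Chooses the narrow form when every code unit is <= 0xFF.
fn compact_from_str(s: &str) -> CompactString {
    let units: Vec<u16> = s.encode_utf16().collect();
    if units.iter().all(|&u| u <= 0xFF) {
        CompactString::Latin1(units.iter().map(|&u| u as u8).collect())
    } else {
        CompactString::Utf16(units)
    }
}

/// Length in UTF-16 code units, which is the same in both forms.
fn utf16_len(s: &CompactString) -> usize {
    match s {
        CompactString::Latin1(bytes) => bytes.len(),
        CompactString::Utf16(units) => units.len(),
    }
}
```

In this sketch a non-BMP character such as U+1F600 stays a surrogate pair (two code units) in the wide form rather than forcing a 32-bit representation, which is the difference from the Python 3.3 model noted in the quoted text.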

>
>  * I think the fact that systems that chose UTF-16 or UTF-32 have
> implemented models that try to save storage by omitting leading zeros
> and gaining complexity and performance cliffs as a result is a strong
> indication that UTF-8 should be recommended for newly-designed systems
> that don't suffer from a forceful legacy need to expose UTF-16 or
> UTF-32 semantics.
>
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
>
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
>
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust the ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
>
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.
>

Added a quote based on this; please check if it is ok.
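
For comparison, the three ingestion behaviours can be sketched in Rust using only the standard library. The function names ingest_gigo, ingest_lossy, and ingest_checked are hypothetical labels for the three models; the first two only approximate what Go and Gecko do.

```rust
/// Model 1 (garbage in, garbage out): bytes are stored untouched, so
/// malformed sequences survive until something interprets them.
fn ingest_gigo(input: &[u8]) -> Vec<u8> {
    input.to_vec()
}

/// Model 2: malformed sequences are replaced with U+FFFD on input, so
/// the data is valid UTF-8 right after ingestion.
fn ingest_lossy(input: &[u8]) -> String {
    String::from_utf8_lossy(input).into_owned()
}

/// Model 3: validity is checked once; on success the `&str` type
/// carries the guarantee, on failure no text value is produced.
fn ingest_checked(input: &[u8]) -> Option<&str> {
    std::str::from_utf8(input).ok()
}
```

Given the malformed input b"a\xFFb", the first function returns the same three bytes, the second returns "a\u{FFFD}b", and the third returns None, so the caller has to decide what to do before any text-typed value exists.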

>
>  * After working with different string models, I'd recommend the Rust
> model for newly-designed programming languages. (Not because I work
> for Mozilla but because I believe Rust's way of dealing with Unicode
> is the best I've seen.) Rust's standard library provides Unicode
> version-independent iterations over strings: by code unit and by code
> point. Iteration by extended grapheme cluster is provided by a library
> that's easy to include due to the nature of Rust package management
> (https://crates.io/crates/unicode_segmentation). Viewing a UTF-8
> buffer as a read-only byte buffer has zero run-time cost and allows
> for maximally fast guaranteed-valid-UTF-8 output.
>
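
The two version-independent iterations and the byte view mentioned above look roughly like this with Rust's standard library. The wrapper functions code_units, code_points, and as_output_bytes are hypothetical and exist only to give the calls names; iteration by extended grapheme cluster is not shown because it needs the external crate.

```rust
/// Iteration by UTF-8 code unit.
fn code_units(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

/// Iteration by code point (Unicode scalar value).
fn code_points(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Read-only byte view of the same buffer; nothing is copied or
/// re-validated, so it can be written out directly as UTF-8.
fn as_output_bytes(s: &str) -> &[u8] {
    s.as_bytes()
}
```

For "a\u{E9}\u{1F600}" this gives seven code units and three code points, and the slice returned by as_output_bytes points at the same memory as the string itself.
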
> --
> Henri Sivonen
> hsivonen at hsivonen.fi
> https://hsivonen.fi/
>
>