Unicode education in Schools

Philippe Verdy via Unicode unicode at unicode.org
Thu Aug 24 21:29:06 CDT 2017

Strings in Java and JavaScript are basically the same: both are arbitrary
sequences of 16-bit code units, not restricted to text with valid UTF-16
encoding. They differ in their sets of access methods, but both are
normally immutable, and both allow (but do not enforce) substrings to share
their backing store between distinct instances. The same applies to C/C++
"wide strings" when their code units are larger than 1 byte, but C/C++ do
not make them immutable, except through dedicated classes, which still
transiently allow setting their content through constructors. C/C++ wide
strings also exist with several signed and unsigned code unit types,
whereas Java has only unsigned 16-bit code units in its "char" type, and
JavaScript has no "char" type at all, only "Number" values, with range
restrictions applied when constructing String instances from code units or
from code point values.
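
To make the JavaScript side concrete, here is a small sketch (the variable
names are mine, chosen only for illustration). It shows a string holding an
unpaired surrogate, and the two different range restrictions applied by
String.fromCharCode and String.fromCodePoint:

```javascript
// A JavaScript string is a sequence of 16-bit code units; nothing
// forces those units to form valid UTF-16.
const lone = "a\uD800b"; // contains an unpaired high surrogate
const lengths = [lone.length, lone.charCodeAt(1).toString(16)];

// String.fromCharCode takes Numbers and keeps only their low 16 bits.
const wrapped = String.fromCharCode(0x10041); // same as "\u0041"

// String.fromCodePoint checks the range and throws on invalid input.
let rangeError = null;
try {
  String.fromCodePoint(0x110000); // one past the last code point
} catch (e) {
  rangeError = e;
}

// A supplementary code point becomes two code units (a surrogate pair).
const pair = String.fromCodePoint(0x1f600);

console.log(lengths, wrapped, rangeError instanceof RangeError, pair.length);
```

So the restriction is silent truncation in one constructor and an exception
in the other, and neither prevents an ill-formed string from existing.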

JavaScript should soon have a new numeric type (provisionally named
"BigInt": an arbitrary-precision integer in the current proposal, with
constants suffixed by "n", and with no implicit promotion from/to Number
but only explicit conversions by checked constructors) and new 64-bit
element types for mutable buffers (these only determine the range
restrictions applied by their write accessors; the values themselves are
still exchanged as "Number" 64-bit floating points or as the newer "BigInt"
integers).
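
As a rough illustration, assuming an engine that already ships BigInt and
the BigInt64Array typed array (the variable names are only for this sketch):

```javascript
// 2^53 + 1 is not exactly representable as a Number, but is as a BigInt.
const big = 9007199254740993n;
const kind = typeof big; // "bigint"

// Mixing BigInt and Number in arithmetic throws instead of promoting.
let mixError = null;
try {
  big + 1;
} catch (e) {
  mixError = e;
}

// Explicit conversions are checked: a non-integer Number is rejected.
let convError = null;
try {
  BigInt(1.5);
} catch (e) {
  convError = e;
}

// 64-bit typed-array elements reduce written values to the element range.
const cells = new BigInt64Array(1);
cells[0] = 2n ** 63n; // one past the largest signed 64-bit value
const wrappedCell = cells[0]; // wraps around to -(2n ** 63n)

console.log(kind, mixError instanceof TypeError,
            convError instanceof RangeError, wrappedCell);
```

Note that the write accessor wraps the value modulo 2^64 rather than
rejecting it, so the range restriction there is not a checked one.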

There are similar designs in Perl, PHP, and most other languages: Unicode
support and conformance for using these types as valid text is implemented
only by libraries, in their standard text APIs or in their I/O APIs. These
APIs take immutable strings or mutable buffers as parameters, and return
either sharable but immutable string instances or a mutable buffer (one
referenced on input or allocated internally). They are not restricted to
handling valid Unicode text, however, and allow their strings to be used
with any other encoding.
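
One example of conformance living in the I/O layer rather than in the
string type, sketched with the TextEncoder and TextDecoder globals found in
browsers and in Node.js (the sample data is arbitrary):

```javascript
const encoder = new TextEncoder(); // always produces UTF-8
const decoder = new TextDecoder(); // defaults to UTF-8, non-fatal mode

// The string itself may hold an unpaired surrogate...
const raw = "x\uDC00";

// ...but the encoder substitutes U+FFFD (bytes EF BF BD) for it.
const bytes = Array.from(encoder.encode(raw));

// Decoding invalid UTF-8 likewise yields U+FFFD instead of the raw byte.
const decoded = decoder.decode(new Uint8Array([0x41, 0xff]));

console.log(bytes, decoded === "A\uFFFD");
```

The string type accepted the ill-formed value without complaint; only the
encoding API enforced well-formed output.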

With immutable strings implemented as classes, the backing store is
normally not directly accessible, even by reference: you can only reference
the class instance, which internally references the backing store. That
store is implemented using mutable buffers, with an internal encoding that
may differ from the one exposed by the string class. An implementation may
compress the backing store on demand, and may implicitly atomize the most
frequently used string values, notably the empty string, strings holding a
single character whose code point value fits in 8 bits, and strings
containing any repetition of the same code point value. These values do not
need any internally allocated buffer for their backing store, so such
instances are allocated very fast and do not stress the garbage collector
when they are no longer used.

Even when these classes expose Unicode text handling methods, the Unicode
validation rules are not necessarily checked everywhere, so it is still
possible to have strings or buffers containing an unpaired surrogate. The
backing store may also allow storing code units outside the ranges used by
valid UTF-16 or valid UTF-32. (The backing stores are virtualized and could
even be on disk, swapped in on demand using reusable buffers from a pool.)
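
Since validation is left to the caller, code that needs well-formed UTF-16
has to check for itself. A minimal sketch of such a check follows; the
function name firstUnpairedSurrogate is hypothetical, not a standard API:

```javascript
// Returns the index of the first unpaired surrogate code unit,
// or -1 if the string is well-formed UTF-16.
function firstUnpairedSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return i;
      i++; // skip the low half of a valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return i; // low surrogate with no preceding high surrogate
    }
  }
  return -1;
}

console.log(firstUnpairedSurrogate("ok \u{1F600}")); // -1
console.log(firstUnpairedSurrogate("bad \uD83D!")); // 4
```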

2017-08-25 2:17 GMT+02:00 David Starner via Unicode <unicode at unicode.org>:

> ---------- Forwarded message ---------
> From: David Starner <prosfilaes at gmail.com>
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham <richard.wordingham at ntlworld.com>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
>> Just steer them away from UTF-16!  (And vigorously prohibit the very
>> concept of UCS-2).
>> Richard.
> Steer them away from reinventing the wheel. If they use Java, use Java
> strings. If they're using GTK, use strings compatible with GTK. If they're
> writing JavaScript, use JavaScript strings. There's basically no system
> without Unicode strings or that they would be better off rewriting the
> wheel.