"A Programmer's Introduction to Unicode"

Steffen Nurpmeso steffen at sdaoden.eu
Wed Mar 15 05:40:54 CDT 2017


"Doug Ewell" <doug at ewellic.org> wrote:
 |Philippe Verdy wrote:
 |>>> Well, you do have eleven bits for flags per codepoint, for example.
 |>>
 |>> That's not UCS-4; that's a custom encoding.
 |>>
 |>> (any UCS-4 code unit) & 0xFFE00000 == 0
 |
 |(changing to "UTF-32" per Ken's observation)
 |
 |> Per definition yes, but UTC-4 is not Unicode.
 |
 |I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
 |held in 1989?
 |
 |> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
 |> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which
 |> would allow 32 planes instead of just the 17 first ones).
 |
 |I used bitwise arithmetic strictly to address Steffen's premise that the
 |11 "unused bits" in a UTF-32 code unit were available to store metadata
 |about the code point. Of course UTF-32 does not allow 0x110000 through
 |0x1FFFFF either.
 |
 |> I suppose he meant 21 bits, not 11 bits which covers only a small part
 |> of the BMP.
 |
 |No, his comment "you do have eleven bits for flags per codepoint" pretty
 |clearly referred to using the "extra" 11 bits beyond what is needed to
 |hold the Unicode scalar value.

It surely is a weak argument for a general string encoding.  But
sometimes, for local use cases, it surely is valid.  You could,
for example, store both the wcwidth(3) result and a grapheme
cluster's codepoint count in these bits of the first codepoint of
the cluster, and then hide that storage detail behind an access
method interface.
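
To make that concrete, here is a rough sketch in C of what such an
interface might look like.  The bit layout and all names (cell_t,
cell_make() and so on) are made up for illustration, and the width
is passed in by the caller, on the assumption that a wcwidth(3)
result of -1 has already been mapped to 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of one 32-bit cell (an assumption of this sketch,
 * not anything defined by Unicode):
 *   bits  0..20  Unicode scalar value (at most 0x10FFFF)
 *   bits 21..22  column width as wcwidth(3) would report it, 0..2
 *   bits 23..31  number of codepoints in the grapheme cluster, 0..511
 * Only the first cell of a cluster carries a width and a count. */
#define CELL_CP_MASK     UINT32_C(0x001FFFFF)
#define CELL_WIDTH_SHIFT 21
#define CELL_WIDTH_MASK  UINT32_C(0x3)
#define CELL_COUNT_SHIFT 23
#define CELL_COUNT_MASK  UINT32_C(0x1FF)

typedef uint32_t cell_t;

/* Pack a scalar value plus its cluster metadata into one cell. */
static cell_t cell_make(uint32_t cp, unsigned width, unsigned count)
{
    assert(cp <= UINT32_C(0x10FFFF));
    assert(width <= CELL_WIDTH_MASK);
    assert(count <= CELL_COUNT_MASK);
    return cp
        | ((uint32_t)width << CELL_WIDTH_SHIFT)
        | ((uint32_t)count << CELL_COUNT_SHIFT);
}

/* Accessors: callers never touch the raw bits themselves. */
static uint32_t cell_codepoint(cell_t c)
{
    return c & CELL_CP_MASK;
}

static unsigned cell_width(cell_t c)
{
    return (unsigned)((c >> CELL_WIDTH_SHIFT) & CELL_WIDTH_MASK);
}

static unsigned cell_count(cell_t c)
{
    return (unsigned)((c >> CELL_COUNT_SHIFT) & CELL_COUNT_MASK);
}
```

A cell built this way is no longer a valid UTF-32 code unit, which
is Doug's point above, so anything leaving the module would have to
go through cell_codepoint() first.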

--steffen

