"A Programmer's Introduction to Unicode"
Steffen Nurpmeso
steffen at sdaoden.eu
Wed Mar 15 05:40:54 CDT 2017
"Doug Ewell" <doug at ewellic.org> wrote:
|Philippe Verdy wrote:
|>>> Well, you do have eleven bits for flags per codepoint, for example.
|>>
|>> That's not UCS-4; that's a custom encoding.
|>>
|>> (any UCS-4 code unit) & 0xFFE00000 == 0
|
|(changing to "UTF-32" per Ken's observation)
|
|> Per definition yes, but UTC-4 is not Unicode.
|
|I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
|held in 1989?
|
|> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
|> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which
|> would allow 32 planes instead of just the 17 first ones).
|
|I used bitwise arithmetic strictly to address Steffen's premise that the
|11 "unused bits" in a UTF-32 code unit were available to store metadata
|about the code point. Of course UTF-32 does not allow 0x110000 through
|0x1FFFFF either.
|
|> I suppose he meant 21 bits, not 11 bits which covers only a small part
|> of the BMP.
|
|No, his comment "you do have eleven bits for flags per codepoint" pretty
|clearly referred to using the "extra" 11 bits beyond what is needed to
|hold the Unicode scalar value.
It surely is a weak argument for a general string encoding. But
sometimes, for local use cases, it surely is valid. You could, for
example, store both the wcwidth(3) result and the cluster's
codepoint count in these bits of the first codepoint of a grapheme
cluster, and then hide that storage detail behind an access method
interface.
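
A minimal sketch of such a local scheme, in Python. The bit layout
(2 bits of display width, 9 bits of cluster length) and all names
below are assumptions made for illustration; nothing in the thread
or in the Unicode Standard prescribes them. The width is taken as a
parameter so that the caller decides how to obtain it, for example
from wcwidth(3).

```python
# Hypothetical 32-bit cell layout (an illustration, not a standard):
#   bits  0..20  code point (21 bits)
#   bits 21..22  display width, 0..3 (caller maps wcwidth's -1 itself)
#   bits 23..31  number of code points in the grapheme cluster, 0..511
CP_MASK = 0x001FFFFF      # low 21 bits hold the code point
FLAG_MASK = 0xFFE00000    # the eleven "extra" bits
WIDTH_SHIFT, WIDTH_MASK = 21, 0x3
COUNT_SHIFT, COUNT_MASK = 23, 0x1FF


def is_utf32(unit):
    """True if unit is a Unicode scalar value, i.e. valid UTF-32."""
    return 0 <= unit <= 0x10FFFF and not 0xD800 <= unit <= 0xDFFF


def pack(cp, width, count):
    """Combine a code point with its cluster metadata into one cell."""
    if not is_utf32(cp):
        raise ValueError("not a Unicode scalar value")
    if not 0 <= width <= WIDTH_MASK or not 0 <= count <= COUNT_MASK:
        raise ValueError("metadata does not fit into eleven bits")
    return cp | (width << WIDTH_SHIFT) | (count << COUNT_SHIFT)


# Accessors: callers never touch the raw bits directly.
def cell_code_point(cell):
    return cell & CP_MASK


def cell_width(cell):
    return (cell >> WIDTH_SHIFT) & WIDTH_MASK


def cell_count(cell):
    return (cell >> COUNT_SHIFT) & COUNT_MASK


# First code point of the cluster "e" + U+0301: one column, two code points.
first = pack(0x65, 1, 2)
```

A cell with any flag bit set is no longer a valid UTF-32 code unit,
so cell_code_point() has to be applied before the value leaves the
local data structure.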
--steffen