Whitespace characters in Unicode
doug at ewellic.org
Mon Aug 8 11:30:04 CDT 2016
Martin J. Dürst wrote:
>> encoded U+0080 - U+207F in two octets:
>> https://en.wikipedia.org/wiki/UTF-8 :
>> |10xxxxxx| |1xxxxxxx|
>> So, the space block /just barely makes it/. Was this intentional
>> during the original design of UTF-8, or just a coincidence? I think
>> it was more than a coincidence.
> Just a coincidence, I'd say. When designing such schemes, trying to be
> compact is obviously one of the goals. But "how can I design it so
> that these two characters still make it as two bytes" isn't.
For actual Unicode compression schemes (SCSU and BOCU-1), some design
elements do exist specifically so that character blocks "in widespread
use" fit in minimal space.
For byte-based UTFs, that wasn't a goal at all. ASCII in one byte was a
given -- for compatibility with existing software, not favoritism toward
English as was sometimes claimed -- but otherwise, algorithmic
simplicity and reasonable overall efficiency were more important than
optimizing for certain blocks.
Choosing between an encoding with offset ranges like "U+2080 through
U+8207F" and one which architecturally allows non-shortest sequences and
then disallows them is simply a matter of different engineering solutions
to the same problem. Each adds simplicity in one place and complexity in
another. UTF-8 happened to tick more additional boxes (e.g.
self-synchronization) than the others.
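
To make the trade-off concrete, here is a rough Python sketch of both
approaches. The offset scheme is hypothetical: the payload sizes of 7, 13
and 19 bits are assumptions inferred from the ranges quoted in this
thread, not taken from any real UTF. The function names are likewise
made up for illustration. The UTF-8 half relies only on Python's built-in
strict decoder.

```python
# Hypothetical offset-based scheme (an assumption for illustration, not a
# real UTF): each sequence length starts where the previous one ends, so no
# code point can ever have more than one encoding.
def offset_ranges(payload_bits=(7, 13, 19)):
    ranges = []
    start = 0
    for bits in payload_bits:
        end = start + (1 << bits) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# UTF-8 instead stores the code point itself in the payload bits, so a
# longer pattern could also hold a small value; the decoder has to reject
# such non-shortest forms explicitly.
def utf8_decodes(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Self-synchronization in UTF-8: continuation bytes always look like
# 10xxxxxx, so from any offset we can skip forward to the next start byte.
def next_boundary(data: bytes, pos: int) -> int:
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


OVERLONG_SLASH = b"\xc0\xaf"  # U+002F padded into a two-byte pattern

if __name__ == "__main__":
    for first, last in offset_ranges():
        print(f"U+{first:04X} through U+{last:04X}")
    print(utf8_decodes(b"/"), utf8_decodes(OVERLONG_SLASH))
    print(next_boundary("a\u20acb".encode("utf-8"), 2))
```

With those assumed payload sizes, the offset scheme yields U+0000 through
U+007F, U+0080 through U+207F, and U+2080 through U+8207F, with no
overlap and nothing to disallow. The cost is an addition or subtraction
at every length boundary. UTF-8 skips the arithmetic but needs the
explicit check that rejects the overlong two-byte "/".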
Doug Ewell | Thornton, CO, US | ewellic.org