Whitespace characters in Unicode
Martin J. Dürst
duerst at it.aoyama.ac.jp
Mon Aug 8 02:07:59 CDT 2016
On 2016/08/08 08:08, Sean Leonard wrote:
> On 8/6/2016 11:30 AM, Doug Ewell wrote:
>> Additionally, in UTF-8, either LS or PS actually takes more bytes than
>> CR plus LF, so the "increased text size" argument also discouraged use
>> of the new controls.
> That is true, it takes 3 bytes. However, the original UTF-8 proposal
The term "original UTF-8 proposal" is quite misleading, because that
proposal was never labeled as UTF-8. "FSS-UTF draft version" would be
> encoded U+0080 - U+207F in two octets:
> https://en.wikipedia.org/wiki/UTF-8 :
> |10xxxxxx| |1xxxxxxx|
> So, the space block /just barely makes it/. Was this intentional during
> the original design of UTF-8, or just a coincidence? I think it was more
> than a coincidence.
Just a coincidence, I'd say. When designing such schemes, trying to be
compact is obviously one of the goals. But "how can I design it so that
these two characters still make it as two bytes" isn't.
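
For what it's worth, the arithmetic behind the quoted range can be checked with a short sketch. It assumes the two-octet form simply offsets its 13 payload bits from U+0080, which is consistent with the U+0080 - U+207F range quoted above; the variable names are purely illustrative.

```python
# Two-octet bit pattern as quoted above for the FSS-UTF draft.
pattern = "10xxxxxx 1xxxxxxx"
payload_bits = pattern.count("x")        # 6 + 7 payload bits

# Assumption: the range starts at U+0080 and covers 2**payload_bits values.
first = 0x80
last = first + 2 ** payload_bits - 1

LS, PS = 0x2028, 0x2029                  # LINE SEPARATOR, PARAGRAPH SEPARATOR

# Would LS and PS have fit into two octets under that draft?
fits_in_two_octets = {hex(cp): first <= cp <= last for cp in (LS, PS)}

# Byte lengths in UTF-8 as it exists today.
utf8_lengths = {
    name: len(text.encode("utf-8"))
    for name, text in (("CRLF", "\r\n"), ("LS", "\u2028"), ("PS", "\u2029"))
}

print(hex(last), fits_in_two_octets, utf8_lengths)
```

Under that assumption both characters land inside the two-octet range, but only with a margin of a few dozen code points, whereas in today's UTF-8 each needs one byte more than CR plus LF.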
> It is regrettable that the space block was too high
> to work in the final version of UTF-8...maybe it should have gone below
There aren't too many line breaks (and usually even fewer paragraph
breaks) in a text, so the overall effect of the encoding length for LS
or PS was really not that much of an issue. The main reason
they didn't spread was that everybody was already dealing with several
variants of line breaks and didn't want more of these, even at the
prospect of (potentially, eventually, in the very, very long run maybe)
having only a single one.
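
To give a rough feel for the size effect, here is a small sketch that compares the UTF-8 size of the same text with CR LF and with LS as the line terminator. The sample text (100 lines of 60 ASCII characters) and the helper name size_with_separator are made up for illustration; real text will differ.

```python
def size_with_separator(lines, separator):
    """Return the UTF-8 byte count of the lines joined by the separator."""
    return len(separator.join(lines).encode("utf-8"))

# Hypothetical sample: 100 lines of 60 ASCII characters each.
lines = ["x" * 60] * 100

crlf_bytes = size_with_separator(lines, "\r\n")     # 2 bytes per line break
ls_bytes = size_with_separator(lines, "\u2028")     # 3 bytes per line break

extra_bytes = ls_bytes - crlf_bytes
overhead_percent = 100 * extra_bytes / crlf_bytes

print(crlf_bytes, ls_bytes, extra_bytes, round(overhead_percent, 2))
```

For this made-up sample the difference is one extra byte per line break, under two percent of the total.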