Whitespace characters in Unicode

Martin J. Dürst duerst at it.aoyama.ac.jp
Mon Aug 8 02:07:59 CDT 2016


On 2016/08/08 08:08, Sean Leonard wrote:
> On 8/6/2016 11:30 AM, Doug Ewell wrote:
>> Additionally, in UTF-8, either LS or PS actually takes more bytes than
>> CR plus LF, so the "increased text size" argument also discouraged use
>> of the new controls.
>
> That is true, it takes 3 bytes. However, the original UTF-8 proposal

The term "original UTF-8 proposal" is quite misleading, because that 
proposal was never labeled as UTF-8. "FSS-UTF draft version" would be 
much better.

> encoded U+0080 - U+207F in two octets:
> https://en.wikipedia.org/wiki/UTF-8 :
> |10xxxxxx|     |1xxxxxxx|
>
>
> So, the space block /just barely makes it/. Was this intentional during
> the original design of UTF-8, or just a coincidence? I think it was more
> than a coincidence.

Just a coincidence, I'd say. When designing such schemes, trying to be 
compact is obviously one of the goals. But "how can I design it so that 
these two characters still make it as two bytes" isn't.
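
For anyone who wants to check the arithmetic behind "just barely makes it", here is a small sketch. It only counts the payload bits in the two-octet pattern quoted above and compares the result with the two-byte form of UTF-8 as finally standardized. The assumption that the draft's two-octet range starts at U+0080 comes from the quoted range, and the names used are purely illustrative.

```python
# Count payload bits in a byte pattern: every 'x' carries one bit.
def payload_bits(pattern: str) -> int:
    return pattern.count("x")

draft_two_octet = "10xxxxxx 1xxxxxxx"   # pattern quoted in this thread (FSS-UTF draft)
final_two_byte = "110xxxxx 10xxxxxx"    # two-byte form of UTF-8 as standardized

# Assumption (from the quoted range): the draft's two-octet range begins
# at U+0080 and covers as many code points as the payload bits allow.
draft_first = 0x80
draft_last = draft_first + 2 ** payload_bits(draft_two_octet) - 1

# In final UTF-8 the two-byte payload holds the code point value directly.
final_last = 2 ** payload_bits(final_two_byte) - 1

LS, PS = 0x2028, 0x2029  # LINE SEPARATOR, PARAGRAPH SEPARATOR

print(f"draft two-octet range: U+{draft_first:04X}..U+{draft_last:04X}")
print(f"final two-byte limit:  U+{final_last:04X}")
print("LS/PS fit in two octets in the draft:", draft_first <= LS <= PS <= draft_last)
print("LS/PS fit in two bytes in final UTF-8:", PS <= final_last)
```

The 13 payload bits of the draft pattern reach U+207F, while the 11 payload bits of the final two-byte form stop at U+07FF, which is why LS and PS need three bytes today.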

> It is regrettable that the space block was too high
> to work in the final version of UTF-8...maybe it should have gone below
> U+07FF.

There aren't too many line breaks (and usually even fewer paragraph
breaks) in a text, so the overall effect of the encoding length for LS
or PS was really not that much of an issue. The main reason why
they didn't spread was that everybody was already dealing with several
variants of line breaks and didn't want more of these, even at the
prospect of (potentially, eventually, in the very, very long run maybe)
having only a single one.
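
To put a rough number on the encoding-length point, here is a minimal sketch that compares the UTF-8 size of the separators and of a made-up sample text joined with each of them. The sample text and the helper names are hypothetical and chosen only for illustration; real texts will give different ratios.

```python
LS = "\u2028"    # LINE SEPARATOR
PS = "\u2029"    # PARAGRAPH SEPARATOR
CRLF = "\r\n"    # CR plus LF

def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

def encoded_size(lines: list[str], sep: str) -> int:
    """UTF-8 size of the lines joined with the given separator."""
    return utf8_len(sep.join(lines))

# Hypothetical sample: 100 identical lines of plain ASCII text.
lines = ["The quick brown fox jumps over the lazy dog."] * 100

for name, sep in (("CRLF", CRLF), ("LS", LS), ("PS", PS)):
    print(f"{name}: {utf8_len(sep)} bytes per break, "
          f"{encoded_size(lines, sep)} bytes for the sample")

# Relative growth when switching the sample from CRLF to LS.
extra = encoded_size(lines, LS) - encoded_size(lines, CRLF)
print(f"extra bytes with LS: {extra} "
      f"({100 * extra / encoded_size(lines, CRLF):.2f}% of the CRLF size)")
```

The cost is one extra byte per break, so the total overhead depends only on how many breaks the text contains relative to its length.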

Regards,   Martin.

