LS and RS (was: Re: Whitespace characters in Unicode)
doug at ewellic.org
Sat Aug 6 13:30:31 CDT 2016
Markus Scherer wrote:
> In hindsight, LS and PS are probably mistakes. When we came up
> with Pattern_White_Space, we still liked the idea of unambiguous
> end-of-line controls, but in practice it looks like no one really uses
> them. Anyone who cares uses markup or rich-text formats. (Markup was
> not common when Unicode was "born".)
I've often felt that the rise of UTF-8 spelled the end for LS and PS.
Unicode was originally a completely new text format, exactly 16 bits per
character. Conversion to ASCII and other byte-based encodings was an
explicit process. Dedicated characters for LS and PS were a
simplification, removing the notorious confusion over CR versus LF.

UTF-8 brought ASCII backward compatibility to Unicode, removing early
objections that "Unicode will double my text size" but requiring
continued use of ASCII controls to maintain that compatibility.
Implementers saw the existing CR/LF/CRLF muddle as a problem already
solved, and LS and PS as new complications with no historical

Additionally, in UTF-8, either LS or PS actually takes more bytes than
CR plus LF, so the "increased text size" argument also discouraged use
of the new controls.
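
To put numbers on the size comparison, here is a minimal Python sketch. The helper name encoded_size is purely illustrative, and "utf-16-be" is used as a stand-in for the original fixed 16-bit form (an assumption; for these characters the two encode identically).

```python
# Compare encoded sizes of the line-break controls.
# encoded_size is a hypothetical helper, not taken from any library.

LS = "\u2028"    # LINE SEPARATOR
PS = "\u2029"    # PARAGRAPH SEPARATOR
CRLF = "\r\n"    # CARRIAGE RETURN followed by LINE FEED


def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


for name, chars in [("LS", LS), ("PS", PS), ("CR+LF", CRLF)]:
    utf8 = encoded_size(chars, "utf-8")
    # "utf-16-be" emits no byte order mark, so only the characters are counted.
    utf16 = encoded_size(chars, "utf-16-be")
    print(f"{name}: {utf8} bytes in UTF-8, {utf16} bytes in 16-bit form")
```

Under these assumptions, LS and PS each come out at 3 bytes in UTF-8 against 2 for CR plus LF, whereas in the 16-bit form a single LS or PS is 2 bytes against 4 for CR plus LF. The size advantage the dedicated controls had in 16-bit Unicode is reversed in UTF-8.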
Doug Ewell | Thornton, CO, US | ewellic.org