LS and RS (was: Re: Whitespace characters in Unicode)

Doug Ewell doug at ewellic.org
Sat Aug 6 13:30:31 CDT 2016


Markus Scherer wrote:

> In hindsight, LS and PS are probably mistakes. When we came up
> with Pattern_White_Space, we still liked the idea of unambiguous
> end-of-line controls, but in practice it looks like no one really uses
> them. Anyone who cares uses markup or rich-text formats. (Markup was
> not common when Unicode was "born".)

I've often felt that the rise of UTF-8 spelled the end for LS and PS.

Unicode was originally a completely new text format, exactly 16 bits per 
character. Conversion to ASCII and other byte-based encodings was an 
explicit process. Dedicated characters for LS and PS were a 
simplification, removing the notorious confusion over CR versus LF 
versus CRLF.

UTF-8 brought ASCII backward compatibility to Unicode, removing early 
objections that "Unicode will double my text size" but requiring 
continued use of ASCII controls to maintain that compatibility. 
Implementers saw the existing CR/LF/CRLF muddle as a problem already 
solved, and LS and PS as new complications with no historical 
justification.

Additionally, in UTF-8, either LS or PS takes three bytes, one more 
than CR plus LF combined, so the "increased text size" argument also 
discouraged use of the new controls.
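
The byte counts are easy to check. Below is a minimal Python sketch, 
assuming nothing beyond the standard codecs; the names BREAKS, 
encoded_sizes, and SIZES are illustrative only. UTF-16 is included to 
show why LS and PS cost nothing extra in the original 16-bit design.

```python
# Encoded size of each line-break sequence, in bytes.
# BREAKS, encoded_sizes and SIZES are illustrative names, not a library API.
BREAKS = {
    "LF": "\n",        # U+000A
    "CR": "\r",        # U+000D
    "CRLF": "\r\n",    # U+000D U+000A
    "LS": "\u2028",    # LINE SEPARATOR
    "PS": "\u2029",    # PARAGRAPH SEPARATOR
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


SIZES = {name: encoded_sizes(seq) for name, seq in BREAKS.items()}

for name, (utf8, utf16) in SIZES.items():
    print(f"{name:4}  UTF-8: {utf8}  UTF-16: {utf16}")
```

In UTF-16 a single LS or PS is half the size of CRLF; in UTF-8 the 
relationship flips, since any code point from U+0800 through U+FFFF 
needs three bytes.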

--
Doug Ewell | Thornton, CO, US | ewellic.org


