Whitespace characters in Unicode

Sun Aug 7 18:08:58 CDT 2016

On 8/5/2016 10:07 AM, Markus Scherer wrote:
> On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard 
> <lists+unicode at seantek.com> wrote:
>
>     What makes a character a "whitespace" in Unicode, e.g., why are
>     ZWSP and ZWNBSP not "whitespace" even though they clearly say
>     "SPACE" in them?
>
>
> I think "white space" basically wants to have an advance width (occupy 
> space) but no ink (all white, no black)  :-)

Yes, that is the thought that I had as well: whitespace characters 
always generate blank space between graphemes, whether horizontal or 
vertical.

>
> ZWSP and ZWNBSP affect word and line breaking but have no advance width.

I suppose that these are "SPACE" characters, but not "WHITE space" 
characters, since there is no white in them. :)
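
A minimal sketch of that distinction, assuming Python 3 and its standard
unicodedata module. The General_Category value is a real Unicode property;
str.isspace() is Python's own notion of whitespace and is only an
approximation of the White_Space property. The sample table and the
describe() helper are hypothetical names used for illustration.

```python
import unicodedata

# A few characters whose names contain "SPACE".
SAMPLES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00A0",
    "EM SPACE": "\u2003",
    "ZERO WIDTH SPACE": "\u200B",
    "ZERO WIDTH NO-BREAK SPACE": "\uFEFF",
}

def describe(ch):
    """Return (General_Category, str.isspace() result) for one character."""
    return unicodedata.category(ch), ch.isspace()

for name, ch in SAMPLES.items():
    category, is_space = describe(ch)
    print(f"U+{ord(ch):04X} {name}: category={category} isspace={is_space}")
```

The spaces that occupy an advance width report category Zs (space
separator), while ZWSP and ZWNBSP report Cf (format), which matches the
"SPACE but not WHITE space" reading above.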

>
> Note that character names can be misleading, plain wrong, or even just 
> misspelled, but they cannot be changed. Best to read the 
> documentation. The charts are a good start:
> http://www.unicode.org/charts/PDF/U2000.pdf
> http://www.unicode.org/charts/PDF/UFE70.pdf
>
> In particular, don't build sets of Unicode characters just based on 
> character name patterns. Use character properties as much as possible.
>
>     What are "Unicode-y" ways to compute word boundaries?
>
>
> http://www.unicode.org/reports/tr29/#Word_Boundaries
>
>     Related to prior question--I suppose ZWSP is not "whitespace", but
>     like whitespace, it separates words. I suppose that since it is
>     not printable, it is "confusing", and therefore should be avoided
>     in contexts where the printed representation of Unicode code
>     points matters.
>
>
> Depends on what you do.
>
> Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping 
> and line breaking in a browser or text field/editor.
>
> They are not allowed in identifiers, and removed from domain names 
> (UTS #46).
>
>     Why is Pattern_White_Space significantly disjoint from
>     White_Space, namely, why does Pattern_White_Space include LRM and
>     RLM (and notably LS and PS) yet omit the spaces U+1680 and in the
>     U+2000 range?
>
>
> We wanted a simple, immutable definition for rule and pattern strings 
> that programmers write and maintain. We included LRM and RLM so that 
> they can be used (and will be ignored) in rules, for example collation 
> rule strings, to keep them moderately readable when they contain RTL 
> characters. Typographic spaces are unnecessary in this context, and 
> could be confusing.
>
> In hindsight, LS and PS are probably mistakes. When we came up 
> with Pattern_White_Space, we still liked the idea of unambiguous 
> end-of-line controls, but in practice it looks like no one really uses 
> them. Anyone who cares uses markup or rich-text formats. (Markup was 
> not common when Unicode was "born".)

I like the premise of LS and PS: one (well, two) unambiguous characters 
to rule them all. But the execution was lacking, to put it mildly. And 
there aren't two keys on a common keyboard to distinguish between line 
and paragraph separation.
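
For reference, a small sketch comparing the two properties discussed in
the quoted exchange above, assuming Python 3. The code point lists are
typed in by hand from the Unicode Character Database (PropList.txt) as I
understand it, not read from a library, so check them against the current
data files before relying on them. The variable names are hypothetical.

```python
# White_Space code points (hand-copied assumption, see lead-in).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))      # TAB, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))    # EN QUAD .. HAIR SPACE
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# Pattern_White_Space code points (hand-copied assumption, see lead-in).
PATTERN_WHITE_SPACE = (
    set(range(0x0009, 0x000E))
    | {0x0020, 0x0085}
    | {0x200E, 0x200F}              # LRM, RLM
    | {0x2028, 0x2029}              # LS, PS
)

def fmt(code_points):
    """Format a set of code points as sorted U+XXXX strings."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]

only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE

print("Only in Pattern_White_Space:", fmt(only_pattern))
print("Only in White_Space:", fmt(only_white))
```

Under these lists, LS and PS belong to both properties; LRM and RLM are
the only Pattern_White_Space members outside White_Space, and the
typographic spaces are the ones left out of Pattern_White_Space.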

On 8/6/2016 11:30 AM, Doug Ewell wrote:
> Additionally, in UTF-8, either LS or PS actually takes more bytes than 
> CR plus LF, so the "increased text size" argument also discouraged use 
> of the new controls.

That is true: in UTF-8, LS and PS each take 3 bytes. However, the original 
UTF-8 proposal encoded U+0080 - U+207F in two octets (see 
https://en.wikipedia.org/wiki/UTF-8):
10xxxxxx 1xxxxxxx

So, the space block /just barely makes it/. Was this intentional during 
the original design of UTF-8, or just a coincidence? I think it was more 
than a coincidence. It is regrettable that the space block was too high 
to fit in two octets in the final version of UTF-8...maybe it should 
have gone at or below U+07FF.
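
A quick sketch of the byte counts, assuming Python 3. utf8_len() measures
the final UTF-8 encoding using the built-in codec. The second function is
a hypothetical helper that only encodes the two-octet range of the
original proposal as quoted above (U+0080 - U+207F); it does not implement
that encoding.

```python
def utf8_len(cp):
    """Number of bytes the code point occupies in final UTF-8."""
    return len(chr(cp).encode("utf-8"))

def fits_two_octets_in_original_proposal(cp):
    """Assumption: two-octet range U+0080..U+207F, as quoted in this message."""
    return 0x0080 <= cp <= 0x207F

LS, PS = 0x2028, 0x2029
crlf_len = len("\r\n".encode("utf-8"))

print(f"CR LF: {crlf_len} bytes in UTF-8")
for cp in (LS, PS, 0x2003, 0x07FF, 0x0800):
    print(
        f"U+{cp:04X}: {utf8_len(cp)} bytes in UTF-8, "
        f"two octets in original proposal: "
        f"{fits_two_octets_in_original_proposal(cp)}"
    )
```

U+07FF is the last code point that final UTF-8 encodes in two bytes, which
is why the U+2000 block ends up at three.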

(More motivation for my whitespace question in following message...)

Sean