Whitespace characters in Unicode

Fri Aug 5 10:52:56 CDT 2016

Here are some specific questions (perhaps for Mark Davis, but really 
anyone with experience can respond):

As Mark said, there are 25 whitespace characters. (My original post 
omitted HT, U+0009; adding it brings my list to 25.)

What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and 
ZWNBSP not "whitespace" even though they clearly say "SPACE" in them?
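
A minimal sketch of one angle on this question, not an authoritative 
answer: the character name does not decide the matter. White_Space is its 
own property (listed in PropList.txt), and Python's standard unicodedata 
module does not expose it, so the snippet below only prints the 
General_Category, which is related but not the same thing. The labels in 
the dictionary are just for display.

```python
import unicodedata

# Code points taken from this thread; the keys are display labels only.
candidates = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00A0",
    "ZWSP": "\u200B",
    "WORD JOINER": "\u2060",
    "ZWNBSP": "\uFEFF",
}

def general_categories(chars):
    """Map each label to the General_Category of its character."""
    return {label: unicodedata.category(ch) for label, ch in chars.items()}

categories = general_categories(candidates)
for label, cat in categories.items():
    print(f"{label:15} {cat}")
```

The two ordinary spaces come out as Zs (space separator), while ZWSP, 
WORD JOINER, and ZWNBSP come out as Cf (format), despite "SPACE" 
appearing in two of their names.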

What are "Unicode-y" ways to compute word boundaries? Related to prior 
question--I suppose ZWSP is not "whitespace", but like whitespace, it 
separates words. I suppose that since it is not printable, it is 
"confusing", and therefore should be avoided in contexts where the 
printed representation of Unicode code points matters.
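
To illustrate the gap between "whitespace" and "separates words", here is 
a small sketch. It is not an implementation of the UAX #29 word boundary 
rules, and Python's own notion of whitespace is not identical to 
White_Space; the separator set in the second split is a hypothetical 
choice made only for this example.

```python
import re

ZWSP = "\u200B"
text = f"foo{ZWSP}bar baz"

# str.split() with no argument splits only on what Python treats as
# whitespace, so the ZWSP stays inside the first token.
whitespace_tokens = text.split()

# Hypothetical separator set for illustration: whitespace plus ZWSP.
word_tokens = re.split(rf"[\s{ZWSP}]+", text)

print([t.encode("unicode_escape").decode("ascii") for t in whitespace_tokens])
print(word_tokens)
```

A whitespace-only tokenizer sees two tokens here, while one that also 
treats ZWSP as a separator sees three, even though both strings print 
identically.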

Why does Pattern_White_Space differ so much from White_Space? Namely, 
why does Pattern_White_Space include LRM and RLM (and notably LS and PS) 
yet omit the spaces U+1680 and those in the U+2000 range?
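
To make the comparison concrete, here is a sketch that writes out both 
sets by hand and diffs them. The ranges are my own transcription of the 
two properties as I understand them at the time of this thread, so treat 
them as an assumption and check them against PropList.txt; the helper 
names expand and fmt are made up for this example.

```python
def expand(*ranges):
    """Expand inclusive (first, last) code point ranges into a set."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}

# Hand-transcribed; verify against PropList.txt before relying on this.
WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
)
PATTERN_WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
)

def fmt(cps):
    """Render a set of code points as sorted U+XXXX labels."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(cps))

only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE
shared = PATTERN_WHITE_SPACE & WHITE_SPACE

print("Only in Pattern_White_Space:", fmt(only_pattern))
print("Only in White_Space:        ", fmt(only_white))
print("In both:                    ", fmt(shared))
```

Under that transcription the sets overlap in the ASCII controls, SPACE, 
NEL, LS, and PS. Pattern_White_Space adds only LRM and RLM, and it leaves 
out U+00A0, U+202F, U+205F, and U+3000 in addition to U+1680 and 
U+2000-200A.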

Any implementation experience from other standards authors/implementers 
who have run into problems with shifty whitespace definitions?

Regards,

Sean

On 8/4/2016 2:28 PM, Leonardo Boiko wrote:
> I'm sorry; I thought that, when you wanted to separate identifiers, it 
> might be interesting to follow existing regexps definitions; this way 
> your syntax would play along with already-existing tools (e.g. you'd 
> be making it easy for someone to pipe your language into grep -P 
> "\p{Whitespace}").
>
> But I was talking out of my depth; I've never worked with defining 
> Unicode identifiers, so I'm not really qualified to answer.  I'm sure 
> Davis and the others can give better answers to your questions.  
> Meanwhile, I see that UAX #31 goes further into Unicode identifiers. 
> It says that Pattern_White_Space is stable (unlike Whitespace, 
> perhaps?), and intended for use in regexp-like "patterns" which mix 
> literal characters, whitespace, and syntax (special characters), where 
> the latter two would e.g. require quoting.  For example, Perl has a 
> "/x" flag which makes unquoted Pattern_White_Space characters be 
> ignored in regexpes (so that you can make them less illegible).
>
> However, UAX #31 also gives a Default Identifier Syntax, which 
> bounds identifiers not by Whitespace but by their start characters, 
> identified by ID_Start, defined like this:
>
>     ID_Start characters are derived from the Unicode General_Category
>     of uppercase letters, lowercase letters, titlecase letters, modifier
>     letters, other letters, letter numbers, plus Other_ID_Start, minus
>     Pattern_Syntax and Pattern_White_Space code points.
>
> So it makes reference only to Pattern_White_Space and not Whitespace.  
> On the other hand, I guess the listing above will exclude Whitespace 
> characters, since they don't count as any of letters, numbers, or 
> Other_ID_Start?
>
> None of that is guaranteed to be stable, though.  UAX #31 includes a 
> separate definition for "Immutable identifiers", which are, and 
> suggests various compromises between them.
>
>
> 2016-08-04 17:44 GMT-03:00 Sean Leonard <lists+unicode at seantek.com>:
>
>     I read through TR18...it mainly says that <space> == \s ==
>     \p{Whitespace} == property White_Space is true. Does it say
>     anything else or more significant than that, that I'm missing?
>
>     Sean
>
>
>     On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
>>     What Mark Davis said; also, depending on what you need, consider
>>     taking a look at the definitions used by Unicode regexpes, at
>>     http://unicode.org/reports/tr18/ .
>>
>>     2016-08-04 16:37 GMT-03:00 Sean Leonard
>>     <lists+unicode at seantek.com>:
>>
>>         Hi Unicode Folks:
>>
>>         I am trying to come up with a sensible set of characters
>>         that are considered whitespace or newlines in Unicode, and to
>>         understand the relative stability policy with respect to
>>         them. (This is for a formal syntax where the definition of
>>         "whitespace" matters, e.g., to separate identifiers, and I
>>         want to be as conservative as possible.) Please let me know
>>         if the stuff below is correct, or needs work.
>>
>>         The following characters / sequences are considered line
>>         breaking characters, per UAX #14 and Section 5.8 of UNICODE:
>>
>>         CRLF CR LF FF VT NEL LS PS
>>
>>         So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the
>>         combination U+000D U+000A (treated as one line break). These
>>         characters / sequences are called "newlines".
>>
>>         There will not be any additional code points that are
>>         assigned to be line breaks. (Correct?)
>>
>>         CRLF, CR, LF, and NEL are also considered "newline functions"
>>         or NLF. These are distinguished from other codes (above) that
>>         also mean line breaks, mainly because of historical and
>>         widespread use of them.
>>
>>         There are several formatting characters that affect word
>>         wrapping and line breaking, as discussed in those
>>         documents...but they are not line breaking characters.
>>
>>         ****
>>
>>         The following characters are whitespaces: characters (code
>>         points) with the property WSpace=Y (or White_Space). This is:
>>
>>         newlines
>>         U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>>
>>         Assigned characters that are not listed above, can never be
>>         whitespace (according to Unicode). However, the set is not
>>         closed, so unassigned code points *could* be assigned to
>>         whitespace. It is (unlikely? very unlikely? Pretty much never
>>         going to happen?) that additional code points will be
>>         assigned to whitespace.
>>
>>         ****
>>
>>         There are some other characters that Unicode does not
>>         consider whitespace, but deserve discussion:
>>         U+180E MONGOLIAN VOWEL SEPARATOR:
>>         <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
>>         U+200B ZERO WIDTH SPACE
>>         U+200C ZERO WIDTH NON-JOINER
>>         U+200D ZERO WIDTH JOINER
>>         U+200E LEFT-TO-RIGHT MARK*
>>         U+200F RIGHT-TO-LEFT MARK*
>>         U+2060 WORD JOINER
>>         U+FEFF ZERO WIDTH NON-BREAKING SPACE
>>
>>         *These appear in Pattern_White_Space, but Pattern_White_Space
>>         excludes U+2000-200A characters, which are obviously spaces.
>>         This is confusing and I would appreciate clarification /why/
>>         Pattern_White_Space is significantly disjoint from White_Space.
>>
>>         ********
>>         The borderline characters above are not considered WSpace=Y,
>>         but sometimes might have space-like properties. ZWSP and ZWNBSP
>>         are obviously "space" characters, but they never generate
>>         whitespace. I suppose that conversely LRM and RLM are
>>         obviously "not space" characters, but they could generate
>>         whitespace under certain circumstances. Ditto for other
>>         formatting characters in general (for which the class is much
>>         larger).
>>
>>         Therefore I guess a Unicode definition of "whitespace" (or
>>         "space characters") is: an assigned code point that is
>>         *always* supposed to generate white space (empty space
>>         between graphemes).
>>
>>         ********
>>
>>         Are there other standards that Unicode people recommend, that
>>         have addressed whether certain borderline characters are
>>         considered whitespace vs. non-whitespace (e.g., possibly
>>         acceptable as an identifier or syntax component)?
>>
>>         Regards,
>>
>>         Sean
>>
>>
>
>
