Whitespace characters in Unicode

Thu Aug 4 14:37:14 CDT 2016

Hi Unicode Folks:

I am trying to come up with a sensible set of characters that are 
considered whitespace or newlines in Unicode, and to understand the 
relevant stability policy with respect to them. (This is for a formal 
syntax where the definition of "whitespace" matters, e.g., to separate 
identifiers, and I want to be as conservative as possible.) Please let 
me know whether the material below is correct or needs work.

The following characters / sequences are considered line breaking 
characters, per UAX #14 and Section 5.8 of the Unicode Standard:

CRLF CR LF FF VT NEL LS PS

So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the 
combination U+000D U+000A (treated as one line break). These characters 
/ sequences are called "newlines".
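
To make the list above concrete, here is a minimal Python sketch that 
splits text at exactly these newlines. The names NEWLINE and 
split_lines are made up for illustration. Python's built-in 
str.splitlines() is deliberately not used, because it additionally 
breaks on U+001C-U+001E, which are not in the list above.

```python
import re

# Newline sequences from the list above. CRLF comes first so that
# U+000D U+000A is consumed as one line break rather than two.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_lines(text):
    """Split text at every newline sequence, dropping the newlines."""
    return NEWLINE.split(text)
```

For example, split_lines("a\r\nb") yields two lines, not three.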

There will not be any additional code points that are assigned to be 
line breaks. (Correct?)

CRLF, CR, LF, and NEL are also considered "newline functions" or NLF. 
These are distinguished from the other codes (above) that also mean 
line breaks, mainly because of their historical and widespread use.

There are several formatting characters that affect word wrapping and 
line breaking, as discussed in those documents...but they are not line 
breaking characters.

****

The following characters are whitespace: characters (code points) with 
the property WSpace=Y (or White_Space). These are:

newlines
U+0009 U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
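
As a minimal sketch, the set can be written out in Python as below. It 
is hardcoded from my reading of the White_Space property rather than 
loaded from the Unicode Character Database, and the names WHITE_SPACE 
and is_white_space are illustrative only. Note that the table also 
carries U+0009 CHARACTER TABULATION, which has WSpace=Y but is not a 
newline.

```python
# Code points with White_Space=Y, hardcoded (an assumption of this
# sketch; a real implementation should read PropList.txt instead).
WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E),   # TAB, LF, VT, FF, CR
    0x0020, 0x0085, 0x00A0, 0x1680,
    *range(0x2000, 0x200B),   # U+2000-200A
    0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
])

def is_white_space(ch):
    """Return True if the single character ch is in the set above."""
    return ord(ch) in WHITE_SPACE
```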

Assigned characters that are not listed above can never be whitespace 
(according to Unicode). However, the set is not closed, so unassigned 
code points *could* be assigned to whitespace. It is (unlikely? very 
unlikely? pretty much never going to happen?) that additional code 
points will be assigned to whitespace.

****

There are some other characters that Unicode does not consider 
whitespace, but that deserve discussion:
U+180E MONGOLIAN VOWEL SEPARATOR: 
<https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+200E LEFT-TO-RIGHT MARK*
U+200F RIGHT-TO-LEFT MARK*
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE

*These appear in Pattern_White_Space, but Pattern_White_Space excludes 
the U+2000-200A characters, which are obviously spaces. This is 
confusing, and I would appreciate clarification of /why/ 
Pattern_White_Space differs so significantly from White_Space.
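
The overlap between the two properties can be checked with a short 
Python sketch. Both sets are hardcoded from my reading of the property 
lists (not loaded from the Unicode Character Database), and all names 
here are illustrative.

```python
# Hardcoded property sets (assumptions of this sketch).
WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680,
    *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
])
PATTERN_WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E), 0x0020, 0x0085,
    0x200E, 0x200F, 0x2028, 0x2029,
])

def fmt(code_points):
    """Render a set of code points as sorted U+XXXX strings."""
    return ["U+%04X" % cp for cp in sorted(code_points)]

only_pattern = fmt(PATTERN_WHITE_SPACE - WHITE_SPACE)
only_white = fmt(WHITE_SPACE - PATTERN_WHITE_SPACE)
common = fmt(WHITE_SPACE & PATTERN_WHITE_SPACE)
```

With these tables, only_pattern holds just U+200E and U+200F, while 
only_white holds the sixteen space characters outside the ASCII and 
newline range.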

********
The borderline characters above are not considered WSpace=Y, but 
sometimes might have space-like properties. ZWSP and ZWNBSP are 
obviously "space" characters by name, but they never generate 
whitespace. I suppose that conversely LRM and RLM are obviously "not 
space" characters, but they could generate whitespace under certain 
circumstances. Ditto for other formatting characters in general (for 
which the class is much larger).
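
One way to see how these borderline characters are classified is to 
look up their General_Category with Python's unicodedata module. This 
is only a sketch: the result depends on the Unicode version bundled 
with the Python build (U+180E was Zs before Unicode 6.3 and Cf 
afterwards), and BORDERLINE and describe are illustrative names.

```python
import unicodedata

# The borderline characters discussed above.
BORDERLINE = ["\u180e", "\u200b", "\u200c", "\u200d",
              "\u200e", "\u200f", "\u2060", "\ufeff"]

def describe(ch):
    """Return (code point, General_Category, name) for one character."""
    return ("U+%04X" % ord(ch),
            unicodedata.category(ch),
            unicodedata.name(ch))

# Format characters report Cf; a true space such as U+0020 reports Zs.
categories = {describe(ch)[0]: describe(ch)[1] for ch in BORDERLINE}
```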

Therefore I guess a Unicode definition of "whitespace" (or "space 
characters") is: an assigned code point that *always* generates (or is 
supposed to generate) white space (empty space between graphemes).

********

Are there other standards that Unicode people recommend, that have 
addressed whether certain borderline characters are considered 
whitespace vs. non-whitespace (e.g., possibly acceptable as an 
identifier or syntax component)?

Regards,

Sean
