Whitespace characters in Unicode

Sean Leonard lists+unicode at seantek.com
Thu Aug 4 15:44:46 CDT 2016


I read through TR18. It mainly says that <space> == \s ==
\p{Whitespace}, i.e., the property White_Space is true. Does it say
anything else, or anything more significant, that I'm missing?
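
One way to sanity-check that reading against a concrete engine, as a rough sketch only: Python's re module does not claim TR18 conformance, and the WHITE_SPACE set below is copied by hand from PropList.txt, so both are assumptions to re-verify against the Unicode version in use. The script lists where Python's \s and White_Space disagree.

```python
import re
import sys

# White_Space=Yes code points, hand-copied from PropList.txt
# (assumption: Unicode 6.3 or later, where U+180E is no longer included).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))      # TAB, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))    # EN QUAD .. HAIR SPACE
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# Every code point that Python's \s matches in a str (not bytes) pattern.
SPACE_RE = re.compile(r"\s")
py_space = {
    cp for cp in range(sys.maxunicode + 1) if SPACE_RE.fullmatch(chr(cp))
}

missing = sorted(WHITE_SPACE - py_space)   # White_Space that \s does not match
extra = sorted(py_space - WHITE_SPACE)     # matched by \s but not White_Space

print("White_Space but not \\s:", [f"U+{cp:04X}" for cp in missing])
print("\\s but not White_Space:", [f"U+{cp:04X}" for cp in extra])
```

On recent CPython builds the second list should show the information separators around U+001C, which Python treats as whitespace even though they are not White_Space. So even a "\s == White_Space" engine needs checking before a formal syntax relies on it.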

Sean

On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
> What Mark Davis said; also, depending on what you need, consider
> taking a look at the definitions used by Unicode regexes, at
> http://unicode.org/reports/tr18/ .
>
> 2016-08-04 16:37 GMT-03:00 Sean Leonard <lists+unicode at seantek.com>:
>
>     Hi Unicode Folks:
>
>     I am trying to come up with a sensible set of characters that are
>     considered whitespace or newlines in Unicode, and to understand
>     the stability policy that applies to them. (This is for a
>     formal syntax where the definition of "whitespace" matters, e.g.,
>     to separate identifiers, and I want to be as conservative as
>     possible.) Please let me know if the stuff below is correct, or
>     needs work.
>
>     The following characters / sequences are considered line breaking
>     characters, per UAX #14 and Section 5.8 of the Unicode Standard:
>
>     CRLF CR LF FF VT NEL LS PS
>
>     So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the
>     combination U+000D U+000A (treated as one line break). These
>     characters / sequences are called "newlines".
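
A minimal sketch of splitting on exactly this set, assuming Python; the names NEWLINE and split_lines are hypothetical and only for illustration. The CRLF alternative is listed first so that U+000D U+000A counts as one break rather than two.

```python
import re

# The "newlines" listed above: CRLF first (so it is a single break),
# then LF, VT, FF, CR, NEL, LS, PS as individual characters.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text at each newline; CR LF is treated as one line break."""
    return NEWLINE.split(text)


sample = "a\r\nb\rc\nd\x85e\u2028f\u2029g\x0bh\x0ci"
print(split_lines(sample))
```

Python's built-in str.splitlines() is close but not identical: it also breaks on U+001C..U+001E, which are not in the list above, so an explicit pattern seems the more conservative choice for a formal syntax.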
>
>     There will not be any additional code points that are assigned to
>     be line breaks. (Correct?)
>
>     CRLF, CR, LF, and NEL are also considered "newline functions" or
>     NLF. These are distinguished from other codes (above) that also
>     mean line breaks, mainly because of historical and widespread use
>     of them.
>
>     There are several formatting characters that affect word wrapping
>     and line breaking, as discussed in those documents...but they are
>     not line breaking characters.
>
>     ****
>
>     The following characters are whitespaces: characters (code points)
>     with the property WSpace=Y (or White_Space). This is:
>
>     newlines
>     U+0009 U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>
>     Assigned characters that are not listed above can never be
>     whitespace (according to Unicode). However, the set is not closed,
>     so unassigned code points *could* be assigned to whitespace. It is
>     (unlikely? very unlikely? pretty much never going to happen?) that
>     additional code points will be assigned to whitespace.
>
>     ****
>
>     There are some other characters that Unicode does not consider
>     whitespace, but deserve discussion:
>     U+180E MONGOLIAN VOWEL SEPARATOR:
>     <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
>     U+200B ZERO WIDTH SPACE
>     U+200C ZERO WIDTH NON-JOINER
>     U+200D ZERO WIDTH JOINER
>     U+200E LEFT-TO-RIGHT MARK*
>     U+200F RIGHT-TO-LEFT MARK*
>     U+2060 WORD JOINER
>     U+FEFF ZERO WIDTH NON-BREAKING SPACE
>
>     *These appear in Pattern_White_Space, but Pattern_White_Space
>     excludes U+2000-200A characters, which are obviously spaces. This
>     is confusing and I would appreciate clarification /why/
>     Pattern_White_Space is significantly disjoint from White_Space.
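
To make the overlap concrete, here is a small sketch in Python. Both sets are copied by hand from PropList.txt (an assumption to re-check against the Unicode version in use; the White_Space set here includes U+0009), and the helper name fmt is hypothetical.

```python
# White_Space=Yes, hand-copied from PropList.txt (Unicode 6.3 or later).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# Pattern_White_Space=Yes, hand-copied from PropList.txt.
PATTERN_WHITE_SPACE = (
    set(range(0x0009, 0x000E))
    | {0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029}
)


def fmt(code_points):
    """Render a set of code points as sorted U+XXXX labels."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(code_points))


print("in both:                 ", fmt(WHITE_SPACE & PATTERN_WHITE_SPACE))
print("White_Space only:        ", fmt(WHITE_SPACE - PATTERN_WHITE_SPACE))
print("Pattern_White_Space only:", fmt(PATTERN_WHITE_SPACE - WHITE_SPACE))
```

Under these assumptions the two properties share the ASCII controls, SPACE, NEL, LS, and PS. Pattern_White_Space adds only LRM and RLM, while White_Space adds the no-break and typographic spaces.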
>
>     ********
>     The borderline characters above are not considered WSpace=Y, but
>     sometimes might have space-like properties. ZWSP and ZWNBSP are
>     obviously "space" characters, but they never generate whitespace.
>     I suppose that conversely LRM and RLM are obviously "not space"
>     characters, but they could generate whitespace under certain
>     circumstances. Ditto for other formatting characters in general
>     (for which the class is much larger).
>
>     Therefore I guess a Unicode definition of "whitespace" (or "space
>     characters") is: an assigned code point that *always* generates
>     (or is supposed to generate) white space (empty space between
>     graphemes).
>
>     ********
>
>     Are there other standards that Unicode people recommend, that have
>     addressed whether certain borderline characters are considered
>     whitespace vs. non-whitespace (e.g., possibly acceptable as an
>     identifier or syntax component)?
>
>     Regards,
>
>     Sean
>
>
