Whitespace characters in Unicode
Sean Leonard
lists+unicode at seantek.com
Thu Aug 4 15:44:46 CDT 2016
I read through TR18...it mainly says that <space> == \s ==
\p{Whitespace} == property White_Space is true. Does it say anything
else or more significant than that, that I'm missing?
Sean
On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
> What Mark Davis said; also, depending on what you need, consider
> taking a look at the definitions used by Unicode regexes, at
> http://unicode.org/reports/tr18/ .
>
> 2016-08-04 16:37 GMT-03:00 Sean Leonard <lists+unicode at seantek.com>:
>
> Hi Unicode Folks:
>
> I am trying to come up with a sensible set of characters that are
> considered whitespace or newlines in Unicode, and to understand
> the relative stability policy with respect to them. (This is for a
> formal syntax where the definition of "whitespace" matters, e.g.,
> to separate identifiers, and I want to be as conservative as
> possible.) Please let me know if the stuff below is correct, or
> needs work.
>
> The following characters / sequences are considered line breaking
> characters, per UAX #14 and Section 5.8 of the Unicode Standard:
>
> CRLF CR LF FF VT NEL LS PS
>
> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the
> combination U+000D U+000A (treated as one line break). These
> characters / sequences are called "newlines".
>
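A minimal Python sketch of the newline set listed above, assuming the goal is to split text on exactly those characters and to treat CR LF as a single break. The names `NEWLINE_RE` and `split_newlines` are made up for illustration. Note that Python's built-in `str.splitlines()` is not a drop-in equivalent, since it also splits on U+001C through U+001E.

```python
import re

# Line-breaking characters from the list above: LF, VT, FF, CR, NEL, LS, PS.
# The CR LF alternative comes first so the pair counts as one line break.
NEWLINE_RE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_newlines(text):
    """Split text at every newline in the set above (hypothetical helper)."""
    return NEWLINE_RE.split(text)


if __name__ == "__main__":
    # CR LF, a lone CR, and LINE SEPARATOR each end one line.
    print(split_newlines("a\r\nb\rc\u2028d"))
```

Because the alternation tries CR LF before the single-character class, an LF followed by a CR is still counted as two breaks, which matches the "treated as one line break" wording only for the CR LF order.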
> There will not be any additional code points that are assigned to
> be line breaks. (Correct?)
>
> CRLF, CR, LF, and NEL are also considered "newline functions" or
> NLF. These are distinguished from other codes (above) that also
> mean line breaks, mainly because of historical and widespread use
> of them.
>
> There are several formatting characters that affect word wrapping
> and line breaking, as discussed in those documents...but they are
> not line breaking characters.
>
> ****
>
> The following characters are whitespaces: characters (code points)
> with the property WSpace=Y (or White_Space). This is:
>
> newlines
> U+0009 U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>
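For a conservative syntax it may be simplest to hard-code the set rather than rely on a library's idea of "space". The sketch below is my reading of White_Space in PropList.txt (Unicode 6.3 and later), which should be checked against the UCD version being targeted; as far as I can tell it also includes U+0009 (TAB) alongside the newlines. `WHITE_SPACE` and `is_white_space` are illustrative names, not an existing API.

```python
# Code points with White_Space=Yes, as I read PropList.txt
# (assumption: verify against the UCD version you target).
WHITE_SPACE = frozenset(
    list(range(0x0009, 0x000D + 1))      # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]   # SPACE, NEL, NBSP, OGHAM SPACE MARK
    + list(range(0x2000, 0x200A + 1))    # EN QUAD .. HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_white_space(ch):
    """Return True if the single character ch is in the hard-coded set."""
    return ord(ch) in WHITE_SPACE


if __name__ == "__main__":
    print(len(WHITE_SPACE))
    print(is_white_space("\u00a0"), is_white_space("\u200b"))
```

A fixed table like this does not track future assignments, which is the point if the syntax is meant to stay stable across Unicode versions.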
> Assigned characters that are not listed above can never be
> whitespace (according to Unicode). However, the set is not closed,
> so unassigned code points *could* be assigned to whitespace. It is
> (unlikely? very unlikely? Pretty much never going to happen?) that
> additional code points will be assigned to whitespace.
>
> ****
>
> There are some other characters that Unicode does not consider
> whitespace, but deserve discussion:
> U+180E MONGOLIAN VOWEL SEPARATOR:
> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
> U+200B ZERO WIDTH SPACE
> U+200C ZERO WIDTH NON-JOINER
> U+200D ZERO WIDTH JOINER
> U+200E LEFT-TO-RIGHT MARK*
> U+200F RIGHT-TO-LEFT MARK*
> U+2060 WORD JOINER
> U+FEFF ZERO WIDTH NON-BREAKING SPACE
>
> *These appear in Pattern_White_Space, but Pattern_White_Space
> excludes U+2000-200A characters, which are obviously spaces. This
> is confusing and I would appreciate clarification /why/
> Pattern_White_Space is significantly disjoint from White_Space.
>
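To make the overlap concrete, here is a small Python sketch that compares the two properties as plain sets and looks up the General_Category of the borderline characters. The set contents are my reading of PropList.txt and should be verified against the UCD; UAX #31 describes Pattern_White_Space as a fixed set intended for syntax. The helper names (`cps`, `fmt`) are hypothetical, and the categories printed depend on the Unicode version bundled with the Python build.

```python
import unicodedata


def cps(*ranges):
    """Build a set of code points from inclusive (lo, hi) ranges."""
    out = set()
    for lo, hi in ranges:
        out.update(range(lo, hi + 1))
    return out


# Assumed contents of the two properties (check against PropList.txt).
WHITE_SPACE = cps((0x09, 0x0D), (0x20, 0x20), (0x85, 0x85), (0xA0, 0xA0),
                  (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029),
                  (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000))
PATTERN_WHITE_SPACE = cps((0x09, 0x0D), (0x20, 0x20), (0x85, 0x85),
                          (0x200E, 0x200F), (0x2028, 0x2029))

# The borderline characters listed in the message.
BORDERLINE = [0x180E, 0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0x2060, 0xFEFF]


def fmt(code_points):
    """Format code points as sorted U+XXXX strings."""
    return ["U+%04X" % cp for cp in sorted(code_points)]


only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE
shared = WHITE_SPACE & PATTERN_WHITE_SPACE

if __name__ == "__main__":
    print("Pattern_White_Space only:", fmt(only_pattern))
    print("White_Space only:", fmt(only_white))
    print("shared:", fmt(shared))
    for cp in BORDERLINE:
        # General_Category as known to this Python's Unicode database.
        print("U+%04X" % cp, unicodedata.category(chr(cp)))
```

Seen this way the two sets are not disjoint: under these assumptions they share the ASCII controls, SPACE, NEL, LS and PS, and differ only in the non-ASCII spaces on one side and LRM/RLM on the other.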
> ********
> The borderline characters above are not considered WSpace=Y, but
> sometimes might have space-like properties. ZWSP and ZWNBSP are
> obviously "space" characters, but they never generate whitespace.
> I suppose that conversely LRM and RLM are obviously "not space"
> characters, but they could generate whitespace under certain
> circumstances. Ditto for other formatting characters in general
> (for which the class is much larger).
>
> Therefore I guess a Unicode definition of "whitespace" (or "space
> characters") is: an assigned code point that *always* (is supposed
> to) generates white space (empty space between graphemes).
>
> ********
>
> Are there other standards that Unicode people recommend, that have
> addressed whether certain borderline characters are considered
> whitespace vs. non-whitespace (e.g., possibly acceptable as an
> identifier or syntax component)?
>
> Regards,
>
> Sean
>
>