Whitespace characters in Unicode

Andrea Giammarchi andrea.giammarchi at gmail.com
Thu Aug 4 17:36:32 CDT 2016


Actually, my apologies for my instinctive and quite rude answer. I
misunderstood the initial email, thinking Sean was proposing extra
whitespace for clarifications.

I won't react as quickly in the future. Go on with your question, Sean, and I
hope you'll get it right.

Best Regards

On Thu, Aug 4, 2016 at 11:19 PM, Andrea Giammarchi <
andrea.giammarchi at gmail.com> wrote:

> I'm not a Unicode expert, but I couldn't stop thinking about the following
> comic after reading "I am trying to come up with a sensible set of
> characters that are considered whitespace" https://xkcd.com/927/
>
> Apologies for bringing pretty much nothing to this discussion but I'm
> pretty sure there's much more to discuss in this ML than another whitespace
> set on top of 25 characters already.
>
> Thanks for your patience and your understanding.
>
> Have a great weekend everyone!
> Best Regards
>
> On Thu, Aug 4, 2016 at 10:28 PM, Leonardo Boiko <leoboiko at namakajiri.net>
> wrote:
>
>> I'm sorry; I thought that, when you wanted to separate identifiers, it
>> might be interesting to follow existing regexps definitions; this way your
>> syntax would play along with already-existing tools (e.g. you'd be making
>> it easy for someone to pipe your language into grep -P "\p{Whitespace}").
>>
>> But I was talking out of my depth; I've never worked with defining
>> Unicode identifiers, so I'm not really qualified to answer.  I'm sure Davis
>> and the others can give better answers to your questions.  Meanwhile, I see
>> that UAX #31 goes further into Unicode identifiers. It says that
>> Pattern_White_Space is stable (unlike Whitespace, perhaps?), and intended
>> for use in regexp-like "patterns" which mix literal characters, whitespace,
>> and syntax (special characters), where the latter two would e.g. require
>> quoting.  For example, Perl has a "/x" flag which makes unquoted
>> Pattern_White_Space characters be ignored in regexpes (so that you can make
>> them less illegible).
>>
>> However, UAX #31 also gives a Default Identifier Syntax, which bounds
>> identifiers not by Whitespace but by their start characters, identified by
>> ID_Start, defined like this:
>>
>> > ID_Start characters are derived from the Unicode General_Category of
>> > uppercase letters, lowercase letters, titlecase letters, modifier letters,
>> > other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax
>> > and Pattern_White_Space code points.
>>
>> So it makes reference only to Pattern_White_Space and not Whitespace.  On
>> the other hand, I guess the listing above will exclude Whitespace
>> characters, since they don't count as any of letters, numbers, or
>> Other_ID_Start?
>>
>> None of that is guaranteed to be stable, though.  UAX #31 includes a
>> separate definition for "Immutable identifiers", which are, and suggests
>> various compromises between them.
>>
>>
>> 2016-08-04 17:44 GMT-03:00 Sean Leonard <lists+unicode at seantek.com>:
>>
>>> I read through TR18...it mainly says that <space> == \s ==
>>> \p{Whitespace} == property White_Space is true. Does it say anything else
>>> or more significant than that, that I'm missing?
>>>
>>> Sean
>>>
>>>
>>> On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
>>>
>>> What Mark Davis said; also, depending on what you need, consider taking
>>> a look at the definitions used by Unicode regexpes, at
>>> http://unicode.org/reports/tr18/ .
>>>
>>> 2016-08-04 16:37 GMT-03:00 Sean Leonard <lists+unicode at seantek.com>:
>>>
>>>> Hi Unicode Folks:
>>>>
>>>> I am trying to come up with a sensible set of characters that are
>>>> considered whitespace or newlines in Unicode, and to understand the
>>>> relative stability policy with respect to them. (This is for a formal
>>>> syntax where the definition of "whitespace" matters, e.g., to separate
>>>> identifiers, and I want to be as conservative as possible.) Please let me
>>>> know if the stuff below is correct, or needs work.
>>>>
>>>> The following characters / sequences are considered line breaking
>>>> characters, per UAX #14 and Section 5.8 of the Unicode Standard:
>>>>
>>>> CRLF CR LF FF VT NEL LS PS
>>>>
>>>> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the
>>>> combination U+000D U+000A (treated as one line break). These characters /
>>>> sequences are called "newlines".
>>>>
>>>> There will not be any additional code points that are assigned to be
>>>> line breaks. (Correct?)
>>>>
>>>> CRLF, CR, LF, and NEL are also considered "newline functions" or NLF.
>>>> These are distinguished from other codes (above) that also mean line
>>>> breaks, mainly because of historical and widespread use of them.
>>>>
>>>> There are several formatting characters that affect word wrapping and
>>>> line breaking, as discussed in those documents...but they are not line
>>>> breaking characters.
>>>>
>>>> ****
>>>>
>>>> The following characters are whitespaces: characters (code points) with
>>>> the property WSpace=Y (or White_Space). This is:
>>>>
>>>> newlines
>>>> U+0009 U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>>>>
>>>> Assigned characters that are not listed above, can never be whitespace
>>>> (according to Unicode). However, the set is not closed, so unassigned code
>>>> points *could* be assigned to whitespace. It is (unlikely? very unlikely?
>>>> Pretty much never going to happen?) that additional code points will be
>>>> assigned to whitespace.
>>>>
>>>> ****
>>>>
>>>> There are some other characters that Unicode does not consider
>>>> whitespace, but deserve discussion:
>>>> U+180E MONGOLIAN VOWEL SEPARATOR:
>>>> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
>>>> U+200B ZERO WIDTH SPACE
>>>> U+200C ZERO WIDTH NON-JOINER
>>>> U+200D ZERO WIDTH JOINER
>>>> U+200E LEFT-TO-RIGHT MARK*
>>>> U+200F RIGHT-TO-LEFT MARK*
>>>> U+2060 WORD JOINER
>>>> U+FEFF ZERO WIDTH NO-BREAK SPACE
>>>>
>>>> *These appear in Pattern_White_Space, but Pattern_White_Space excludes
>>>> U+2000-200A characters, which are obviously spaces. This is confusing and I
>>>> would appreciate clarification *why* Pattern_White_Space is
>>>> significantly disjoint from White_Space.
>>>>
>>>> ********
>>>> The borderline characters above are not considered WSpace=Y, but
>>>> sometimes might have space-like properties. ZWSP and ZWNBSP are obviously
>>>> "space" characters, but they never generate whitespace. I suppose that
>>>> conversely LRM and RLM are obviously "not space" characters, but they
>>>> could generate whitespace under certain circumstances. Ditto for other
>>>> formatting characters in general (for which the class is much larger).
>>>>
>>>> Therefore I guess a Unicode definition of "whitespace" (or "space
>>>> characters") is: an assigned code point that *always* (is supposed to)
>>>> generates white space (empty space between graphemes).
>>>>
>>>> ********
>>>>
>>>> Are there other standards that Unicode people recommend, that have
>>>> addressed whether certain borderline characters are considered whitespace
>>>> vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax
>>>> component)?
>>>>
>>>> Regards,
>>>>
>>>> Sean
>>>>
>>>
>>>
>>>
>>
>
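
For readers following the thread, here is a minimal Python sketch of how the
two properties discussed above relate. It is not taken from any of the
messages. The code point lists are hand-transcribed from PropList.txt as it
stood around the time of this thread, so they are an assumption to check
against the current Unicode Character Database. The names expand, fmt and
BORDERLINE are made up for illustration.

```python
def expand(*ranges):
    """Expand inclusive (first, last) code point ranges into a frozenset."""
    return frozenset(cp for first, last in ranges for cp in range(first, last + 1))


def fmt(code_points):
    """Render code points as a sorted 'U+XXXX U+YYYY ...' string."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(code_points))


# White_Space (WSpace=Y), hand-transcribed from PropList.txt (assumption).
WHITE_SPACE = expand(
    (0x0009, 0x000D),  # TAB, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEL
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x2000, 0x200A),  # EN QUAD .. HAIR SPACE
    (0x2028, 0x2029),  # LINE SEPARATOR, PARAGRAPH SEPARATOR
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
)

# Pattern_White_Space, hand-transcribed from PropList.txt (assumption).
PATTERN_WHITE_SPACE = expand(
    (0x0009, 0x000D),
    (0x0020, 0x0020),
    (0x0085, 0x0085),
    (0x200E, 0x200F),  # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
    (0x2028, 0x2029),
)

# The "borderline" characters listed in Sean's message (illustrative name).
BORDERLINE = frozenset(
    {0x180E, 0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0x2060, 0xFEFF}
)

if __name__ == "__main__":
    print("White_Space size:        ", len(WHITE_SPACE))
    print("Pattern_White_Space size:", len(PATTERN_WHITE_SPACE))
    print("In both:                 ", fmt(WHITE_SPACE & PATTERN_WHITE_SPACE))
    print("Pattern_White_Space only:", fmt(PATTERN_WHITE_SPACE - WHITE_SPACE))
    print("White_Space only:        ", fmt(WHITE_SPACE - PATTERN_WHITE_SPACE))
    print("Borderline, White_Space: ", fmt(BORDERLINE & WHITE_SPACE) or "(none)")
```

If the transcription is right, the two sets overlap rather than being
disjoint: they share the ASCII controls, SPACE, NEL, LS and PS, and differ
in the typographic spaces on one side and LRM/RLM on the other. A formal
syntax that wants a closed, stable separator set could test membership in
PATTERN_WHITE_SPACE directly instead of relying on a regex engine's \s.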