Whitespace characters in Unicode
lists+unicode at seantek.com
Sun Aug 7 18:46:27 CDT 2016
On 8/5/2016 10:07 AM, Markus Scherer wrote:
> On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard
> <lists+unicode at seantek.com> wrote:
>> What makes a character a "whitespace" in Unicode, e.g., why are
>> ZWSP and ZWNBSP not "whitespace" even though they clearly say
>> "SPACE" in them?
>>
>> Any implementation experience from other standards
>> authors/implementers who have run into problems with shifty
>> whitespace definitions?
>
> Use properties, not character name patterns. If you have strong
> reasons not to use a property as-is, then still use it, just with
> inclusion & exclusion overrides.
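
For illustration, the "property plus overrides" approach suggested above might look like the following sketch. The code point list is hard-coded here from the White_Space entries in PropList.txt purely as an assumption of the example, and the names WHITE_SPACE and make_space_test are hypothetical; a real implementation would load the property data from a Unicode library.

```python
# White_Space code points, hard-coded for this sketch only (assumption:
# copied from PropList.txt; a real implementation would load the data).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))            # TAB, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680}    # SPACE, NEL, NBSP, OGHAM SPACE MARK
    | set(range(0x2000, 0x200B))          # EN QUAD .. HAIR SPACE
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)


def make_space_test(include=(), exclude=()):
    """Build a predicate from the property plus explicit overrides."""
    allowed = (WHITE_SPACE | set(include)) - set(exclude)
    return lambda ch: ord(ch) in allowed


# The property as-is: ZWSP (U+200B) and ZWNBSP (U+FEFF) are not whitespace.
is_ws = make_space_test()

# The same property with hypothetical protocol-specific overrides.
is_ws_custom = make_space_test(include={0x200B}, exclude={0x0085})
```

The point of the design is that the property stays the baseline, and any deviation from it is written down explicitly rather than inferred from character names.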
Short answer: I cannot use character properties, and cannot use
As I have posted publicly, I am proposing some experimental
Unicode-friendly extensions to IETF ABNF (currently in
going to change that around a bit). There is (some) renewed interest in
that part of the work since RFCs will permit UTF-8 in certain places,
and IETF protocols are supposed to march towards "Net-Unicode" per RFC 5198.
Being a BNF, ABNF does not have exclusion, only incremental
alternatives. Character properties would require a runtime library,
which defeats much of the purpose of (A)BNF.
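
To make the "no exclusion" constraint concrete: any "range minus some code points" has to be flattened ahead of time into a list of alternative ranges. The helper below is only a sketch of that flattening step; the function name abnf_alternatives is hypothetical and is not part of any ABNF tooling.

```python
def abnf_alternatives(lo, hi, excluded):
    """Rewrite 'lo..hi minus excluded ranges' as ABNF range alternatives.

    `excluded` is an iterable of inclusive (first, last) code point pairs.
    """
    parts = []
    start = lo
    for ex_lo, ex_hi in sorted(excluded):
        if ex_hi < start or ex_lo > hi:
            continue                      # exclusion lies outside what is left
        if ex_lo > start:
            parts.append((start, ex_lo - 1))
        start = max(start, ex_hi + 1)     # resume just past the excluded run
    if start <= hi:
        parts.append((start, hi))
    return " / ".join(
        f"%x{a:X}" if a == b else f"%x{a:X}-{b:X}" for a, b in parts
    )


# All code points minus the surrogate block, as plain ABNF alternatives.
print(abnf_alternatives(0x0, 0x10FFFF, [(0xD800, 0xDFFF)]))
```

The output is ordinary ABNF that needs no runtime support, at the cost of freezing the excluded set at the time the grammar is written.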
The current proposed core rules have <UNICODE> (scalar values, i.e., all code
points with a doughnut hole for the surrogates) and <BEYONDASCII> (scalar values without the ASCII
range). While these are technically accurate, they will not be
particularly useful for protocol designers as they are over-inclusive.
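
As a rough sketch of what these two rules cover, expressed as predicates over code points rather than as ABNF: the function names are hypothetical, and the boundaries are assumptions taken from the descriptions above (surrogates are U+D800-U+DFFF, ASCII is U+0000-U+007F), not from the text of the proposal.

```python
def is_unicode_scalar(cp):
    """<UNICODE> as described: every code point except the surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_beyond_ascii(cp):
    """<BEYONDASCII> as described: scalar values outside the ASCII range."""
    return is_unicode_scalar(cp) and cp > 0x7F


# Size of each repertoire, which shows how inclusive the rules are.
scalar_count = sum(1 for cp in range(0x110000) if is_unicode_scalar(cp))
beyond_count = sum(1 for cp in range(0x110000) if is_beyond_ascii(cp))
print(scalar_count, beyond_count)
```

Both sets still contain unassigned code points, controls, and noncharacters, which is the over-inclusiveness mentioned above.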
One of the rules I am working on is <UCHAR>, which is like <CHAR> but
for Unicode. That eliminates the noncharacter code points (which,
technically, are characters...that are defined as "not characters") as
well as NULL, which is already eliminated by <CHAR>.
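
A minimal sketch of the repertoire described for <UCHAR>, again as a predicate. It assumes the standard set of 66 noncharacters (U+FDD0-U+FDEF plus the last two code points of each of the 17 planes); the function names are hypothetical and the actual rule in the draft may differ.

```python
def is_noncharacter(cp):
    """The 66 noncharacters: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF per plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_uchar(cp):
    """Scalar values minus NUL and minus the noncharacters (assumed shape)."""
    if not 0x0001 <= cp <= 0x10FFFF:      # drops NUL and out-of-range values
        return False
    if 0xD800 <= cp <= 0xDFFF:            # drops the surrogates
        return False
    return not is_noncharacter(cp)


noncharacter_count = sum(1 for cp in range(0x110000) if is_noncharacter(cp))
uchar_count = sum(1 for cp in range(0x110000) if is_uchar(cp))
print(noncharacter_count, uchar_count)
```

Because the noncharacter set is fixed, this repertoire can be written out once as static ranges without depending on a character database.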
I was going to avoid extending <VCHAR> (which is U+0021-U+007E, i.e., no
spaces and no control characters) because it's a bit too complicated.
However, there are actual protocols, including a protocol that I am
working on, that define parts of the repertoire as "graphic symbols and
spacing characters", and elsewhere, "graphic symbols" (i.e., no spaces
and no control characters). So the space characters are relevant at a
level beneath requiring a full Unicode runtime to get at the character
The newline issue is related but separate, and since IETF continues to
use CRLF as the standard for interchange, I don't see a reason to touch