Whitespace characters in Unicode

Sun Aug 7 18:46:27 CDT 2016

On 8/5/2016 10:07 AM, Markus Scherer wrote:
> On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard 
> <lists+unicode at seantek.com> wrote:
>
>     What makes a character a "whitespace" in Unicode, e.g., why are
>     ZWSP and ZWNBSP not "whitespace" even though they clearly say
>     "SPACE" in them?
>
>
>     Any implementation experience from other standards
>     authors/implementers who have run into problems with shifty
>     whitespace definitions?
>
>
> Use properties, not character name patterns. If you have strong 
> reasons not to use a property as-is, then still use it, just with 
> inclusion & exclusion overrides.

Short answer: I cannot use character properties, and cannot use 
exclusion overrides.

As I have posted publicly, I am proposing some experimental 
Unicode-friendly extensions to IETF ABNF (currently in 
https://tools.ietf.org/html/draft-seantek-abnf-more-core-rules-05 , 
though I am going to change that around a bit). There is (some) renewed 
interest in that part of the work, since RFCs will permit UTF-8 in 
certain places and IETF protocols are supposed to march towards 
"Net-Unicode" per RFC 5198.

Being a BNF, ABNF does not have exclusion, only incremental 
alternatives. Character properties would require a runtime library, 
which goes squarely against the purpose of (A)BNF.
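
To illustrate the "no exclusion" constraint, here is a minimal Python sketch of how a range with holes can be flattened offline into plain alternatives that ABNF can express. The helper names subtract and to_abnf are hypothetical and are not part of the draft or of any ABNF tool.

```python
def subtract(base, holes):
    """Remove inclusive (lo, hi) holes from the inclusive range `base`.

    Returns the remaining pieces as a sorted list of inclusive ranges.
    """
    lo, hi = base
    out = []
    for h_lo, h_hi in sorted(holes):
        if h_hi < lo or h_lo > hi:
            continue  # hole lies entirely outside what remains
        if h_lo > lo:
            out.append((lo, h_lo - 1))  # keep the piece before the hole
        lo = h_hi + 1  # resume just past the hole
    if lo <= hi:
        out.append((lo, hi))
    return out


def to_abnf(ranges):
    """Render inclusive ranges as an ABNF alternation of %x values."""
    parts = []
    for lo, hi in ranges:
        parts.append(f"%x{lo:X}" if lo == hi else f"%x{lo:X}-{hi:X}")
    return " / ".join(parts)


# All code points minus the surrogate block, written without any exclusion.
print(to_abnf(subtract((0x0, 0x10FFFF), [(0xD800, 0xDFFF)])))
```

The exclusion is resolved once, when the grammar is written, so a parser reading the grammar needs no Unicode library at all.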

The current proposed core rules have <UNICODE> (Unicode scalar values, 
i.e., all code points with a doughnut hole for the surrogates) and 
<BEYONDASCII> (scalar values without the ASCII range). While these are 
technically accurate, they will not be particularly useful for protocol 
designers because they are over-inclusive.
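
As a rough illustration of what those two rules cover, here is a minimal Python sketch. The function names are hypothetical, and the ranges reflect the usual definition of Unicode scalar values (U+0000-U+10FFFF minus the surrogates U+D800-U+DFFF), not text copied from the draft.

```python
# Hypothetical helpers; names and ranges are illustrative, not quoted from the draft.
SURROGATES = range(0xD800, 0xE000)  # U+D800-U+DFFF, the "doughnut hole"


def is_unicode(cp: int) -> bool:
    """True if cp is a Unicode scalar value."""
    return 0 <= cp <= 0x10FFFF and cp not in SURROGATES


def is_beyondascii(cp: int) -> bool:
    """True if cp is a scalar value outside the ASCII range U+0000-U+007F."""
    return cp > 0x7F and is_unicode(cp)
```

Both predicates accept every unassigned code point, control character, and noncharacter, which is the sense in which the rules are over-inclusive.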

One of the rules I am working on is <UCHAR>, which is like <CHAR> but 
for Unicode. That eliminates the noncharacter code points (which, 
technically, are characters...that are defined as "not characters") as 
well as NULL, which is already eliminated by <CHAR>.
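
A minimal Python sketch of the membership test that description implies follows. It assumes the standard set of 66 noncharacters (U+FDD0-U+FDEF plus the last two code points of each of the 17 planes). The names is_noncharacter and is_uchar are hypothetical, and the actual <UCHAR> rule in the draft may differ.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+FDD0-U+FDEF, plus U+nFFFE and U+nFFFF at the end of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_uchar(cp: int) -> bool:
    """Scalar value that is neither NULL nor a noncharacter (assumed reading)."""
    if cp <= 0 or cp > 0x10FFFF:
        return False  # excludes NULL and anything outside the code space
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogates are not scalar values
    return not is_noncharacter(cp)
```

Because the noncharacters sit at fixed positions, the same set can be written out as a finite list of ABNF ranges rather than computed at run time.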

I was going to avoid extending <VCHAR> (which is U+0021-U+007E, i.e., no 
spaces and no control characters) because it's a bit too complicated. 
However, there are actual protocols, including a protocol that I am 
working on, that define parts of the repertoire as "graphic symbols and 
spacing characters", and elsewhere, "graphic symbols" (i.e., no spaces 
and no control characters). So the space characters matter at a level 
below the one where a full Unicode runtime is available to look up 
character properties.
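
For concreteness, here is a sketch of what a fixed, library-free list of space characters could look like in Python. The table is assumed to match the White_Space property in recent versions of the Unicode Character Database (PropList.txt) and should be checked against the version being targeted. The names WHITE_SPACE, is_white_space, and is_vchar are hypothetical.

```python
# Inclusive ranges assumed to carry the White_Space property (check PropList.txt).
WHITE_SPACE = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]


def is_white_space(cp: int) -> bool:
    """True if cp falls in one of the fixed White_Space ranges above."""
    return any(lo <= cp <= hi for lo, hi in WHITE_SPACE)


def is_vchar(cp: int) -> bool:
    """ABNF <VCHAR>: visible ASCII, U+0021-U+007E."""
    return 0x21 <= cp <= 0x7E


# ZWSP (U+200B) and ZWNBSP (U+FEFF) are not in the table despite their names.
print(is_white_space(0x200B), is_white_space(0xFEFF))
```

A static table like this goes stale if a later Unicode version changes the property, which is the trade-off of avoiding a runtime property lookup.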

The newline issue is related but separate, and since IETF continues to 
use CRLF as the standard for interchange, I don't see a reason to touch 
it further.

Best regards,

Sean