White spaces for the purpose of programming languages

Markus Scherer markus.icu at gmail.com
Wed Mar 31 22:10:01 CDT 2021


On Fri, Mar 26, 2021 at 4:50 AM Corentin via Unicode <unicode at unicode.org>
wrote:

> In UAX #44, White_Space is described as "Spaces, separator characters and
> other control characters which should be treated by programming languages
> as "white space" for the purpose of parsing elements."
>
> From what I can tell, ECMAScript/JS uses White_Space (or
> rather Space_Separator which is slightly different), Rust uses
> Pattern_White_Space which is a more restricted set, while most other
> languages seem to only support the ASCII spaces.
>
> I wanted to confirm that the intent is that White_Space is recommended in
> programming languages.
> I assumed that Pattern_White_Space would be more suitable for that purpose,
> but it isn't actually clear from a reading of UAX #31.
>

We came up with Pattern_White_Space for working with ICU *rule and pattern
strings* (e.g., rules to define sort orders, rules for number spellout,
date/time/number formatting patterns).
This is why we included the RLM and LRM controls -- making it easy to keep
rule strings legible when there are RTL characters.
(If we were defining it now, I assume that we would also include the newer
ALM (U+061C), but the property is immutable so we can't add anything.)

We proposed this as a Unicode property because it seemed useful.
We were not specifically thinking about whole programming languages.
I assume that existing languages are not going to want to make a change
here.

When parsing *user input*, we generally look for all White_Space where
"space" is allowed.

Personally, I think that White_Space is unnecessarily broad for programming
language syntax.
Pattern_White_Space might be a useful starting point.
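
To make the difference between the two properties concrete, here is a
minimal Python sketch. The code point ranges are transcribed by hand from
my reading of PropList.txt, so treat them as an assumption and check them
against the current Unicode Character Database. The helper names (expand,
describe) are made up for this illustration.

```python
# Sketch only: ranges copied by hand from PropList.txt (assumption, verify
# against the Unicode Character Database before relying on them).

def expand(ranges):
    """Turn inclusive (first, last) code point pairs into a set of characters."""
    return frozenset(chr(cp) for lo, hi in ranges for cp in range(lo, hi + 1))

WHITE_SPACE = expand([
    (0x0009, 0x000D),  # TAB, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEL
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x2000, 0x200A),  # EN QUAD .. HAIR SPACE
    (0x2028, 0x2029),  # LINE SEPARATOR, PARAGRAPH SEPARATOR
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
])

PATTERN_WHITE_SPACE = expand([
    (0x0009, 0x000D),  # TAB, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEL
    (0x200E, 0x200F),  # LRM, RLM
    (0x2028, 0x2029),  # LINE SEPARATOR, PARAGRAPH SEPARATOR
])

def describe(chars):
    """Format a set of characters as sorted U+XXXX labels."""
    return sorted(f"U+{ord(c):04X}" for c in chars)

if __name__ == "__main__":
    print("only in Pattern_White_Space:", describe(PATTERN_WHITE_SPACE - WHITE_SPACE))
    print("only in White_Space:", describe(WHITE_SPACE - PATTERN_WHITE_SPACE))
```

Note that Python's built-in str.isspace() is not the same as White_Space
(it is based on different character properties), which is why the sets are
spelled out here.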

   - The bidi controls should probably not be programming "white space" on
   their own because they don't have any advance width. They should be allowed
   somewhere, maybe at token boundaries or after indenting spaces.
   - U+0085 NEL is a holdover from OS/390 and the line feed confusion on
   IBM systems. (They didn't much care what LF/NEL mapped to because their
   text systems had a "record" per line and didn't need a line separator
   character like Unix-y systems.)
      - I can't tell if the EBCDIC platforms are "alive". Elsewhere I have
      tried to find out if there is a competent C++11 compiler available.
   - Line & paragraph separators apparently never got much use.
   - Form feed? Vertical tab?
   - East Asian developers might appreciate U+3000 ideographic space
   because their IMEs tend to emit that.

So maybe just TAB, LF, CR, space (U+0020), and possibly ideographic space
(U+3000), plus also LRM/RLM/ALM at certain boundaries?
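
One possible reading of that suggestion, as a rough Python sketch: the
five spacing characters separate tokens, and LRM/RLM/ALM are tolerated
only where they touch a token boundary. The names SYNTAX_SPACE,
BIDI_MARKS and split_tokens are hypothetical, and "at certain boundaries"
is interpreted here as "at the start or end of a token", which is an
assumption rather than anything specified above.

```python
# Sketch only: one interpretation of the minimal set suggested above.

# TAB, LF, CR, SPACE, IDEOGRAPHIC SPACE separate tokens.
SYNTAX_SPACE = frozenset("\t\n\r \u3000")

# LRM, RLM, ALM: no advance width, so not separators on their own.
BIDI_MARKS = "\u200e\u200f\u061c"

def split_tokens(source):
    """Split on SYNTAX_SPACE, then drop bidi marks at token edges."""
    chunks = []
    current = []
    for ch in source:
        if ch in SYNTAX_SPACE:
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        chunks.append("".join(current))
    # Bidi marks are accepted only at token boundaries; a mark in the
    # middle of a token is left in place for a later stage to reject.
    stripped = (chunk.strip(BIDI_MARKS) for chunk in chunks)
    return [token for token in stripped if token]

if __name__ == "__main__":
    print(split_tokens("let\u3000x\t=\u200e 1\r\n"))
```

With this reading, NO-BREAK SPACE (U+00A0) and the other White_Space
characters outside the minimal set do not separate tokens, so a real
lexer would have to decide whether to report them as errors.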

Best regards,
markus