White spaces for the purpose of programming languages

Corentin corentin.jabot at gmail.com
Fri Mar 26 06:44:11 CDT 2021


In UAX #44, White_Space is described as "Spaces, separator characters and
other control characters which should be treated by programming languages
as "white space" for the purpose of parsing elements."

From what I can tell, ECMAScript/JS uses White_Space (or
rather Space_Separator, which is slightly different), Rust uses
Pattern_White_Space, which is a more restricted set, while most other
languages seem to support only the ASCII spaces.
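
To make the difference between these sets concrete, here is a minimal
sketch in Python. The code points are hard-coded from my reading of
PropList.txt and UnicodeData.txt, so they are an assumption that should be
checked against the current data files. The helper names expand and fmt
are made up for this illustration.

```python
def expand(*ranges):
    """Build a set of code points from inclusive (first, last) ranges."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}


def fmt(code_points):
    """Render a set of code points as a sorted 'U+XXXX' list."""
    return ", ".join(f"U+{cp:04X}" for cp in sorted(code_points))


# White_Space (PropList.txt); assumed unchanged since U+180E was removed.
WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
)

# Pattern_White_Space (PropList.txt); a stable property.
PATTERN_WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
)

# General_Category=Space_Separator (Zs), from UnicodeData.txt.
SPACE_SEPARATOR = expand(
    (0x0020, 0x0020), (0x00A0, 0x00A0), (0x1680, 0x1680), (0x2000, 0x200A),
    (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000),
)

# Code points a White_Space lexer accepts that a Pattern_White_Space one rejects.
only_white_space = WHITE_SPACE - PATTERN_WHITE_SPACE
# Code points a Pattern_White_Space lexer accepts that a White_Space one rejects.
only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
# Code points in White_Space that are not Space_Separator.
not_zs = WHITE_SPACE - SPACE_SEPARATOR

print("White_Space only:        ", fmt(only_white_space))
print("Pattern_White_Space only:", fmt(only_pattern))
print("White_Space minus Zs:    ", fmt(not_zs))
```

If the tables above are right, Pattern_White_Space is not a strict subset
of White_Space: it adds the two directional marks U+200E and U+200F, and
it leaves out every Space_Separator character other than U+0020.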

I wanted to confirm that the intent is that White_Space is recommended for
programming languages.
I assumed that Pattern_White_Space would be more suitable for that purpose,
but that isn't actually clear from a reading of UAX #31.

It first states in its introduction:
> A common task facing an implementer of the Unicode Standard is the
> provision of a parsing and/or lexing engine for identifiers, such as
> programming language variables or domain names.

But later:

> Pattern Syntax: There are many circumstances where software interprets
> patterns that are a mixture of literal characters, whitespace, and syntax
> characters. Examples include regular expressions, Java collation rules,
> Excel or ICU number formats, and many others.

(Programming languages are not mentioned there.)

Any clarification as to whether White_Space should be considered over
Pattern_White_Space for programming languages would be appreciated :)

I think that clarification might be useful for many users as different
programming languages have made different choices!

