<div dir="ltr"><div dir="ltr">On Fri, Mar 26, 2021 at 4:50 AM Corentin via Unicode &lt;<a href="mailto:unicode@unicode.org">unicode@unicode.org</a>&gt; wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div>In UAX #44, White_space is described as "Spaces, separator characters and other control characters which should be treated by programming languages as "white space" for the purpose of parsing elements."</div></div><div><br></div><div>From what I can tell, ECMAScript/JS uses White_space (or rather Space_Separator which is slightly different), Rust uses Pattern_White_Space which is a more restricted set, while most other languages seem to only support the ASCII spaces. </div><div><br></div><div>I wanted to confirm that the intent is that White_Space is recommended in programming languages.</div><div>I assumed that Pattern_White_Space would be more suitable for that purpose,</div><div>but it isn't actually clear from a reading of UAX31</div></div></blockquote><div><br></div><div>We came up with Pattern_White_Space for working with ICU <i>rule and pattern strings</i> (e.g., rules to define sort orders, rules for number spellout, date/time/number formatting patterns).</div><div>This is why we included the RLM and LRM controls -- making it easy to keep rule strings legible when there are RTL characters.</div><div>(If we were defining it now, I assume that we would also include the newer ALM (U+061C), but the property is immutable so we can't add anything.)</div><div><br></div><div>We proposed this as a Unicode property because it seemed useful.</div><div>We were not specifically thinking about whole programming languages.</div><div>I assume that existing languages are not going to want to make a change here.</div><div><br></div><div>When parsing <i>user input</i>, we generally look for all White_Space where "space" is 
allowed.</div><div><br></div><div>Personally, I think that White_Space is unnecessarily broad for programming language syntax.</div><div>Pattern_White_Space might be a useful starting point.</div><div><ul><li>The bidi controls should probably not be programming "white space" on their own because they don't have any advance width. They should be allowed somewhere, maybe at token boundaries or after indenting spaces.</li><li>U+0085 NEL is a holdover from OS/390 and the line feed confusion on IBM systems. (They didn't much care what LF/NEL mapped to because their text systems had a "record" per line and didn't need a line separator character like Unix-y systems.)</li><ul><li>I can't tell if the EBCDIC platforms are "alive". Elsewhere I have tried to find out if there is a competent C++11 compiler available.</li></ul><li>Line &amp; paragraph separators (U+2028, U+2029) apparently never got much use.</li><li>Form feed? Vertical tab?</li><li>East Asian developers might appreciate U+3000 ideographic space because their IMEs tend to emit that.</li></ul></div><div>So maybe just TAB, LF, CR, space (U+0020), and possibly wide space (U+3000), plus also LRM/RLM/ALM at certain boundaries?</div><div><br></div><div>Best regards,</div><div>markus</div></div></div>
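<div>To make the comparison above concrete, here is a rough Python sketch of the three sets under discussion. The White_Space and Pattern_White_Space code points are transcribed by hand from the Unicode Character Database (PropList.txt) and should be checked against the current data files. The names <code>PROPOSED_SPACES</code>, <code>BIDI_MARKS</code>, and <code>is_proposed_space</code> are hypothetical, chosen only for this illustration; they simply encode the "maybe just TAB, LF, CR, space, and possibly U+3000" idea and are not taken from any specification.</div>

```python
# Sketch only: code points transcribed by hand from PropList.txt;
# verify against the Unicode Character Database before relying on them.

# White_Space (UAX #44): 25 code points.
WHITE_SPACE = frozenset(
    list(range(0x0009, 0x000E))          # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]   # SPACE, NEL, NBSP, OGHAM SPACE MARK
    + list(range(0x2000, 0x200B))        # EN QUAD .. HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

# Pattern_White_Space (UAX #31): 11 code points, immutable.
PATTERN_WHITE_SPACE = frozenset(
    list(range(0x0009, 0x000E))
    + [0x0020, 0x0085]
    + [0x200E, 0x200F]                   # LRM, RLM
    + [0x2028, 0x2029]                   # LINE / PARAGRAPH SEPARATOR
)

# Hypothetical minimal set from the suggestion above (assumption, not a standard).
PROPOSED_SPACES = frozenset([0x0009, 0x000A, 0x000D, 0x0020, 0x3000])

# Bidi marks that might be tolerated only at certain boundaries (assumption).
BIDI_MARKS = frozenset([0x200E, 0x200F, 0x061C])  # LRM, RLM, ALM


def is_proposed_space(ch: str) -> bool:
    """Return True if ch is in the hypothetical minimal whitespace set."""
    return ord(ch) in PROPOSED_SPACES


if __name__ == "__main__":
    # Code points in Pattern_White_Space that are not White_Space.
    only_pattern = sorted(PATTERN_WHITE_SPACE - WHITE_SPACE)
    print([f"U+{cp:04X}" for cp in only_pattern])
    # Code points the minimal proposal would drop from Pattern_White_Space.
    dropped = sorted(PATTERN_WHITE_SPACE - PROPOSED_SPACES - BIDI_MARKS)
    print([f"U+{cp:04X}" for cp in dropped])
```

<div>In this sketch the only Pattern_White_Space members outside White_Space are LRM and RLM, and U+3000 is in White_Space but not in Pattern_White_Space, which is why it has to be added explicitly to the minimal set.</div>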