\b{wb}

Karl Williamson public at khwilliamson.com
Sat Aug 22 15:08:14 CDT 2015


The concept of \b in a regular expression meaning to match the boundary 
between a word and non-word was invented by Larry Wall, for the Perl 
programming language.  This was before Unicode, and a word was defined 
as alphanumerics plus the underscore, which fit well with how 
identifiers in that computer language (and many others) were defined. 
Essentially, \b breaks only where a run of word characters meets a run 
of non-word characters, never inside either kind of run.
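
A minimal sketch of that classic behaviour, written in Python rather than Perl on the assumption that Python's re module defines \b the same way (as the boundary between \w and \W). The helper name classic_segments is made up for illustration:

```python
import re

def classic_segments(text):
    # Split at every position where the classic \b matches.
    # Splitting on an empty match needs Python 3.7 or later.
    # Empty strings produced at the start or end are dropped.
    return [piece for piece in re.split(r"\b", text) if piece]

# A full stop followed by two spaces stays together as one non-word run.
print(classic_segments("Hello.  World"))  # ['Hello', '.  ', 'World']
```

Note that the full stop and both spaces come out as a single segment: the classic \b never breaks inside a run of non-word characters.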

The latest version of Perl 5 (recently released) has added \b{wb}, based 
on Unicode's definition.  The typical expectation of its programmers is 
that it would be a drop-in replacement for the old \b, with much better 
results in parsing natural languages.

But it isn't such a replacement, which has created some consternation. 
The main reason is that, unlike \b, it treats the boundary between two 
adjacent white-space characters as a breaking opportunity, so white 
space never forms a run.  Thus if you have two spaces after a full stop, 
each is treated as an individual word.
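
To make the difference concrete, here is a rough Python model of the behaviour described above. It is not an implementation of the UAX #29 word-break rules, and it does not call Perl's \b{wb}. It only assumes that word characters form runs while every other character, including each space, becomes its own segment. The helper name one_break_per_space is hypothetical:

```python
import re

def one_break_per_space(text):
    # Crude model, NOT UAX #29: runs of word characters stay together,
    # while every non-word character (each space, each punctuation mark)
    # is returned as a separate segment.
    return re.findall(r"\w+|\W", text)

# The two spaces after the full stop come out as two separate segments.
print(one_break_per_space("Hello.  World"))  # ['Hello', '.', ' ', ' ', 'World']
```

Under the old \b, the full stop and the two spaces form one run; under this model they are three separate segments, which is the surprise in question.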

My question is: "Was this intentional, and if so, why?"

TR18 says \b{w} is a "Zero-width match at a Unicode word boundary. Note 
that this is different than \b alone, which corresponds to \w and \W."

And UAX29 says "adjacent spaces are collapsed to a single space" in 
intelligent cut and paste using the WB property.


