\b{wb}

Richard Wordingham richard.wordingham at ntlworld.com
Sat Aug 22 16:46:08 CDT 2015


On Sat, 22 Aug 2015 14:08:14 -0600
Karl Williamson <public at khwilliamson.com> wrote:

> But it isn't such a replacement, creating some consternation, and the 
> main reason is that, unlike \b, it treats the boundary between white 
> space characters as a breaking opportunity, so that it doesn't create 
> runs of them.  Thus if you have two spaces after a full stop, it
> treats each as an individual word.
> 
> My question is "Was this intentional, and if so, Why?"

See below.

> TR18 says \b{w} is a "Zero-width match at a Unicode word boundary.
> Note that this is different than \b alone, which corresponds to \w
> and \W."

Unless I'm being stupid, \b and \b{w} are indeed very different.
Consider a sequence <U+0020, U+1F1EB REGIONAL INDICATOR SYMBOL LETTER F,
U+1F1F7 REGIONAL INDICATOR SYMBOL LETTER R, U+0041 LATIN CAPITAL
LETTER A, U+0062 LATIN SMALL LETTER B>

That has two internal word boundaries, splitting it into a space, a
flag, and the word "Ab".  Is this what you want?
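
To make the contrast concrete, here is a rough sketch using Python's re
module, chosen only because its \b is the plain \w/\W kind; it has no
\b{w}, so the UAX #29 offsets below are written out by hand from the
description above (plus the two ends of the text), not computed.

```python
import re

# <U+0020, U+1F1EB, U+1F1F7, "A", "b">: space, flag (two regional
# indicators), then the word "Ab".
s = " \U0001F1EB\U0001F1F7Ab"

# Code point offsets where the \w/\W-based \b matches.
plain_b = [m.start() for m in re.finditer(r"\b", s)]
print(plain_b)  # [3, 5]

# Assumption: offsets transcribed by hand from the text above
# (start of text, after the space, after the flag, end of text).
uax29 = [0, 1, 3, 5]

# Boundaries that the word-break rules report but plain \b does not.
print(sorted(set(uax29) - set(plain_b)))  # [0, 1]
```

Plain \b sees nothing between the space and the flag, because neither
side is a \w character.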

Worse, consider a short Thai sentence ผมไม่มีคอมพิวเตอร์ที่ดี.  That
gets split by ICU into |ผม|ไม่มี|คอมพิวเตอร์|ที่|ดี| - 5 words and
4 internal word boundaries.  Note that each stretch between adjacent
boundaries holds a word or two.  Is this what you want?

> My question is "Was this intentional, and if so, Why?"

Take a look at the rules in UAX#29 Section 4.1.1.  Apart from the first
two and the last, they all identify where word boundaries aren't.  This
is tidy - the algorithm concentrates on working out where a word
continues.

In principle, you could, I believe, extend the rules so that runs of
characters lying outside both words and regional indicator runs were
not divided, but it would make for a more complicated algorithm with
plenty of opportunities for error.  I think the thought was that
word-free stretches did not need to be assembled into runs of non-word
material.

The short answer, of course, is that the regular expression engine
could do this final step of post-processing itself.  This may get
tricky with customised word-breaking.
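
A minimal sketch of such a post-processing step, in Python for
illustration only: it assumes the text has already been cut into
segments by some UAX #29 word breaker, and it merges adjacent segments
that a caller-supplied test regards as filler (white space by default).
The names merge_runs and is_filler are hypothetical, not part of any
regular expression engine.

```python
from typing import Callable, List


def merge_runs(
    segments: List[str],
    is_filler: Callable[[str], bool] = str.isspace,
) -> List[str]:
    """Join adjacent filler segments into single runs.

    `segments` is assumed to come from a UAX #29 word breaker;
    `is_filler` decides which segments count as non-word material.
    """
    merged: List[str] = []
    prev_filler = False
    for seg in segments:
        filler = is_filler(seg)
        if filler and prev_filler:
            # Extend the run instead of starting a new segment.
            merged[-1] += seg
        else:
            merged.append(seg)
        prev_filler = filler
    return merged


# Two spaces after a full stop, each reported as its own segment.
print(merge_runs(["stop", ".", " ", " ", "Next"]))
# ['stop', '.', '  ', 'Next']
```

With customised word-breaking, the is_filler test is where the
trickiness would show up, since what counts as non-word material then
depends on the tailoring.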

Richard.


