Richard Wordingham richard.wordingham at ntlworld.com
Sat Jul 4 13:20:05 CDT 2015

On Sat, 4 Jul 2015 17:02:00 +0200 (CEST)
Marcel Schneider <charupdate at orange.fr> wrote:

> On Fri, Jul 03, 2015, Richard Wordingham wrote:
> > On Fri, 3 Jul 2015 17:19:13 +0200 (CEST)
> > Marcel Schneider wrote:
> I considered not to reply any more in this unfaithful dialogue, where
> after bringing up some historic examples to make me think about them,
> Richard switches back to present and makes people believe I could
> suppose that any country could prefer the use of other means than
> what's world standard.

I cannot work out what you think I am making people believe you might
suppose.  I was pointing out that not everyone uses visible word
boundaries.  I will also note that people are reluctant to type
invisible characters that bring them no immediate benefit.

> Now let's come to the core: Why on earth
> do we need word boundaries for whole word search in Latin script,
> while Thai, Burmese and Cambodian scripts Richard mentions as
> examples, use implementations that can find whole words without any
> need of "spaces or any other [separating] character"?

The Thai and Cambodian implementations are far from perfect, even when
applied to the Thai and Cambodian languages.  Using a dictionary for
the national languages on text of other languages naturally has even
worse performance.  A quick experiment suggests that for whole word
search in Thai, LibreOffice simply ignores any boundaries between Thai
word characters.  Double click and ctrl/arrow use different rules.

It's quite possible that we are misinterpreting the results of whole
word searches.  One way of implementing whole word search is to do a
general search and then check whether the word found is part of a
larger word.  To do that, one might simply ask whether the
characters before and after the string found are permitted in words.
One might easily set things up so that by omission U+2060 is not
considered part of a word - the code could have been written before
U+2060 was assigned and not updated since.
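
To illustrate the kind of implementation I have in mind, here is a
minimal Python sketch.  It is only a guess at the general approach, not
the code of LibreOffice or any other product.  The names is_word_char
and whole_word_find are hypothetical, and the assumption that a "word
character" means a letter, mark or digit is mine.

```python
import unicodedata

WORD_JOINER = "\u2060"  # general category Cf (format)


def is_word_char(ch):
    """Assumed rule: letters (L*), marks (M*) and digits (N*) are word characters."""
    # U+2060 is category Cf, so it is excluded here by omission,
    # not by any deliberate decision about WORD JOINER.
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def whole_word_find(text, needle):
    """Plain substring search, then reject hits that sit inside a larger word."""
    hits = []
    if not needle:
        return hits
    start = text.find(needle)
    while start != -1:
        end = start + len(needle)
        # A hit counts only if the neighbours on both sides are not word characters.
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(needle, start + 1)
    return hits
```

With this rule, searching for "cat" in "catfish" finds nothing, but
searching in "cat" + U+2060 + "fish" reports a whole-word match, because
the check treats U+2060 as a non-word character even though it is meant
to forbid a break.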

