WORD JOINER vs ZWNBSP

Marcel Schneider charupdate at orange.fr
Fri Jul 3 10:19:13 CDT 2015


On Thu, Jul 02, 2015, Richard Wordingham  wrote:

> On Thu, 2 Jul 2015 10:37:17 +0200 (CEST)
> Marcel Schneider  wrote:
> 
> > (because it is
> > sufficient to simply type the words one after each other without
> > anything between, to get them as *one* word)
> 
> This only applies where it is traditional to separate words, a habit
> the Romans got out of and the Irish revived.

IMHO the case is a bit different in handwritten or engraved text vs word processing.

> Unicode Word Boundary Rule WB4 (in UAX #29 'Unicode Text
> Segmentation') decrees that U+2060 and U+FEFF be ignored in
> word-boundary determination except that newline breaks before them and
> that inserting them between CR and LF creates an extra word
> boundary.

Looking at the set of existing format characters (Cf), ZWSP, ZWNBSP and WJ stand apart from the rest of the group in that they are used to detect word boundaries in cases like whole-word search and spell checking. (They indicate word boundaries.) This is why, in practice, they are remapped to another category, a practice expressly allowed by UAX #29. So in fact, the WB4 rule scarcely ever (say, *never*) applies to them. One can discover this for oneself by following the hints given at the very beginning of UAX #29.
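To make this concrete, here is a minimal sketch in Python (my choice of language for illustration only; UAX #29 prescribes none). It checks the General_Category of the three characters with the standard unicodedata module, then contrasts a naive whole-word search built on the `\b` of the `re` module with a simplified WB4-style treatment that drops U+2060 and U+FEFF before matching. The helper names `drop_wb4_ignorables` and `whole_word_found` are hypothetical, and the sketch covers only these two characters, not the full Extend and Format classes of WB4.

```python
import re
import unicodedata

ZWSP, WJ, ZWNBSP = "\u200b", "\u2060", "\ufeff"

# General_Category of each character, as reported by Python's bundled UCD.
categories = {f"U+{ord(c):04X}": unicodedata.category(c) for c in (ZWSP, WJ, ZWNBSP)}

# Characters after which WB4 does not absorb a following format character.
LINE_BREAKS = "\r\n\x0b\x0c\x85\u2028\u2029"


def drop_wb4_ignorables(text):
    """Simplified WB4 (assumption: only WJ and ZWNBSP are handled).

    The character is dropped unless it stands at the start of the text
    or directly after a line-break character.
    """
    out = []
    for ch in text:
        after_break = not out or out[-1] in LINE_BREAKS
        if ch in (WJ, ZWNBSP) and not after_break:
            continue  # treated as absent for boundary purposes
        out.append(ch)
    return "".join(out)


def whole_word_found(word, text):
    """Naive whole-word search: \\b is a transition between \\w and non-\\w."""
    return re.search(rf"\b{re.escape(word)}\b", text) is not None


sample = "month" + WJ + "s"
print(categories)
print(whole_word_found("month", sample))                       # naive search
print(whole_word_found("month", drop_wb4_ignorables(sample)))  # WB4-style
```

In this sketch the naive search sees a boundary at the WORD JOINER, because U+2060 is not a `\w` character for `re`, while the WB4-style variant sees the single word _months_. Which of the two a given application implements is exactly the point under discussion.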

I believe that the UAXes, like the Standard as a whole, are not here to decree, as Richard puts it, but to promote knowledge and to share a number of useful rules, given in accordance with practice and real needs. Perhaps some sentences could be rewritten for clarification, in order to stick even closer to reality.

Perhaps, too, we should reconsider what we are talking about when using the expression “word boundary”. It is somewhat ambiguous, because UIs are designed to meet different needs, and because in English the apostrophe is often part of the sequence it stands in. If I'm right, U+2019 or U+02BC in _month’s_ is expected to indicate a word boundary, and a search for the whole word _month_ will succeed, while _won’t_ in the UAX #29 example is *one* word, and searching for a supposed word _won_ makes no sense (and will fail). However, both are selected as a whole by Shift+Ctrl+LEFT/RIGHT ARROW.
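As an illustration only, here is how one specific engine, Python's `re` module, treats the two apostrophe candidates; other tools may well behave differently. The sketch contrasts a `\b`-based whole-word search with a rough "selection unit" pattern that keeps an apostrophe between letters inside the unit (a loose stand-in for rules WB6/WB7, not the full UAX #29 algorithm). The names `whole_word_found`, `selection_units` and `UNIT` are hypothetical.

```python
import re

RIGHT_QUOTE = "\u2019"    # RIGHT SINGLE QUOTATION MARK, General_Category Pf
MODIFIER_APOS = "\u02bc"  # MODIFIER LETTER APOSTROPHE, General_Category Lm


def whole_word_found(word, text):
    """Naive whole-word search: \\b is a transition between \\w and non-\\w."""
    return re.search(rf"\b{re.escape(word)}\b", text) is not None


# Selection-style units: runs of word characters, optionally joined by one
# apostrophe (U+2019 or U+0027). U+02BC needs no special case here because
# re already counts it as a word character.
UNIT = re.compile(r"\w+(?:[\u2019']\w+)*")


def selection_units(text):
    """Return the units a word-wise cursor move might step over."""
    return UNIT.findall(text)


print(whole_word_found("month", "month" + RIGHT_QUOTE + "s"))
print(whole_word_found("month", "month" + MODIFIER_APOS + "s"))
print(selection_units("last month\u2019s won\u2019t"))
```

In this engine the search for _month_ succeeds with U+2019 but not with U+02BC, since the latter is a letter (Lm) and therefore part of `\w`. The selection pattern keeps both _month’s_ and _won’t_ whole. Note that a plain `\b` search alone cannot tell _month’s_ from _won’t_; it only shows how the boundary depends on the character chosen.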



[For the archive: Please refer to last month’s thread _A new take on the English apostrophe in Unicode_. On the difference between quick cursor movement and double-click selection vs "whole word" search, please refer to my previous e-mails.]

Definitely, word boundaries are found with a whole word search (see UAX #29, again).


Marcel