WORD JOINER vs ZWNBSP
charupdate at orange.fr
Thu Jul 2 04:39:54 CDT 2015
On Tue, Jun 30, 2015, Doug Ewell wrote:
> Khaled Hosny wrote:
> >> On my netbook, which is running Windows 7 Starter, U+2060 is not a
> >> part of any of the shipped fonts.
> > It is a control character, it does not need to have a glyph in the
> > font to be properly supported.
Thank you Khaled, I will respond soon after this.
> The problem is the word "supported." Marcel is seeing a visible glyph (a
> .notdef box) for what is supposed to be an invisible, zero-width
> character, and that is leading him to conclude that Windows doesn't
> "support" this character.
The .notdef box is exactly what I see sometimes in Notepad, and every time in the Word dialogs, when I use U+2060. In the document itself, however, I see a particular glyph: a tall, full-height empty box followed by a wide space to its right, even though the font is proportional. In the Notepad text I see the same box, but without the space. Only when I switch the font to the one you indicate below does the word joiner display correctly in my version of Microsoft Word. Please see the attached screenshots (I wanted to paste them into this e-mail).
> On my Win 7 machine at work, when I enter the string "onetwo"
> ("one\u2060two") and click on either word, both words are selected. That
> is exactly what I would expect WJ to do. This works on the built-in
> Notepad as well as Notepad++ and BabelPad (but not on GoDaddy's
> Web-based email client).
Selection by double-click corresponds to what Richard did with the quick cursor move. These are text-processing features that give little evidence about the presence or absence of word boundaries. So I redid your test, but used the search tool with the "Whole words only" option enabled. This gives an idea of how the application perceives words as entities or, better said, how developers expect users to expect search results. Well, that isn't really a better expression... What I want to say is that what we see is normally what we are expected to expect. Personally, I wouldn't like only a part of a compound to be selected when I most probably want to mark it up as a whole, and neither would you, Doug. This is why a double-click on any spot in the sequence selects the sequence as a whole. By contrast, given that we took care to insert word joiners where we normally aren't expected to (because it is sufficient to type the words one after the other, with nothing in between, to get them as *one* word), the software engineers expect that we wish to join what must remain a sequence of separate words. Consequently, the built-in search engine recognizes each word as a word in its own right.
This is where good software shows its benefits. Some software does not recognize the ZWNBSP or the NBSP (I don't know whether one or both) as indicating the presence of a word boundary, and therefore does not work correctly. That also depends on the PDF conversion tool. Please check the screenshots (I switched the UIs to English wherever possible, that is, in LibreOffice). [This e-mail was blocked because it contained several attached screenshots, so I am resending it without the attached images.]
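To make the two behaviours concrete, here is a minimal Python sketch. It does not reproduce what Word, Notepad or LibreOffice actually do internally; the function names are hypothetical and chosen only for illustration. The first search treats U+2060 like any other non-word character, so each word is found on its own. The second strips format characters (general category Cf) before searching, which is closer to the double-click behaviour and, as far as I understand, to the default word-boundary rules of UAX #29, which ignore format characters.

```python
import re
import unicodedata

WJ = "\u2060"          # WORD JOINER
text = f"one{WJ}two"   # the test string from Doug's message


def naive_whole_word(word, s):
    """Whole-word search using the regex \\b boundary.

    Python's \\w does not match U+2060, so the word joiner
    behaves like a separator here.
    """
    return re.search(rf"\b{re.escape(word)}\b", s) is not None


def strip_format_chars(s):
    """Drop every character whose general category is Cf (format)."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")


def format_ignoring_whole_word(word, s):
    """Whole-word search that ignores format characters first (an assumption
    about how a joiner-aware search might behave, not a documented algorithm)."""
    return naive_whole_word(word, strip_format_chars(s))


print(naive_whole_word("one", text))               # word joiner acts as a boundary
print(format_ignoring_whole_word("one", text))     # "one" is only part of "onetwo"
print(format_ignoring_whole_word("onetwo", text))  # the joined sequence is one word
```

Which of the two a given application implements is exactly what the "Whole words only" test is meant to reveal.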
> But out of more than 500 fonts on that machine, the only stock Microsoft
> fonts that show WJ with zero-width, instead of a .notdef glyph, are
> Javanese Text, Myanmar Text, and Segoe UI Symbol. So while it's
> inaccurate to extrapolate this to "Microsoft doesn't support WJ," the
> font support is definitely lacking.
I wish to thank you personally, Doug, for this very valuable hint. Indeed, in Microsoft Word 2010 Starter on Windows 7 Starter, the WJ is not displayed correctly unless the font is switched to Segoe UI Symbol (which is the one of the three that was shipped with my OS). If the Segoe typeface is not appropriate for the document, we can ask Word to find all instances of U+2060 and replace them with the same character formatted in Segoe UI Symbol. This may be what Word users are expected to do every time, even if that isn't really what we expect of a productivity suite. Perhaps, or most probably, this problem does not occur in other high-end software such as Microsoft Publisher (this needs to be confirmed). But anyone who buys Microsoft Office Premium or Professional should be safe from that malfunction. As should everybody using Microsoft software, in fact.
> The bit about characters being converted to other characters, of course,
> has nothing to do with Windows and everything to do with particular
Based on this hint, I did more tests and found that, for a proper conversion to plain text, any segment that includes U+00A0, U+FEFF or other format characters and is copied from a Microsoft Word document must first be pasted into a LibreOffice document, then copied again, and finally pasted into the text editor. I should avoid venting further about that issue, and I had better wait for official comments. I simply suppose that there is an algorithm (say, as a part of Microsoft Word) that detects where the clipboard item is going and possibly destroys the format characters. I leave everybody to guess to what end...
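For anyone who wants to repeat this kind of test, a small Python sketch like the following can show which invisible characters survive a copy and paste. It is only a diagnostic aid under my own assumptions: the function names are hypothetical, and the sample strings stand in for text saved from the editor before and after pasting. It lists U+00A0 and every character of general category Cf, then counts which of them went missing.

```python
import unicodedata
from collections import Counter


def is_tracked(ch):
    """True for NO-BREAK SPACE and for any format character (category Cf)."""
    return ch == "\u00a0" or unicodedata.category(ch) == "Cf"


def invisible_chars(s):
    """Return (index, code point, name) for each tracked character in s."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(s)
        if is_tracked(ch)
    ]


def lost_in_transfer(original, pasted):
    """Count tracked characters present in original but missing from pasted."""
    before = Counter(f"U+{ord(ch):04X}" for ch in original if is_tracked(ch))
    after = Counter(f"U+{ord(ch):04X}" for ch in pasted if is_tracked(ch))
    return dict(before - after)  # Counter subtraction keeps positive counts only


# Hypothetical sample: what was typed, and what arrived in the text editor.
typed = "a\u00a0b\u2060c\ufeffd"
arrived = "a b cd"

for entry in invisible_chars(typed):
    print(entry)
print(lost_in_transfer(typed, arrived))
```

Comparing the direct paste with the detour through LibreOffice in this way would show exactly which code points are altered on the way.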
Thanks a lot!
[one pasted screenshot]