WORD JOINER vs ZWNBSP

Tue Jun 30 04:25:43 CDT 2015

On Mon, Jun 30, 2015, Richard Wordingham  wrote:

> On Sat, 27 Jun 2015 17:48:41 +0200 (CEST)
> Marcel Schneider  wrote:
> 
> > On Fri, Jun 26, Richard Wordingham wrote:
> > > On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote:
> 
> >>> Still in French, the letter apostrophe, when used as current
> >>> apostrophe, prevents the following word from being identified as a
> >>> word because of the missing word boundary and, subsequently,
> >>> prevents the autoexpand from working. This can be fixed by adding
> >>> a word joiner after the apostrophe, thanks to an autocorrect entry
> >>> that replaces U+02BC inserted by default in typographic mode, with
> >>> U+02BC U+2060.
> 
> >> No, this doesn't work. While the primary purpose of U+2060 is to
> >> prevent line breaks, it is also used to overrule word boundary
> >> detectors in scriptio continua. (It works quite well for
> >> spell-checking Thai in LibreOffice). Its name implies to me that it
> >> is intended to prevent a word boundary being deduced, through the
> >> strong correlation between word boundaries and line break opportunities.
> >> There doesn't seem to be a code for 'zero-width word boundary at
> >> which lines should not normally be broken'.
> 
> > Well, I extrapolated from U+FEFF, which works fine for me, even in
> > this particular context.
> 
> Does the tool misinterpret U+FEFF between Thai characters as a word
> boundary? Incidentally, which tool are you talking of?

I tested on Microsoft Word 2010 Starter running on Windows 7 Starter, on a netbook. Since this edition is based on the full version, its interpretation of U+FEFF should be the standard behavior. I tested in Latin script. If you wish to redo the tests: open a new document, type two words, replace the space between them with the character whose word-boundary behavior is to be checked, and search for one of the two words with the 'whole word' option enabled. If nothing is found, the test character indicates the absence of a word boundary; if the word is found, the test character indicates the presence of one.
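A rough analogue of this test can be run outside Word. The sketch below uses Python's `re` module, whose `\b` is defined purely in terms of `\w` (alphanumerics and underscore); this is an assumption chosen for illustration and is neither Word's logic nor the UAX #29 word-boundary rules, so its verdicts on U+2060 and U+FEFF say nothing about what Word or LibreOffice do. The helper names `whole_word_found` and `probe` are hypothetical.

```python
import re
import unicodedata

def whole_word_found(text, word):
    """True if `word` occurs in `text` between regex word boundaries."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

def probe(separator):
    """Put `separator` between two words and search for the second one."""
    return whole_word_found("alpha" + separator + "beta", "beta")

# Characters discussed in this thread, plus a plain space as a control.
CANDIDATES = ["\u0020", "\u0027", "\u2019", "\u02bc", "\u00a0", "\u2060", "\ufeff"]

if __name__ == "__main__":
    for ch in CANDIDATES:
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),   # general category, e.g. Lm for U+02BC
            unicodedata.name(ch),
            "boundary" if probe(ch) else "no boundary",
        )
```

In this particular engine U+02BC counts as a letter (general category Lm), so the word after it is not found as a whole word, whereas U+0027 and U+2019 do leave a boundary. That matches the symptom described for the letter apostrophe; the result for the format characters is specific to this engine.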

> >> No, this doesn't work.

Right. The letter apostrophe cannot trigger the autocorrect on itself. I must keep U+0027 as the character that is typed, and have it replaced with U+02BC U+FEFF so that autocorrect/autoexpand keeps working for what follows. Or even better, with U+FEFF U+02BC U+FEFF, to make the word boundaries clear on both sides. When there is no autoexpand, weʼll input the apostrophe as U+0027 and the single quotes as U+2018 and U+2019, then replace all U+0027 with U+02BC. In Windows Notepad that works, presumably because the closing quote is not in the equivalence class of the straight apostrophe, so Notepad replaces the U+0027s with U+02BC and leaves the U+2019s alone.
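The replacement step can also be sketched in Python, assuming exact code-point matching (which is what Notepad appears to do here, though that is only a presumption). The function name `to_letter_apostrophe` and its `pad` parameter are hypothetical, introduced only for this illustration.

```python
APOSTROPHE = "\u0027"          # straight apostrophe, as typed
LETTER_APOSTROPHE = "\u02bc"   # MODIFIER LETTER APOSTROPHE
ZWNBSP = "\ufeff"              # ZERO WIDTH NO-BREAK SPACE

def to_letter_apostrophe(text, pad=""):
    """Replace every U+0027 with U+02BC, optionally wrapped in `pad`.

    str.replace matches exact code points, so U+2018 and U+2019
    are left untouched.
    """
    return text.replace(APOSTROPHE, pad + LETTER_APOSTROPHE + pad)

if __name__ == "__main__":
    sample = "\u2018l'arbre\u2019"  # quoted word containing a typed apostrophe
    plain = to_letter_apostrophe(sample)
    padded = to_letter_apostrophe(sample, pad=ZWNBSP)
    # Show the code points so the invisible characters can be checked.
    print([f"U+{ord(c):04X}" for c in plain])
    print([f"U+{ord(c):04X}" for c in padded])
```

Passing `pad=ZWNBSP` gives the U+FEFF U+02BC U+FEFF variant; whether the padding is then read as a word boundary depends entirely on the consuming application.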

Given the instability of U+FEFF but also of U+00A0, as I wrote to Peter Constable a few moments ago, it seems that we are unfortunately reaching the limits of text encoding. The purpose of the encoding design was, if Iʼm well informed, to get readable text files, and to allow users to mark them up for local printing or PDF conversion. Other usages must have been left out of scope, because today you cannot exchange and process plain text files as you may wish. As soon as you must use plain text as raw material for publishing, convert British English quotation marks to US English quotation marks, run searches that include single quotes, input text (especially with leading apostrophes) on keyboards with legacy drivers, and perhaps a few things more, there seems to be no other solution than to use workarounds and to process by hand: look up and correct or convert the instances one by one.

The nice thing about this is that you become a craftsman again, that you get in touch with the text, and that you may feel like a linotypist or a lead typesetter who takes care of every detail. As a result, the professions of proofreader, typesetter and typographer will not disappear (as was feared), and good craftsmanship will keep thriving.

Another side effect is that the need to process text files by hand lowers the appeal of copying other peopleʼs work. Itʼs even harder when copying text from a PDF file. Sometimes you get whole paragraphs in ready-to-use plain text (leaving aside the NBSPs), and sometimes (e.g. from TUS) itʼs all in small pieces and you need to delete a lot of undue line breaks, as well as to change the case of the character identifiers because their uppercase was just small-caps formatting. In the end you may prefer to provide links to the content, but unfortunately there seems to be no way to copy bookmarks, so you need to browse the contents and are likely to learn much more along the way.

If all this was the goal, letʼs say it loud. Then this was a good idea. Very good.

Regards,
Marcel Schneider