NNBSP (was: A last missing link for interoperable representation)

Philippe Verdy via Unicode unicode at unicode.org
Thu Jan 17 05:21:56 CST 2019


On Thu, 17 Jan 2019 at 05:01, Marcel Schneider via Unicode <
unicode at unicode.org> wrote:

> On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:
> >
> > On Tue, 15 Jan 2019 13:25:06 +0100
> > Philippe Verdy via Unicode <unicode at unicode.org> wrote:
> >
> >> If your fonts behave incorrectly on your system because it does not
> >> map any glyph for NNBSP, don't blame the font or Unicode about this
> >> problem, blame the renderer (or the application or OS using it; maybe
> >> they are very outdated and were not aware of these features, they
> >> are probably based on old versions of Unicode when NNBSP was still
> >> not present even if it was requested since very long at least for
> >> French and even English, before even Unicode, and long before
> >> Mongolian was then encoded, only in Unicode and not in any known
> >> supported legacy charset: Mongolian was specified by borrowing the
> >> same NNBSP already designed for Latin, because the Mongolian space
> >> had no known specific behavior: the encoded whitespaces in Unicode
> >> are completely script-neutral, they are generic, and are even
> >> BiDi-neutral, they are all usable with any script).
> >
> > The concept of this codepoint started for Mongolian, but was generalised
> > before the character was approved.
>
> Indeed it was proposed as MONGOLIAN SPACE <MSP> at block start, which was
> consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
> more.


But the French "espace fine insécable" was requested long before Mongolian
was discussed for encoding in the UCS. The problem is that the initial rush
for French came in a period when Unicode and ISO were competing and not in
sync, so no agreement could be found until there was a decision to merge
the efforts. The early rush on the ISO side still used not a character
model but a glyph model, with little desire to support multiple
whitespaces; on the Unicode side, there was initially no desire to encode
all the languages and scripts, the focus being only on unifying the
existing vendor character sets already implemented by a limited set of
proprietary vendors (notably IBM, Microsoft, HP, Digital), plus a few of
the charsets registered in IANA, including the existing ISO 8859-*, GBK,
and some national or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at that
time, but it was still using another, unrelated technology). Font standards
did not exist yet and competing formats were incompatible; everything was a
mess at that time, so publishers were still required to use proprietary
software solutions with very low interoperability (the only "standard" then
was PostScript, which needed no character encoding at all, but only encoded
glyphs!).

If publishers had been involved, they would have revealed that they all
needed various whitespaces for correct typography (i.e. layout). Type
founders themselves did not care about whitespace because it had no value
for them (no glyph to sell). Adobe's publishing software was then
completely proprietary (just like Microsoft's and others' such as Lotus and
WordPerfect). Years ago I was working for the French press, and they
absolutely required us to manage the [FINE] for use in newspapers,
classified ads, articles, guides, phone books, dictionaries. It was even
mandatory to enter these [FINE] in the composed text, and they trained
their typists and ad sellers to use it (that character was not "sold" in
classified ads; it was necessary for correct layout, notably in narrow
columns, and not using it confused readers, notably around the ":" colon).
It had to be non-breaking, non-expanding under justification, narrower than
digits and even narrower than the standard non-justified whitespace, and it
was consistently used as the digit-grouping separator in numbers.
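
For readers who have never handled the "fine", here is a minimal Python
sketch (my own illustration, not any standard API) of the two uses just
described: U+202F NARROW NO-BREAK SPACE placed before tall punctuation
such as the colon, and used as the digit-grouping separator.

    NNBSP = "\u202F"  # U+202F NARROW NO-BREAK SPACE, the encoded "fine"

    def fine_before_punctuation(text: str) -> str:
        # Replace an ordinary space before ; : ! ? with the narrow
        # no-break space, so the mark never wraps alone to the next line.
        for mark in (";", ":", "!", "?"):
            text = text.replace(" " + mark, NNBSP + mark)
        return text

    def group_digits(n: int) -> str:
        # Format an integer with the "fine" as digit-grouping separator.
        return f"{n:,}".replace(",", NNBSP)

    print(fine_before_punctuation("Prix : 10 euros"))  # "Prix\u202f: 10 euros"
    print(group_digits(1234567))                       # "1\u202f234\u202f567"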

But at that time the most common OSes did not support it natively because
no vendor charset supported it (and in fact most OSes were still unable to
render proportional fonts everywhere and were frequently limited to 8-bit
encodings: DOS, Windows, the various Unixes, and even Linux at its early
start). So an intermediate solution was needed. The US chose not to use the
non-breaking thin space at all, because in English it was not needed for
basic Latin, but also because of the huge prevalence of 7-bit ASCII for
everything (including its own national symbol for "$", competing with other
ISO 646 variants). There were tons of legacy applications developed over
decades that did not support anything else, and interoperability in the US
was available only with ASCII; everything else was unreliable.

If you remember the early years when the Internet started to develop
outside the US, you remember the nightmare of non-interoperable 8-bit
charsets and the famous "mojibake" we saw everywhere. Then the competition
between ISO and Unicode lasted too long. But it was considered "too late"
for French to change anything, and Windows, used in so many places by so
many users, promoted the use of the Windows-1252 charset (which had a few
updates before it was frozen definitively: there was no place for NNBSP in
it). Typographers and publishers were upset: to use the NNBSP they still
needed proprietary *document* encodings. The W3C did not help much either
(it took a long time to finally adopt the UCS as a mandatory component of
HTML; before that, HTML depended only on the old IANA charset registry,
which promoted only the work of vendors and a few ISO standards).
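
As a reminder of what that charset chaos looked like in practice, a small
Python sketch (an illustration of my own, with an arbitrary pair of
charsets from that era): the same bytes read under two different 8-bit
assumptions.

    # French text written in ISO 8859-1 but displayed on a DOS machine
    # assuming code page 437: the classic mojibake of that period.
    original = "déjà vu"
    raw = original.encode("latin-1")
    garbled = raw.decode("cp437")
    print(garbled)   # prints "dΘjα vu" instead of "déjà vu"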

France itself wanted to keep its own national variant of ISO 646 (inherited
from telegraphic systems), but it was finally abandoned: everybody was
already using Windows-1252 or ISO 8859-1 (even early Unix adopters, which
used a preliminary version made by Digital/DEC, then promoted by X11), or
otherwise used Adobe's proprietary encodings. Unix itself had no single
standard (there were many different variants, shared with other OSes for
industrial or accounting systems, made notably by IBM, which created almost
one variant per submarket, sometimes several in the same country, each time
split between those based on ASCII and those based on EBCDIC...)

The truth is that publishers were forgotten, because their commercial
market was much narrower: each publisher then used its own internal
conventions. Even libraries used their own classifications. There was no
attempt to unify the needs of publishers (working at document level) and
data processors (including OSes). This effort started only very late, when
the W3C finally started to work seriously on fixing HTML and making it more
or less interoperable with SGML (promoted by publishers). But at national
level there were still lots of other competing standards (remember
teletext, including the Minitel terminal and Antiope for TV). People at
home did not have access to any system capable of rendering proportional
fonts. All early computers for personal use were based on fixed-width 8-bit
fonts (including in Japan). China and Korea were not yet as technologically
advanced as they are today (there were some efforts, but they were costly
and brought little return at that time).

The adoption of the UCS has been extremely long, and it is still not
completely finished, even if its support is now mandatory in all new
computing standards and their revisions. The last segment where it still
resists is the mobile phone industry (how can SMS be so restricted, so
non-interoperable, and so inefficient?)

So French has a long tradition for its "fine"; its support was requested
long ago but constantly ignored by the vendors making "the" standard.
Publishers themselves resisted the adoption of the web as a publishing
platform: they preferred their legacy solutions as well and did not care
much about interoperability, so they did not pressure the standard makers
enough to adopt the "fine". The same happened in the US. There was no
"commercial" incentive to adopt it and little money coming from that sector
(which has since suffered a lot from the loss of advertising revenue, the
competition of online publishers, the explosion of paper costs, and also
from the huge level of piracy on the Internet, which reduced their sales
and then their effective measured audience; the same is happening now in
the TV and radio market; and on the Internet the advertising market has
been heavily concentrated and its revenues are less and less balanced;
photographers and reporters now have difficulty living from their work).

And there is little incentive now for creating quality products: so many
products are developed and distributed very fast, and not enough people
care about quality or will pay for it. The good old practices of
typographers and publishers are most often ignored; they look "exotic" or
"old-fashioned", and so many people now say they are "not needed" (just as
they will say that supporting multiple languages is not necessary).