Marcel Schneider via Unicode
unicode at unicode.org
Fri Jan 18 09:27:17 CST 2019
On 17/01/2019 20:11, 梁海 Liang Hai via Unicode wrote:
> [Just a quick note to everyone that, I’ve just subscribed to this public list, and will look into this ongoing Mongolian-related discussion once I’ve mentally recovered from this week’s UTC stress. :)]
Welcome to Unicode Public.
Hopefully this discussion helps sort things out so that we’ll know both what to do wrt Mongolian and what to do wrt French.
On Jan 17, 2019, at 11:06, Asmus Freytag via Unicode <unicode at unicode.org> wrote:
> On 1/17/2019 9:35 AM, Marcel Schneider via Unicode wrote:
>> [On 17/01/2019 12:21, Philippe Verdy via Unicode wrote:]
>>> [quoted mail]
>>> But the French "espace fine insécable" was requested long, long before Mongolian was discussed for encoding in the UCS. The problem is that the initial rush for French was made in a period where Unicode and ISO were competing and not in sync, so no agreement could be found, until there was a decision to merge the efforts. The early rush was in ISO still not using any character model but a glyph model, with little desire to support multiple whitespaces; on the Unicode side, there was initially no desire to encode all the languages and scripts, focusing initially only on trying to unify the existing vendor character sets which were already implemented by a limited set of proprietary vendor implementations (notably IBM, Microsoft, HP, Digital) plus a few of the registered charsets in IANA including the existing ISO 8859-*, GBK, and some national standard or de facto standards (Russia, Thailand, Japan, Korea).
>>> This early rush did not involve typographers (well there was Adobe at this time but still using another unrelated technology). Font standards were still not existing and were competing in incompatible ways, all was a mess at that time, so publishers were still required to use proprietary software solutions, with very low interoperability (at that time the only "standard" was PostScript, not needing any character encoding at all, but only encoding glyphs!)
>> Thank you for this insight. It is a still untold part of the history of Unicode.
> This historical summary does *not* square in key points with my own recollection (I was there). I would therefore not rely on it as if gospel truth.
> In particular, one of the key technologies that _brought industry partners to cooperate around Unicode_ was font technology, in particular the development of the /TrueType/ Standard. I find it not credible that no typographers were part of that project :).
It is probably part of the (unintentional) unfounded accusations spread by the cited author’s paper. My apologies for not sufficiently assessing the reliability of my sources. I’d already identified a number of errors but wasn’t savvy enough to see the other one reported by Richard Wordingham. Now the paper ends up as a mere libel. It doesn’t mention the lack of NNBSP; instead it piles up a bunch of gratuitous calumnies. Should that be the prevailing mood of average French professionals with respect to Unicode (indeed Patrick Andries is the only French tech writer on Unicode I found whose work is acclaimed; the others are either disliked or silent, or libellists), then I understand all the better why a significant majority of UTC hates French.
Francophobia is also palpable in Canada, beyond any technical reasons, especially in the IT industry. Hence the position of UTC is far from isolated. If ethical and personal considerations influence decision-making, they should consistently be an integral part of discussions here. In that vein, I’d mention that at the time when Unicode was developed, there was a global hatred against France that originated in French colonial and foreign policy since WWII, and was revived a few years earlier by the French government sinking the Rainbow Warrior and killing the crew’s photographer, in the port of Auckland. That crime triggered a peak of anger.
> Covering existing character sets (National, International and Industry) was _an_ (not "the") important goal at the time: such coverage was understood as a necessary (although not sufficient) condition that would enable data migration to Unicode as well as enable Unicode-based systems to process and display non-Unicode data (by conversion).
I’d take this as a touchstone to infer that there were actual data files including standard typographic spaces as encoded in U+2000..U+2006, and electronic table layout using these: “U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period.”
Is that correct?
> The statement: "there was initially no desire to encode all the languages and scripts" is categorically false.
Though Unicode was designed as being limited to 65 000 characters, and it was stated that historic scripts were out of scope: only living scripts should be encoded, for interchange.
> (Incidentally, Unicode does not "encode languages" - no character encoding does).
In an often used sense every “language” has its “alphabet”, although one does not currently refer to Latin as multiple scripts.
> What has some resemblance of truth is that the understanding of how best to encode whitespace evolved over time. For a long time, there was a confusion whether spaces of different width were simply digital representations of various metal blanks used in hot metal typography to lay out text. As the placement of these was largely handled by the typesetter, not the author, it was felt that they would be better modeled by variable spacing applied mechanically during layout, such as applying indents or justification.
Indeed it is stated that the multiple typographic spaces that made it into the Standard were not used in electronic typesetting and layout.
> Gradually it became better understood that there was a second use for these: there are situations where some elements of running text have a gap of a specific width between them, such as a figure space, which is better treated like a character under authors or numeric formatting control than something that gets automatically inserted during layout and rendering.
There seems to be a confusion about the figure space. What is this space really for?
* The Unicode Standard hints that it was used to fill up empty positions in numeric tables.
* The Unicode Line Break Algorithm UAX #14 understands that it is the group separator, although as such it is neither SI- and ISO 80000-conformant nor implemented in CLDR. (Fortunately it is not, given it isn’t SI/ISO compliant, but it would have been a better pick than NBSP, because unlike NBSP, it is not justifying.)
As you were there, did you see or hear how it happened that, well, FIGURE SPACE (U+2007) was declared non-breakable, and how it happened that at the same time, PUNCTUATION SPACE (U+2008) was not declared non-breakable?
Hint: Was it understood (certainly it was) that a non-breakable PUNCTUATION SPACE would have been the “espace fine insécable” (narrow no-break space) that the French users of character sets were longing for?
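For readers who want to check the asymmetry between the two spaces themselves, here is a minimal sketch in Python. It assumes only the standard `unicodedata` module; the helper name `describe` is made up for illustration. The module does not expose the Line_Break classes of UAX #14 (GL for U+2007 and U+202F, BA for U+2008), but the compatibility decomposition tag in UnicodeData.txt reflects the same distinction: `<noBreak>` versus `<compat>`.

```python
import unicodedata

# The spaces discussed in this thread.
SPACES = {
    "NO-BREAK SPACE": "\u00a0",
    "FIGURE SPACE": "\u2007",
    "PUNCTUATION SPACE": "\u2008",
    "NARROW NO-BREAK SPACE": "\u202f",
}


def describe(ch):
    """Return (general category, compatibility tag) for one character.

    Hypothetical helper: the tag is the first token of the
    decomposition field, e.g. "<noBreak>" in "<noBreak> 0020".
    """
    decomposition = unicodedata.decomposition(ch)
    tag = decomposition.split()[0] if decomposition else ""
    return unicodedata.category(ch), tag


for name, ch in SPACES.items():
    category, tag = describe(ch)
    print(f"U+{ord(ch):04X} {name}: Gc={category}, tag={tag}")
```

All four characters share General_Category Zs; only PUNCTUATION SPACE lacks the `<noBreak>` tag.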
> Other spaces were found best modeled with a minimal width, subject to expansion during layout if needed.
> There is a wide range of typographical quality in printed publication. The late '70s and '80s saw many books published by direct photomechanical reproduction of typescripts. These represent perhaps the bottom end of the quality scale: they did not implement many fine typographical details and their prevalence among technical literature may have impeded the understanding of what character encoding support would be needed for true fine typography.
By that time, electronic typewriters became widespread, featuring interchangeable fonts (on type wheels), proportional advance width (for use with appropriate fonts), and bold weight (by double-typing with a tiny offset). Additionally some models had an input buffer with a linear LCD display, mitigating the expense in correction ribbon as typewriters became more and more popular.
With ordinary typewriter spacing, the narrow space was not a demand, but with proportional advance width that could have changed.
Do you remember the ratio fixed width / proportional width in the photomechanically reproduced printed matters you are referring to?
How were typewriters with proportional width shaping the perception of typography in general, and of whitespace in particular, among the authors of Unicode?
*Fine typography:* There is a current misunderstanding of “fine typography” with respect to the NARROW NO-BREAK SPACE. The use of this character *is not* part of fine typography. It is simply part of the ordinary digital representation of the French language. To declare NNBSP as belonging to “fine typography” is to make it optional. In French and in languages grouping digits with spaces, NNBSP is not optional, it *is mandatory.* In the current state of Unicode, NNBSP is the only usable space for the purpose of grouping digits and of spacing off French punctuation (except some old-style French layout of the colon).
That space would be *PSP* (PUNCTUATION SPACE) **if** Unicode had made it non-breakable. In that case, the *MONGOLIAN SPACE (MSP) would eventually have been encoded, or rather the *MONGOLIAN SUFFIX CONNECTOR (MSC), for the purpose of particular shaping. If the *MONGOLIAN SPACE had actually been encoded, it would be tailorable ad libitum, and Unicode could change its properties as desired (referring to a proposed change of General category of NNBSP from Zs to Cf, and/or of line-breaking class from GL to BB IIRC).
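To make the plain-text use of NNBSP concrete, here is a rough sketch in Python of the two uses named above: grouping digits and spacing off French punctuation. The function names `group_digits` and `space_french_punctuation` are invented for this illustration, and the punctuation rule is deliberately naive (it covers only `;`, `!` and `?`, leaves the colon alone, and would also touch a `?` inside a URL); it is not meant as a description of how CLDR or any word processor does it.

```python
import re

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def group_digits(n: int) -> str:
    """Group an integer in threes, separated by NNBSP (illustrative)."""
    # Format with commas first, then swap each comma for NNBSP.
    return f"{n:,}".replace(",", NNBSP)


def space_french_punctuation(text: str) -> str:
    """Put NNBSP before ; ! ? replacing an ordinary space if present."""
    return re.sub(r" ?([;!?])", NNBSP + r"\1", text)


print(group_digits(1234567))
print(space_french_punctuation("Vraiment ?"))
```

Because NNBSP is non-breaking and of fixed width, the grouped number and the punctuation mark stay attached to their neighbours when the line is wrapped or justified.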
> At the same time, Donald Knuth was refining TeX to restore high quality digital typography, initially for mathematics.
That is very interesting and certainly worth noting here, but it cannot be stressed enough how off-topic this is to this thread; it takes us away from the matter we’re actually discussing, that is, writing Mongolian and French in a functional way, also in plain text. Again, NNBSP is not fine typography and it has nothing to do with high-quality typography. NNBSP is simply a matter of not ending up with messy text. Not to use NNBSP is to mess up the text.
> However, TeX did not have an underlying character encoding; it was using a completely different model mediating between source data and final output. (And it did not know anything about typography for other writing systems).
> Therefore, it is not surprising that it took a while and a few false starts to get the encoding model correct for space characters.
Isn’t that overstating the complexity of whitespaces in Unicode?
As seen from today, getting it right was as simple as giving the same GL class to both spaces allegedly encoded for tabular typesetting, but readily repurposed.
As it is, PUNCTUATION SPACE is a totally useless duplicate encoding, until/unless proven otherwise.
> Hopefully, we’ll complete our understanding and resolve the remaining issues.
That is a great promise. Hopefully you are being backed by UTC in making it!
P. S.: The name of the Greenpeace flagship has been typeset in italics thanks to Andrew West’s online utility, out of respect for the organization, and with implicit reference to parent and sibling threads. //Please don’t interpret this gesture as backing demands for Unicode representation of italics.//
We’re (at least I’m) actually trying to understand in more detail why UTC is struggling against NNBSP as a space (thinking of changing its Gc to Cf), while at encoding time, UTC prompted Mongolian OPs to refrain from requesting a dedicated Mongolian Space rather than shifting the new space into General Punctuation for other scripts’ joint convenience.
Admittedly, French has been the only language to make extensive use of it – a highly partial impression given that many other locales are using a space to group digits, and that space is then mandatorily NNBSP, anything else being highly unprofessional.
So we’ll look even harder at the new TUS text wrt NNBSP in Mongolian, that Richard Wordingham drew our attention to, and we’d like to understand the role of UTC acting in favor of or against NNBSP, possibly with various antagonistic components within UTC.