Request for Information

Fri Jul 25 10:49:13 CDT 2014

On 07/24/2014 06:45 PM, Whistler, Ken wrote:
> Fantasai asked:
>
>> I would like to request that Unicode include, for each writing system it
>> encodes, some information on how it might justify.
>
> Following up on the comment and examples provided by Richard
> Wordingham, I'd like to emphasize a relevant point:
>
> Scripts may be used for *multiple* (different) writing systems.

Hence the use of "for each writing system" rather than "for each
script" in the sentence you quote above.

Also, from a practical perspective, the systems for which this
information would be *really* useful for Unicode to provide are
the lesser-used systems (like Javanese), which are tied to only
a few languages and therefore belong to only a handful of writing
systems with very little variation.

> Rules for justification of text are aspects of writing systems,
> orthographies, and typographical conventions -- and are not
> inherent properties of scripts.
>
> So while there may be strong tendencies for certain scripts to
> fall into certain typographical practices, including behavior for
> text justification, I don't think that information is inherent
> to scripts per se. And it would be misleading and gardenpathy
> for the Unicode Standard to try to treat justification as
> somehow inhering to scripts.

Sure, but the practice of using spaces to separate words is, by
your same argument, not a property of the script, but of the
writing system. However, this information--whether a script is
typically used with or without word separation--is often included
in the Unicode standard’s description of that script.

To take another example, Unicode defines a set of line breaking
conventions for UAX14's default rules. However, these could also
be argued to be part of the writing system and not an inherent
property of the script. What's chosen for UAX14 is the common
case, and where there are multiple common cases UAX14 calls
them out as possible tailorings.

Are you arguing that all of this information should be removed from
Unicode?

> I think it would make more sense to turn fantasai's query on its
> head, as it were: First categorize what kinds of systems of
> justification there are, and then start filling in, from best
> understood out to the fringes of knowledge of practice, what
> writing systems (using what script or combination of scripts)
> are attested as regularly using each system. Lacunae are
> inevitable, however.

Justification systems typically expand or compress spaces,
and when that fails (the spaces becoming too small or too large,
where the tolerances vary widely per writing system), fall back to
"letter-spacing". The interaction of different levels of
justification (e.g. spaces vs. letter-spacing) depends on
the justification algorithms, and the tolerances for spacing
adjustments depend on the writing system and the quality of
the typesetter.
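
A minimal sketch of that fallback in Python, for the expansion case
only. The function name, the Adjustment type, and the single
per-space tolerance are assumptions made for illustration; real
justification algorithms are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Adjustment:
    space_extra: float   # extra width added to each stretchable space
    letter_extra: float  # extra width added to each inter-letter gap


def distribute_slack(slack, n_spaces, n_letter_gaps, max_space_extra):
    """Split `slack` (line width minus natural text width) between
    spaces and letter-spacing.

    Spaces absorb the slack first; only what exceeds the per-space
    tolerance `max_space_extra` (hypothetical, writing-system
    dependent) spills over into letter-spacing.
    """
    if slack <= 0:
        return Adjustment(0.0, 0.0)  # compression is not modelled here
    to_spaces = min(slack, n_spaces * max_space_extra)
    remainder = slack - to_spaces
    space_extra = to_spaces / n_spaces if n_spaces else 0.0
    letter_extra = remainder / n_letter_gaps if n_letter_gaps else 0.0
    return Adjustment(space_extra, letter_extra)
```

With no spaces on the line (n_spaces == 0), all of the slack goes to
letter-spacing, which is the situation a writing system without word
separation would be in.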

It is my observation that systems with fewer spaces are more
tolerant of letter-spacing.

Some data-related questions here are:

   1. The frequency of spaces in that writing system.
      This is strongly related to whether stretchable spaces are
      used for word separation, phrase separation, or neither,
      and whether they are used around common or rare punctuation.

      This information is noted for some scripts in the Unicode
      standard, but it is irregularly considered. Many chapters
      make no mention of whether and how spaces are used.

      (For example, it would be nice if the standard mentioned
      whether punctuation like the Javanese pada lingsa is expected
      to be followed by a space character, so that font makers,
      layout engineers, and typists can coordinate accordingly to
      create the appropriate amount of white space on the screen.)

   2. Which characters are "separable" for justification.
      Some languages (like German) may suppress such separation.
      And the rules for determining separable "clusters" can be
      language and/or font-dependent.
      However, it can be said with certainty that Latin letters,
      for example, are separable, whereas Arabic letters are not.

      This information is mostly represented in UAX29, with the
      exception that there's no really clear information on
      which scripts are "cursive" (have inseparable grapheme
      clusters).
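
The two data points above can be counted for a given line roughly as
follows. This is only a sketch: the function name is made up, and
treating "a base character plus any following marks" as a cluster is
a crude stand-in for UAX29 segmentation that knows nothing about
cursive joining, so its answer for Arabic would be wrong.

```python
import unicodedata


def count_opportunities(line):
    """Return (spaces, cluster_gaps) for one line of text.

    Clusters are approximated as a base character plus any following
    characters of general category M* (marks). This is an assumption
    for illustration, not the UAX29 algorithm.
    """
    spaces = sum(1 for ch in line if ch == " ")
    clusters = sum(1 for ch in line
                   if not unicodedata.category(ch).startswith("M"))
    # Gaps between adjacent clusters are the candidate letter-spacing points.
    return spaces, max(clusters - 1, 0)
```

Comparing the two counts across sample text gives a rough feel for
how much a writing system would have to lean on letter-spacing.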

There are exceptional systems:
   - Arabic can use cursive elongation for justification.
   - Japanese and Chinese can compress the inherent "spaces"
     within the full-width glyphs of certain punctuation.
   - Tibetan can use tsek marks as filler for justification.
     (Which is, by the way, discussed *extensively* in the Unicode
     standard, so you can't tell me that the Unicode Consortium
     considers notes on common justification practices to be out
     of scope.)

> I think it is just a mistake to assume from a query on the Script
> property identity of a character, what justification rule should
> apply to it in text.

I think when you have no further context, it is better to have
a guess informed by the character properties than one completely
ignorant of them.
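
As a sketch of what such an informed guess could look like: the
table below is hypothetical, keyed by ISO 15924 script code, and its
entries only restate the practices mentioned earlier in this message.
It is not data published by Unicode, and determining the script of
the text is assumed to happen elsewhere.

```python
# Hypothetical defaults; entries restate practices named in this
# thread and are not taken from any Unicode data file.
DEFAULT_GUESS = {
    "Latn": "spaces, then letter-spacing",
    "Arab": "cursive elongation",
    "Tibt": "tsek filler",
    "Hani": "punctuation compression",
}


def guess_justification(script_code):
    """Return a default guess for a script, or "unknown" if none is recorded."""
    return DEFAULT_GUESS.get(script_code, "unknown")
```

A caller with language or document-level information would override
the guess; the table only matters when there is no further context.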

> Note also that for many scripts there is no established modern
> typographical practice, so it is basically unknown or meaningless
> to ask what the justification rules are for it. Modern typographers
> setting old material will eventually make up the rules, and those
> will *become* the answer, but the Unicode Consortium cannot
> look at pictures of fragmentary Byzantine seals or fragments of
> papyri and *determine* what some normative (or even informative)
> property of justification should be for the script in such a
> record.

Right, so as I mentioned, "We don't know" is an acceptable answer.
At least then I can assume that my best guess is equivalent to
the state of the art. :)

~fantasai