Choosing the Set of Renderable Strings

Richard Wordingham via Unicode unicode at
Mon May 14 14:31:15 CDT 2018

On Mon, 14 May 2018 04:12:56 -0800
James Kass via Unicode <unicode at> wrote:

> In response to William Overington's post, it's easier to transcode
> data from a PUA scheme into Unicode than it is to enter the data from
> scratch.  (The same could be said for a customized ASCII font.)  Some
> users may not wish to wait even the handful of years it took for
> mainstream Indic complex scripts to be rendered properly.
> At this phase of Unicode's progress, however, we shouldn't encourage
> the interchange of such PUA data.  Since it's simple to transcode, any
> such data should be transcoded prior to interchange or permanent
> storage.  

> Recipients lacking systems supporting proper Unicode
> rendering for complex scripts such as Tai Tham could then transcode it
> to the PUA scheme for display/printing purposes.

The PUA scheme would be roughly equivalent to the glyph sequence
produced by the shaper. (The ccmp feature is in general not available
for the PUA, though CSS allows its use to be forced.)  However, there
would be no extra channels, such as the component-mark association often
needed for some cursive scripts. For example, in ᨣᩩ᩠ᨿ <LOW KA, SIGN U,
SAKOT, LOW YA> 'to direct', SIGN U may be realised as a mark below
left, a mark below <SAKOT, LOW YA>, or a spacing mark on the right of
<SAKOT, LOW YA>.  One could argue that the three positions require
different glyphs for SIGN U.  Each font would need its own PUA.
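
As a rough illustration of why transcoding out of such a scheme is the
easy direction, here is a minimal Python sketch. The PUA code points
U+E000..U+E004 and the glyph inventory are purely hypothetical and are
not taken from any real font, and the sketch assumes the PUA text keeps
its glyphs in logical order; a scheme that stored visual order would
need reordering as well.

```python
# Hypothetical PUA assignments for one imaginary font (assumption, not a real scheme).
PUA_TO_UNICODE = {
    "\uE000": "\u1A23",        # glyph for LOW KA
    "\uE001": "\u1A69",        # glyph for SIGN U, mark below left
    "\uE002": "\u1A69",        # glyph for SIGN U, mark below the subscript
    "\uE003": "\u1A69",        # glyph for SIGN U, spacing mark on the right
    "\uE004": "\u1A60\u1A3F",  # glyph for subscript LOW YA, i.e. <SAKOT, LOW YA>
}


def pua_to_unicode(text: str) -> str:
    """Replace each PUA glyph code with the characters it stands for.

    Characters outside the table are passed through unchanged.
    """
    return "".join(PUA_TO_UNICODE.get(ch, ch) for ch in text)


# <LOW KA, SIGN U, SAKOT, LOW YA> in logical order.
LOGICAL = "\u1A23\u1A69\u1A60\u1A3F"

# Three PUA spellings that differ only in which SIGN U glyph was chosen.
VARIANTS = ["\uE000\uE001\uE004", "\uE000\uE002\uE004", "\uE000\uE003\uE004"]

for variant in VARIANTS:
    print(pua_to_unicode(variant) == LOGICAL)
```

The mapping is many-to-one, so it has no simple inverse: going from
Unicode back to the PUA scheme means choosing among the three SIGN U
glyphs, which is in effect the shaper's job.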

> An OpenType font, a keyboard driver, and a text conversion utility
> might go a long way towards supporting complex scripts for users whose
> systems cannot otherwise currently support them.

This is where Apple had the right idea, though one that is difficult
to implement, and where the OTL paradigm is deficient. There are several
places in Tai Tham layout where I want to swap glyphs round, but for
the layout engine to do so for me would cause grief for other Tai Tham
fonts. This rearrangement cannot be delegated to the rendering
engine.  There are Tai Tham fonts which handle Indic rearrangement in
the ccmp feature, but they are then totally defeated by either ccmp not
being enabled or by the USE doing basic Indic shaping.

There are now two approaches for Tai Tham - (1) fix the USE or
restore/create a separate shaper for scripts with CVC... aksharas, and
(2) overcome the USE in the font. For the latter I need to make the
work-arounds in Da Lekh easier to copy.  I have transferred them to Ed
Trager's Hariphunchai font, yielding Lamphun, but Lamphun still needs
some further revision to the positioning logic. It wasn't as
complete as I'd hoped.  I've done a quick fix for the vowels below, but
I suspect much more work is needed to conform to the spirit of the
Hariphunchai font. I could do with someone artistic to help with the
combinations of NYA and subscript consonant such as NY.CA, and Pali
LL.HA is currently a disaster.

On Track 1, there's also more tinkering to do, such as making MEDIAL LA
and MEDIAL RA 'consonant subscript' rather than 'consonant medial'.
/lw/ is an allowed onset in the Tai languages using the Tai Tham
script, so we get orthographic onset <hlw-> with MEDIAL LA in the West.
The main problem is that we do not have characters *MEDIAL WA and
*MEDIAL YA - the general subscript WA and YA are used instead, and these can
function as matres lectionis.  (In Unicode Khmer, the matres lectionis
have been reanalysed as vowels.)

I think it would also help to make SIGN AA and SIGN TALL AA into
letters as far as the USE is concerned. The default grapheme
segmentation rules already treat them like letters. The possible
downside is that so doing might mess up some fonts.

> A good keyboard driver should be able to remove some of the burden off
> of the OpenType tables, enabling multiple
> fonts covering the same script to be used without having bloated and
> redundant OpenType tables, by offering some degree of control over the
> actual character strings which are being stored (and presented to the
> font for rendering).

It won't work.  The text input delivered by X still needs to be
supported, and without modifying the application, X can only input one
character at a time.  Not everyone uses an 'input method'.

> (Many font developers might consider that any kind of normalization
> should be handled at input rather than left up to the font.  Keyboard
> developers might have a different idea, though.)

Apparently, Hangul input should not be canonically normalised in South
Korea. I've seen an implementation of the USE render canonically
equivalent strings differently.  It wouldn't be HarfBuzz - it
normalises, as we saw when it briefly messed up Tai Tham rendering when
it swapped <tone, SAKOT> to <SAKOT, tone>.  That was rapidly fixed to
normalise the other way round.
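
The swap follows from canonical ordering, which sorts adjacent
combining marks by canonical combining class. A minimal Python sketch
using the standard unicodedata module shows the effect; the test
string is only an illustration and is not claimed to be a real word.

```python
import unicodedata

LOW_KA = "\u1A23"  # TAI THAM LETTER LOW KA
LOW_YA = "\u1A3F"  # TAI THAM LETTER LOW YA
SAKOT = "\u1A60"   # TAI THAM SIGN SAKOT
TONE_1 = "\u1A75"  # TAI THAM SIGN TONE-1


def combining_classes(text: str) -> list[int]:
    """Return the canonical combining class of each character."""
    return [unicodedata.combining(ch) for ch in text]


def names(text: str) -> list[str]:
    """Return the Unicode character names, for readable output."""
    return [unicodedata.name(ch) for ch in text]


# Illustrative input with the tone mark placed before SAKOT.
typed = LOW_KA + TONE_1 + SAKOT + LOW_YA
normalised = unicodedata.normalize("NFC", typed)

print(combining_classes(typed))
print(names(typed))
print(names(normalised))
```

SAKOT has a lower combining class than the tone mark, so any
normalising step moves it in front of the tone, and a font written for
<tone, SAKOT> never sees that order unless the shaper puts it back.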

I'd completely forgotten that Thai, Lao and Tai Tham tone marks had
different combining classes.  However, in Northern Thai,
<TONE-1...TONE-2> and <TONE-2...TONE-1> seem to render the same, so
normalisation might not be relevant.  Unsurprisingly, that's the only
pair of tone-marks I've seen in the same akshara, so I don't know how
the other pairs of distinct tone marks combine.  A pair arises when two
chained syllables have different tone marks.  If they have the same
tone mark, one is suppressed.
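
The combining classes can be checked with a short Python sketch using
the standard unicodedata module. The choice of MAI EK as the sample
Thai and Lao tone mark, and the bare two-mark test strings, are
assumptions made only for illustration.

```python
import unicodedata

# One sample tone mark per script, plus the two Tai Tham marks discussed.
TONE_MARKS = {
    "THAI CHARACTER MAI EK": "\u0E48",
    "LAO TONE MAI EK": "\u0EC8",
    "TAI THAM SIGN TONE-1": "\u1A75",
    "TAI THAM SIGN TONE-2": "\u1A76",
}

# Canonical combining class of each mark.
CLASSES = {name: unicodedata.combining(ch) for name, ch in TONE_MARKS.items()}

for name, ccc in CLASSES.items():
    print(f"{name}: ccc={ccc}")

# Marks with equal combining classes are never reordered, so the two
# orders below stay distinct under normalisation.
one_two = "\u1A75\u1A76"
two_one = "\u1A76\u1A75"
EQUIVALENT = unicodedata.normalize("NFC", one_two) == unicodedata.normalize(
    "NFC", two_one
)
print("canonically equivalent:", EQUIVALENT)
```

Because TONE-1 and TONE-2 share a combining class, the two orders are
not canonically equivalent, and any identical rendering has to come
from the font rather than from normalisation.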

