Choosing the Set of Renderable Strings

Richard Wordingham via Unicode unicode at unicode.org
Tue May 15 16:40:11 CDT 2018


On Tue, 15 May 2018 04:19:42 -0800
James Kass via Unicode <unicode at unicode.org> wrote:

> On Mon, May 14, 2018 at 11:31 AM, Richard Wordingham via Unicode
> <unicode at unicode.org> wrote:
> 
> > ...  One could argue that the three positions require
> > different glyphs for SIGN U.  Each font would need its own PUA.  
> 
> Or a consensus.

One would end up with a large glyph list to accommodate all designs.
Imagine applying this approach to Devanagari, with all its Sanskrit
conjuncts to be supported although some converters would only target a
small subset.

> > ... There are several
> > places in Tai Tham layout where I want to swap glyphs round, but for
> > the layout engine to do so for me would cause grief for other Tai
> > Tham fonts. This rearrangement cannot be delegated to the rendering
> > engine.  There are Tai Tham fonts which handle Indic rearrangement
> > in the ccmp feature, but they are then totally defeated by either
> > ccmp not being enabled or by the USE doing basic Indic shaping.  
> 
> Suppose the OpenType specs were revised to include a bit which could
> be set for disabling basic Indic shaping by the USE?  I wouldn't set
> it if I were just starting out to make a font for a complex script
> requiring basic Indic shaping, and cannot imagine why anyone else just
> starting out would.

One would need to set the bit while the script was not yet in Unicode,
and then you may well need to set it when the USE bites.  As another
concrete example, one couldn't use USE for the Khmer script - it too
has CVC syllables.  I believe there are also lurking problems with the
ordering of the rarer marks.

You'd come unstuck if you found your script had both preposed
subscripts and optionally preposed matras.  The USE can't handle both
in the same syllable.

One might need to ignore syllable boundaries before Indic re-ordering,
though that's probably a preference rather than a requirement.  Tai
Tham has a troublesome mark, U+1A58 TAI THAM SIGN MAI KANG LAI.  In the
West, it's 'Consonant final' and is a mark above or above right.  In
the East, it works like Burmese kinzi, and acts like a repha.  Revision
1 of the Maefahluang Dictionary of Northern Thai sits on the border.
In its text, it behaves one way in some environments, and the other
way in others.

Finally, many scripts had fonts before Windows supported them.  Indeed,
isn't significant Tai Tham renderer support on Windows 7 restricted to
HarfBuzz clients?  (I don't believe M17n is significant, and I fear my
interfacing set-up only works for my fonts.) 

> >> A good keyboard driver ...  
> >
> > It won't work.  The text input delivered by X still needs to be
> > supported, and without modifying the application, X can only input
> > one character at a time.  Not everyone uses an 'input method'.  
> 
> Every keyboard uses a driver, though.  I can't speak for "X", but my
> understanding is that the keyboard driver acts as sort of a buffer
> between the user's key strokes and the application.

X attempts to present the key strokes to the application.  The
application may choose to present these key strokes to an input method to
handle, but these input methods are not reliable.  I have a battery of
three input methods for most applications on Ubuntu - raw X keyboard
mapping, ibus using Keyman for Linux, and fcitx using M17n.
Additionally, I find Emacs is easier to use if I talk to it in ASCII
and use its input methods for other character sets.  The advantage
there is that Emacs knows whether I am entering a command, which must
be in ASCII, or text, for which it uses the active input method.

Another issue is that normalised text can be highly inconvenient for a
font.  HarfBuzz chooses a non-standard normalisation for several
scripts simply because that makes things easier for a font. 

> > I've seen an implementation of the USE render
> > canonically equivalent strings differently.  ...  
> 
> Because the USE failed or because the font provided look-ups for each
> of those strings to different glyphs?

Remember that the USE changes the string presented to the font by
inserting dotted circles.  Essentially, <tone, SAKOT, consonant> and
<SAKOT, tone, consonant> can be penalised differently - Microsoft
inserts more dotted circles than does HarfBuzz.
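
To make the equivalence concrete, here is a minimal sketch in Python
using only the standard unicodedata module.  The particular tone mark
(U+1A75) and the consonant (U+1A20), including the extra base consonant
placed in front of the sequence, are my own assumptions chosen for
illustration.  It only checks that the two orders are canonically
equivalent; it says nothing about what any shaping engine does with
them.

```python
import unicodedata

# Illustrative code points; the specific tone mark and consonant are assumptions.
BASE = "\u1A20"   # TAI THAM LETTER HIGH KA, assumed base before the marks
SAKOT = "\u1A60"  # TAI THAM SIGN SAKOT
TONE = "\u1A75"   # TAI THAM SIGN TONE-1
CONS = "\u1A20"   # the consonant that follows the marks

tone_first = BASE + TONE + SAKOT + CONS
sakot_first = BASE + SAKOT + TONE + CONS


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Show the canonical combining classes that drive the reordering.
for ch in (SAKOT, TONE):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")

print("code point sequences differ:", tone_first != sakot_first)
print("canonically equivalent:", canonically_equivalent(tone_first, sakot_first))
```

Since the two strings differ only in the order of marks with different
non-zero combining classes, a renderer that treats them differently is
distinguishing between canonically equivalent strings.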

Richard.

