Choosing the Set of Renderable Strings

Richard Wordingham via Unicode unicode at unicode.org
Fri May 11 12:37:27 CDT 2018


For assembling a rendering system for a script with combining marks,
is there a guide as to how to decide what strings one should exclude,
and which one should strive to support?

There will also be characters outside the script that should be
supported.  For a font, there are lists of characters for Microsoft Word
and for the Universal Shaping Engine, and it is frequently desirable
for a font to be able to display its own name.  There are also various
control and formatting characters, and punctuation characters from
outside the script.

I believe compromises are necessary.

There are issues with stacking combining marks - at what point does one
throw oneself on the mercy of the application?  Making characters small
enough to accommodate a cross-line stack of 20 within the nominal line
separation is usually not acceptable!  (There are Sanskrit manuscripts
where a stack extends across several lines.)  There are also problems
if glyphs cannot simply be stacked - it is not unknown for a
'subscript' glyph to obligatorily have a part on the baseline - preposed
'subscript' RA can require different glyphs depending on how deeply it
is stacked.

If canonical equivalence does not eliminate homographs, there is the
question of which homographs to tolerate.

I have hit this issue with Tai Tham.  The essence of the problem is
that a CVCV word with identical consonants can be abbreviated to CVV,
as in some other scripts, and dependent vowels can be written using
several vowel symbols.  All vowels have ccc=0.  Now, the accepted
proposal (i.e. the one accepted by the UTC for the ISO process) gave an
order for the vowels in such polygraphs, and most combinations
resulting from such contraction comply with this order.
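As a small illustration of why canonical equivalence does not help
here, the following sketch uses Python's standard unicodedata module to
show that two vowel signs with ccc=0 are never reordered by
normalisation, so both orders survive as distinct strings.  The choice
of HIGH KA, SIGN E and SIGN AA is only an example, not a claim about
which order the proposal permits.

```python
import unicodedata

KA = "\u1A20"  # TAI THAM LETTER HIGH KA
E = "\u1A6E"   # TAI THAM VOWEL SIGN E
AA = "\u1A63"  # TAI THAM VOWEL SIGN AA

def ccc(ch):
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)

def nfc(s):
    return unicodedata.normalize("NFC", s)

# Both vowel signs are starters (ccc=0), so normalisation leaves the
# order in which they were typed untouched.
one_order = KA + E + AA
other_order = KA + AA + E
still_distinct = nfc(one_order) != nfc(other_order)
```

Any restriction on vowel order therefore has to be imposed by the font
or by some other layer; normalisation will not do it.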

The existence of such a contraction can be indicated in writing by the
(ambiguous) mark MAI SAM, and in such cases the proposed encoding of
Tai Tham text is of the form CVxV where 'x' is MAI SAM.  In such cases
I allow the constraint on vowel order to apply to each vowel
separately. This allows homographs, but I take the view that I am
rejecting homographs to facilitate searching, not to prevent spoofing.
The prevention of spoofing would use stricter rules, which would ban
some words, just as the English word "café" is prohibited in British
domain names.  (The doublet "cafe" refers to a lower class of
establishment in British English.)

However, the mark MAI SAM is not always used.  Now, if Tai Tham vowels
had non-zero combining classes, I would separate the vowels from the two
phonetic syllables by the general disruptor, CGJ, to facilitate
sorting.  At the very least the word should then be sorted with other
words starting with the same CV, and with preprocessing, the CGJ could
be replaced by the omitted consonant.  Now, Tai Tham vowels have
ccc=0, but I favour retaining the CGJ to mark the location of the
repeated consonant.
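The preprocessing step mentioned above might look roughly like the
following sketch, which replaces each CGJ by the most recent consonant
before handing the string to a collator.  The function names are
hypothetical, and treating U+1A20..U+1A4C as "the consonants" is a
simplifying assumption: it ignores subjoined consonants after SAKOT,
which would need extra handling in real text.

```python
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def is_consonant(ch):
    # Assumption: U+1A20..U+1A4C are taken to be the consonant letters.
    return "\u1A20" <= ch <= "\u1A4C"

def expand_cgj(text):
    """Replace each CGJ with the nearest preceding consonant letter."""
    out = []
    last_consonant = None
    for ch in text:
        if ch == CGJ and last_consonant is not None:
            # The CGJ marks where the repeated consonant was omitted.
            out.append(last_consonant)
        else:
            if is_consonant(ch):
                last_consonant = ch
            out.append(ch)
    return "".join(out)
```

For example, expand_cgj applied to C V CGJ V yields C V C V, which can
then be sorted as the uncontracted word.  A CGJ with no preceding
consonant is left in place.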

This CGJ also enables me to make some check as to whether the
individual phonetic syllables' vowel symbols are in the correct
order.  So:

(a) If the vowel symbols in CVV are in the permitted order, the string
is accepted.

(b) If the word is typed as CV<CGJ>V and the vowels on either side of
CGJ are in the correct order, the string is accepted.

(c) If the word is typed as CVV and the vowel symbols are not in the
permitted order, and I can detect this, I allow the implementation of
the Universal Shaping Engine (be it Microsoft, AAT or HarfBuzz) to insert
its dotted circles.  More precisely, I don't remove them.
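Rules (a) to (c) can be sketched as a check on the run of vowel symbols
that follows the consonant.  This is only an illustration: the real
check lives in the font's shaping rules, the function names are
hypothetical, and the VOWEL_RANK table below is invented for the
example and is NOT the order given in the accepted proposal.  MAI SAM
is treated as a separator in the same way as CGJ, following the
description of the CVxV case earlier.

```python
CGJ = "\u034F"      # COMBINING GRAPHEME JOINER
MAI_SAM = "\u1A7B"  # TAI THAM SIGN MAI SAM

# HYPOTHETICAL ranking, for illustration only; the accepted proposal
# defines the real order, which is not reproduced here.
VOWEL_RANK = {
    "\u1A6E": 0,  # TAI THAM VOWEL SIGN E
    "\u1A65": 1,  # TAI THAM VOWEL SIGN I
    "\u1A63": 2,  # TAI THAM VOWEL SIGN AA
}

def in_permitted_order(vowels):
    """True if the vowel symbols appear in strictly increasing rank."""
    ranks = [VOWEL_RANK.get(v) for v in vowels]
    if None in ranks:
        return False  # unknown symbol: do not accept
    return all(a < b for a, b in zip(ranks, ranks[1:]))

def accept_vowel_run(run):
    """Apply rules (a) and (b); a False result corresponds to rule (c)."""
    # Each stretch between separators is checked on its own.
    segments = run.replace(MAI_SAM, CGJ).split(CGJ)
    return all(in_permitted_order(seg) for seg in segments)
```

With this table, E followed by AA is accepted under (a), AA followed by
E is rejected under (c), and AA CGJ E is accepted under (b).  A
rejected run is where the shaping engine's dotted circles would be left
alone.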

Is this a reasonable approach to both supporting collation and
suppressing needless homographs?  My contribution to the rendering is
only the provision of a font.

Richard.  
 


