Fighting Spell-Checking by Renderers

Richard Wordingham via Unicode unicode at unicode.org
Sun May 14 01:04:31 CDT 2017


One of the early problems encountered with Unicode was that there can
be multiple ways of representing the same text.  For many scripts, the
solution was canonical equivalence - the multiple ways were declared to
be equivalent, and anything that thought they had different meanings
and should *therefore* be treated differently was non-compliant with the
Unicode standard.
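
A minimal sketch of what canonical equivalence means in practice, using Python's standard unicodedata module.  The helper name canonically_equivalent is made up for illustration; it simply compares the NFD forms of two strings.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Precomposed e-acute versus 'e' followed by COMBINING ACUTE ACCENT.
precomposed = "\u00e9"
decomposed = "e\u0301"

# The same two marks on 'a' typed in either order: COMBINING DOT BELOW
# (combining class 220) and COMBINING ACUTE ACCENT (combining class 230).
dot_then_acute = "a\u0323\u0301"
acute_then_dot = "a\u0301\u0323"

print(precomposed == decomposed)                               # differ as code points
print(canonically_equivalent(precomposed, decomposed))         # but are equivalent
print(canonically_equivalent(dot_then_acute, acute_then_dot))  # mark order is unified
```

The second pair is the relevant one here: because the two marks have different non-zero combining classes, normalisation puts them into a single order.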

Where canonical equivalence actually leads to the wrong conclusion, a
method was subsequently found to make sequences canonically
inequivalent: inserting U+034F COMBINING GRAPHEME JOINER (CGJ).  It
generally takes extra effort to insert this character.
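
As a rough illustration of the mechanism, CGJ has canonical combining class 0, so normalisation will not reorder marks across it.  The sketch below uses Latin combining marks only because they are easy to follow; the variable names are illustrative.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# 'a' + COMBINING ACUTE ACCENT (ccc 230) + COMBINING DOT BELOW (ccc 220)
plain = "a\u0301\u0323"
# The same marks with CGJ between them.
guarded = "a\u0301" + CGJ + "\u0323"

nfd_plain = unicodedata.normalize("NFD", plain)
nfd_guarded = unicodedata.normalize("NFD", guarded)

# Without CGJ the marks are put into canonical order (dot below first).
print([f"U+{ord(c):04X}" for c in nfd_plain])
# With CGJ the typed order survives normalisation.
print([f"U+{ord(c):04X}" for c in nfd_guarded])
```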

However, canonical equivalence hit a severe problem with two-part
Indic vowels, and the use of non-zero canonical combining classes in
Indic scripts is generally low.  A similar issue might arise with
graphically non-interacting subordinated consonants, especially when
encoded as virama/coeng plus base consonant.  One solution to this
problem is for renderers to produce a strange rendering if characters
appear in a non-standard order.
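
To make the point about combining classes concrete, here is a small sketch that looks up the canonical combining class of a few Indic marks and then shows that the two halves of a decomposed two-part vowel are not reordered by normalisation.  The choice of sample characters is mine.

```python
import unicodedata

# A few Indic combining marks.  Most have combining class 0; viramas
# (and coeng) have 9 and nuktas have 7.
samples = ["\u093c", "\u094d", "\u093f", "\u0bc6", "\u0bbe", "\u17d2", "\u17c1"]
ccc = {ch: unicodedata.combining(ch) for ch in samples}
for ch, value in ccc.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: ccc={value}")

# TAMIL VOWEL SIGN O (U+0BCA) decomposes canonically to <U+0BC6, U+0BBE>.
ka = "\u0b95"
two_part = unicodedata.normalize("NFD", ka + "\u0bca")

# The same two vowel signs typed in the opposite order.  Both have
# combining class 0, so normalisation leaves this order alone and the
# string is NOT canonically equivalent to the one above.
swapped = ka + "\u0bbe\u0bc6"
swapped_nfd = unicodedata.normalize("NFD", swapped)
```

Since normalisation cannot unify the two orders, any policing of the order has to happen elsewhere, which is where the renderer comes in.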

However, character strings are not just rendered and compared for
identity.  They are also transliterated, sorted into alphabetical
order, and may be input to automatic speech generation systems with
limited capabilities for resolving homographs. This may require some
way of tagging an apparently incorrectly ordered string, analogous to
the use of 'sic' in English, to indicate that the text is intended not
to accord with the 'standard' character order.

What characters are available for such a rôle?  CGJ is a possibility,
but I am concerned that it may be being overworked.  It is already
suggested as a solution for sorting when a digraph is treated as a
single letter but an accidental sequence of the same two characters is
not, as with the Welsh letter 'ng' (which comes between 'g' and 'h' in
the alphabet) as opposed to an 'accidental' sequence such as in
'Bangor' and 'Llangollen'.
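
A toy sketch of the Welsh case, assuming a hand-written sort key rather than a real collation library.  The function names and the greedy digraph matching are my own simplification; the only point is that a CGJ between 'n' and 'g' stops the pair being read as the letter 'ng'.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Welsh alphabet in order; note 'ng' sits between 'g' and 'h'.
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i",
    "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th",
    "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_ALPHABET)}
DIGRAPHS = {letter for letter in WELSH_ALPHABET if len(letter) == 2}

def welsh_letters(word: str) -> list[str]:
    """Split a word into Welsh letters; a CGJ blocks digraph formation."""
    text = word.lower()
    letters = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            i += 1                      # CGJ itself is not a letter
        elif text[i:i + 2] in DIGRAPHS:
            letters.append(text[i:i + 2])
            i += 2
        else:
            letters.append(text[i])
            i += 1
    return letters

def welsh_sort_key(word: str) -> list[int]:
    """Sort key: alphabet positions, unknown characters last."""
    return [RANK.get(letter, len(WELSH_ALPHABET)) for letter in welsh_letters(word)]

print(welsh_letters("angen"))           # 'ng' read as one letter
print(welsh_letters("Ban" + CGJ + "gor"))  # 'n' and 'g' kept apart
```

Without the CGJ, this sketch would read 'Bangor' as containing the letter 'ng', which is exactly the mis-sorting the extra character is meant to prevent.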

Such characters probably don't work for this purpose now, but it may be
possible to persuade the renderer suppliers to heed them.  The ideal
character would be
disallowed in domain names, which should allay the greatest security
worries about simply rendering the text as it stands.

Some potential ambiguities arise from Sanskrit, and were raised long
ago by Peter Constable on the Unicode Indic list on 28 August 2006
under the heading 'contrastive /Crv/ and /Cvr/ in Telugu, Malayalam'.
The cases he gave were 'grva' v. 'gvra', 'drva' v. 'dvra' and 'srva' v.
'svra'. For the Khmer script, the KhmerOS font renders the pairs
identically, which did surprise me, as I had got it into my head that
one could tell from the depth of the <COENG, RO> where the RO came in
the sequence of conjoined letters.
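
For concreteness, here is a sketch of the first pair as Khmer code point sequences.  I am assuming Sanskrit 'g' is written with KHMER LETTER KO (U+1782); the other spellings follow the same pattern.  The code only shows that the two strings are distinct and stay distinct under normalisation; it says nothing about how any font renders them.

```python
import unicodedata

KO = "\u1782"     # KHMER LETTER KO (assumed here for Sanskrit 'g')
RO = "\u179a"     # KHMER LETTER RO
VO = "\u179c"     # KHMER LETTER VO
COENG = "\u17d2"  # KHMER SIGN COENG

grva = KO + COENG + RO + COENG + VO
gvra = KO + COENG + VO + COENG + RO

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

# The consonants have combining class 0, so the <COENG, consonant>
# pairs cannot be reordered by normalisation.
distinct_after_nfd = nfd(grva) != nfd(gvra)

# Yet both strings contain exactly the same characters.
same_characters = sorted(grva) == sorted(gvra)

for label, s in (("grva", grva), ("gvra", gvra)):
    print(label, " ".join(f"U+{ord(c):04X}" for c in s))
```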

Richard.