Character Sequences of Uncertain Rendering (was: Version linking?)

Richard Wordingham via Unicode unicode at
Sat Aug 26 14:28:36 CDT 2017

On Fri, 25 Aug 2017 01:24:36 +0200
Philippe Verdy via Unicode <unicode at> wrote:

> 2017-08-17 22:37 GMT+02:00 Richard Wordingham via Unicode <
> unicode at>:  
> > Fortunately, there is no good evidence that the occurrence
> > of multiple distinct left matras is anything but a typing error,
> > though I can easily see how it might be used as a lexicographical
> > convention on the fuzzy edge of plain text.
> >
> > In a similar vein, in Malayalam, we get repeats of the 2-part vowel
> > U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at
> >
> > ),
> > but I'm not sure what the legitimate encodings of the example word
> > കോോോ (typed here as <U+0D15, U+0D4B, U+0D4B, U+0D4B>) are.
> Even if there were typing errors, the input method should either
> signal it visually to the user (using canonical reordering), or the
> user could still cancel this reordering (e.g. CTRL+Z for undoing it)
> and the input method could still fix it and maintain the order by
> then inserting combining joiners automatically even if the user did
> not enter them directly.

I don't see how any of ZWJ, ZWNJ and CGJ would help with multiple
distinct left matras or repeated 2-part vowels. You might argue for
insertion of U+25CC as a base consonant, along with the ability to
delete just it.
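
As a rough illustration of why canonical reordering does not come into
play here, the following Python sketch (standard unicodedata module
only) inspects the example word as typed above.  It assumes nothing
beyond the character data shipped with the interpreter; the helper
name describe is purely illustrative.  Every code point involved has
canonical combining class 0, so normalisation decomposes and
recomposes the 2-part vowel but never reorders anything.

```python
import unicodedata

# The example word as typed: KA followed by three copies of VOWEL SIGN OO.
word = "\u0D15\u0D4B\u0D4B\u0D4B"


def describe(text):
    """Return (code point, canonical combining class) pairs for text."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# NFD splits each U+0D4B into its two canonical parts.
nfd = unicodedata.normalize("NFD", word)
# NFC puts the parts back together again.
nfc = unicodedata.normalize("NFC", nfd)

print(describe(word))
print(describe(nfd))
print("round trip preserved:", nfc == word)
```

Since all the combining classes are 0, an input method that wanted to
flag the repeats would have to rely on something other than canonical
ordering.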

> The joiners should better be removed transparently by the text editor
> without requiring the user to perform complex selections or pressing
> BACKSPACE multiple times, as I don't see any use of these joiners at
> end of graphemes, or multiple joiners in a sequence.

I believe <ZWNJ, ZWJ> has a rôle in some Arabic script writing systems,
and possibly in other cursive scripts of Semitic origin, such as Mongolian.
<Virama, ZWNJ> is required at some syllable boundaries, and it is nice
to have ZWNJ honoured in the sequence <U+1A36 TAI THAM LETTER NA,
U+200C ZWNJ, U+1A63 TAI THAM VOWEL SIGN AA>, which is composed of two
extended grapheme clusters, <U+1A36, U+200C> and <U+1A63>.  This latter,
of course, is no more than one would require of good Latin typography
that works well with an English spell-checker - I would expect 'caecum'
to have a ligature but not 'sundae'.
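
For anyone wanting to inspect the Tai Tham sequence above, here is a
small Python sketch that lists the name and general category of each
code point.  It is only an inspection aid: the standard library does
not segment text into extended grapheme clusters, so the cluster
boundary described above is not computed here, and the helper name
props is an assumption made for illustration.

```python
import unicodedata

# <TAI THAM LETTER NA, ZWNJ, TAI THAM VOWEL SIGN AA>
seq = "\u1A36\u200C\u1A63"


def props(text):
    """Return (code point, character name, general category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


for code_point, name, category in props(seq):
    print(code_point, category, name)
```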

> Even for Latin, one can freely enter SHY controls at any place within
> words, even if they are not at correct syllabic separations: this will
> impact the rendering if there are linebreaks, but this is done on
> purpose, and still easy to correct if this was made by error (a spell
> checker could also help locate these uncommon errors in existing
> texts but would not automatically correct them without instruction
> given by the user and a user can also choose to ignore/discard these
> signals and store the text as is).

Now that brings to mind some interesting cases - <consonant, SHY, right
matra> and <consonant, SHY, left matra>.  I'm not sure where the
handling should go, but Firefox handles the former reasonably.  My one
gripe is that I don't know how to tell the system that a rendered soft
hyphen is invisible.  Some typographers claim that the glyph for the
soft hyphen (i.e. the glyph for U+00AD) should be used when it becomes
manifest.  I haven't found any cases where a line break should go
between a left matra and a base consonant, but I wouldn't be surprised
to encounter an example in a manuscript in a phonetically ordered
script.  (They are far from unknown in Thai, but that's probably due
to software deficiencies.)  TUS treats the rendering of soft hyphens as
beyond its scope except for line-breaking - the rules are
language-dependent and beyond the scope of Unicode.  I don't know if
CLDR handles rendering around line-breaking soft hyphens.

I'm wondering if there are any cases where a SHY _should_ go between
a Latin letter and diacritic.  I can't think of any.
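
One practical consequence of putting a SHY there can be sketched in
Python with the standard unicodedata module.  U+00AD has canonical
combining class 0, so it sits between the letter and the combining
mark as a starter and NFC no longer composes them.  The variable names
are illustrative only.

```python
import unicodedata

SHY = "\u00AD"  # SOFT HYPHEN

plain = "e\u0301"              # e + COMBINING ACUTE ACCENT
split = "e" + SHY + "\u0301"   # same, with a SHY in between

composed = unicodedata.normalize("NFC", plain)
kept = unicodedata.normalize("NFC", split)

print([f"U+{ord(ch):04X}" for ch in composed])
print([f"U+{ord(ch):04X}" for ch in kept])
```

Without the SHY the pair composes to a single precomposed letter; with
it, the three code points survive NFC unchanged.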

