Question on combining character order

Richard Wordingham richard.wordingham at ntlworld.com
Sun Jun 20 23:43:37 CDT 2021


On Sun, 20 Jun 2021 02:10:34 -0700
Asmus Freytag via Unicode <unicode at corp.unicode.org> wrote:

> The short answer is "no".
> 
> A longer answer is that typing order, display order and
> phonetic/semantic order do not have to agree.
> 
> On 6/20/2021 1:54 AM, Phake Nick via Unicode wrote:
> > Currently, in Unicode, combining characters like U+20DD or U+20DE
> > are to be placed after the main character they combine with.
> > But sometimes, linguistically, it makes sense for a combining mark to
> > come in front.
> > 
> > For example, the famous instant ramen brand, Maruchan, was
> > originally called "Maruto" in Japanese, as a spoken form of its
> > initial trademark, in which the Japanese hiragana character "To"
> > (standing for the company's official name, Toyo Suisan) is placed
> > inside a circle ("Maru"). To replicate the sign using modern Unicode,
> > users would need to first input the hiragana character "To" and
> > then input the combining circle mark U+20DD as the maru, which is
> > the reverse of the linguistic order in which such marks are
> > pronounced in Japanese.
> > 
> > Another example, in Cantonese, it is customary to create new
> > Chinese characters to express Cantonese phonemes that don't have
> > obvious connection with commonly known Chinese characters, by
> > attaching the component of a mouth (U+2F1D) onto other
> > similarly-sounded Chinese characters with different meaning. For
> > example, Unicode character U+975A, meaning "beautiful", can have
> > the component of mouth attached to it, and become U+210C1, meaning
> > beautiful. Although in this particular example the modified
> > character has also been encoded, on some platforms it might not be
> > supported by the input method or is otherwise difficult to
> > enter, and thus people would input the deconstructed form. But
> > because there is no small mouth component for combination, and
> > combining characters through an Ideographic Description Sequence
> > is also not supported on most platforms, it is common for
> > people to use Latin small letter o, U+006F, to represent the
> > component. As the component is customarily written on the left side
> > of Chinese characters, and it is customary for Chinese characters to
> > be written from left to right, it would be usual for the additional
> > component to be keyed in before entering the character itself. As
> > such, if a combining character featuring the component of mouth is
> > to be introduced, it would make the most sense if the combining
> > mouth component is to be typed in before the character to be
> > modified, instead of the other way round.
> > 
> > Is there a mechanism in Unicode that can support this type of
> > combining character?

(Resending)

Yes, in various degrees.

1. Coeng characters (i.e. most invisible stackers) convert the
input-logically following consonant into a consonant character.
Category Mn.

2. 'Buoyant' consonants that sit on the hanging baseline above the rest
of the consonant stack, such as U+0D4E MALAYALAM LETTER DOT REPH
(category Lo) ...

3. ... and eastern U+1A58 TAI THAM SIGN MAI KANG LAI (category
Mn).
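
For anyone who wants to check the properties mentioned in the list
above, here is a minimal sketch using Python's standard unicodedata
module.  The two coengs (U+17D2 KHMER SIGN COENG and U+1A60 TAI THAM
SIGN SAKOT) are picked here purely as illustrations; they are not named
in the list.  The values printed depend on the Unicode version bundled
with the Python build.

```python
import unicodedata

# U+0D4E and U+1A58 come from the list above; the two coengs are
# illustrative picks (an assumption, not named in the message).
CHARS = {
    "coeng (Khmer)": "\u17D2",
    "coeng (Tai Tham sakot)": "\u1A60",
    "dot reph": "\u0D4E",
    "mai kang lai": "\u1A58",
}

def describe(ch):
    """Return (code point, general category, canonical combining class)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))

table = {label: describe(ch) for label, ch in CHARS.items()}

for label, (cp, gc, ccc) in table.items():
    print(f"{label:24} {cp}  gc={gc}  ccc={ccc}")
```

The point of interest is that only the coengs have a non-zero
canonical combining class, while the other two have class 0.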

There are problems with most of these:

1. Coeng characters are given a non-zero canonical combining class,
which can cause them to be separated from combining marks applied to
the previous base character.  That happens in Tai Tham, and would
happen in Kharoshthi if nuktas were applied to the initial characters
of conjoined characters.

3. The properties of U+1A58 are based on western usage, where it
functions as a final consonant, so grapheme clustering unites it with
the previous consonant.  Manipulating an isolated orthographic syllable
starting with it is awkward at best.
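
The reordering behind problem 1 can be seen with a small sketch, again
using Python's unicodedata.  The four-character Tai Tham sequence below
is an artificial one made up for illustration, not an attested
spelling; it only assumes that the tone mark has canonical combining
class 230 and the sakot class 9.

```python
import unicodedata

HIGH_KA = "\u1A20"  # TAI THAM LETTER HIGH KA, ccc 0
TONE_1 = "\u1A75"   # TAI THAM SIGN TONE-1, ccc 230
SAKOT = "\u1A60"    # TAI THAM SIGN SAKOT (the coeng), ccc 9

# Hypothetical input: base, tone mark on that base, then coeng + consonant.
typed = HIGH_KA + TONE_1 + SAKOT + HIGH_KA
normalized = unicodedata.normalize("NFC", typed)

def codepoints(s):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

print("typed:     ", codepoints(typed))
print("normalized:", codepoints(normalized))
```

Because 9 sorts before 230, canonical ordering moves the sakot in front
of the tone mark, so after normalization the coeng no longer sits next
to the consonant it was typed with.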

What you want are formally format characters (Cf), like the IDS
controls, but with a mandatory graphic effect, more like the control
characters for Egyptian hieroglyphs.  However, for Chinese character
composition, what you want might be better served by an Lo with
appropriate clustering and line-breaking operations.  It's time to move
on to 'every script is complex'.

Richard.

