From someonesdad1 at gmail.com  Thu Jun  3 13:29:42 2021
From: someonesdad1 at gmail.com (Don Peterson)
Date: Thu, 3 Jun 2021 12:29:42 -0600
Subject: Suggestion for superscripts
Message-ID: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>

For about a decade I have been wanting to be able to print Unicode
superscript characters in the output of some programs.  The most common use
case for this is to print the exponents to physical units.  An example is
kg·m/s², which is a bit easier on the eyes and brain than kg*m/s**2.

Unfortunately, the current version 13 character set doesn't have enough
superscript characters to support common scientific usage.  From the
ucd.nounihan.grouped XML file for version 13, these are the superscript and
subscript characters I could find:

Superscripts: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Subscripts:   ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ???????????????????

Superscript characters are lacking for two fairly common use cases:
floating point exponents and fractional exponents.  These would be possible
with the addition to the superscripts of the two common radix characters
'.' and ',' and a solidus character.  However, it seems to me that *the
Unicode design should aim at least at putting all printable 7-bit ASCII
characters and the upper and lower case Greek characters commonly used in
technical work in both the subscript and superscript sets*.  I've never
commented on this before because I thought it was obvious and would be
fixed in the next Unicode revision.  I remember looking at this pretty
carefully around version 7 and being surprised by the lack.  Being a lazy
retired person for the last 20 years meant I didn't do anything about it,
which I now regret.  :^)

Because of this lack of superscript characters, one of my library functions
is forced to produce syntactically-correct but ugly output such as
m**0.75·Pa**-1.3·s⁻²·K⁻¹
for a units string input of "m(3/4) Pa(-1.3)/(s2*K)" (with syntax similar
to the GNU units program).
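
The fallback described above can be sketched as follows. This is only a
minimal illustration, not the library function mentioned in the message:
the names format_unit and format_units are hypothetical, and the input is
assumed to be already parsed into (symbol, exponent) pairs.

```python
# Map ASCII digits, signs and parentheses to the superscripts Unicode provides.
SUPERSCRIPT = str.maketrans("0123456789+-=()", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾")


def format_unit(symbol, exponent):
    """Format one unit; use superscripts only for integer exponents."""
    if exponent == 1:
        return symbol
    if float(exponent).is_integer():
        return symbol + str(int(exponent)).translate(SUPERSCRIPT)
    # There is no superscript '.' or '/', so fall back to ASCII notation.
    return f"{symbol}**{exponent}"


def format_units(factors):
    """Join (symbol, exponent) pairs with a middle dot."""
    return "·".join(format_unit(symbol, exp) for symbol, exp in factors)


print(format_units([("m", 0.75), ("Pa", -1.3), ("s", -2), ("K", -1)]))
```

Integer exponents come out as superscripts, while fractional and
floating point exponents keep the ** form because the needed characters
are missing.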
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20210603/564a2189/attachment-0001.htm>

From doug at ewellic.org  Thu Jun  3 15:26:55 2021
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 3 Jun 2021 14:26:55 -0600
Subject: Suggestion for superscripts
In-Reply-To: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
Message-ID: <002e01d758b6$cc0bb350$642319f0$@ewellic.org>

Don Peterson wrote:

> However, it seems to me that the Unicode design should aim at least at
> putting all printable 7-bit ASCII characters and the upper and lower
> case Greek characters commonly used in technical work in both the
> subscript and superscript sets.

https://unicode.org/faq/ligature_digraph.html#Txt5

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From wjgo_10009 at btinternet.com  Thu Jun  3 15:16:05 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 3 Jun 2021 21:16:05 +0100 (BST)
Subject: Suggestion for superscripts
In-Reply-To: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
Message-ID: <2764e325.fa03.179d384cbd8.Webtop.83@btinternet.com>


Interestingly, many years ago Bernard Miller, in his Bytext proposal, 
suggested what he termed "arrow parentheses".

There were eight of them.

The glyphs were each either an opening or closing parenthesis character, 
with either one or two up arrows, or one or two down arrows upon the 
parenthesis.

The single ones opened or closed a sequence of characters that were to 
be subscript or superscript, the double ones were for limits of definite 
integrals, summations and so on.

It seemed to me then, and does so now, to be a very good idea.

I am not an expert on Unicode and maybe there is some structural reason 
why this could not become implemented, even if people wanted it 
implemented.

Yet I put this forward in the hope that the idea will be considered 
seriously please.

https://www.unicode.org/mail-arch/unicode-ml/y2002-m01/0477.hl

Here is a link to The Bytext Standard document.

https://web.archive.org/web/20030317065850/http://bytext.org/The_Bytext_Standard.pdf

Arrow parentheses are on pages 33 and 34.

Oh, and notwithstanding the comments about Bytext made in the mailing 
list thread at the time, please have a look at pages 37 and 38 and 
observe what was being suggested in 2002. Hmm.

William Overington

Thursday 3 June 2021



From wjgo_10009 at btinternet.com  Thu Jun  3 16:36:37 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 3 Jun 2021 22:36:37 +0100 (BST)
Subject: Suggestion for superscripts
In-Reply-To: <2764e325.fa03.179d384cbd8.Webtop.83@btinternet.com>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
 <2764e325.fa03.179d384cbd8.Webtop.83@btinternet.com>
Message-ID: <30c21ada.fb7f.179d3ce8369.Webtop.83@btinternet.com>


Oops, one of the links does not work.

Here is the correct version.

https://www.unicode.org/mail-arch/unicode-ml/y2002-m01/0477.html

William Overington

Thursday 3 June 2021



From someonesdad1 at gmail.com  Thu Jun  3 18:14:51 2021
From: someonesdad1 at gmail.com (Don Peterson)
Date: Thu, 3 Jun 2021 17:14:51 -0600
Subject: Suggestion for superscripts
In-Reply-To: <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
 <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
Message-ID: <CAFyvY7X_7DRnLEb4qaO71r+DSyBLDD8hEK457W7Ui-fLmZMiXA@mail.gmail.com>

Alas, that's not a solution for environments like a text editor, bash
window, terminal, etc.

On Thu, Jun 3, 2021 at 2:26 PM Doug Ewell <doug at ewellic.org> wrote:

> Don Peterson wrote:
>
> > However, it seems to me that the Unicode design should aim at least at
> > putting all printable 7-bit ASCII characters and the upper and lower
> > case Greek characters commonly used in technical work in both the
> > subscript and superscript sets.
>
> https://unicode.org/faq/ligature_digraph.html#Txt5
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>
>
>

From a.lukyanov at yspu.org  Fri Jun  4 00:48:28 2021
From: a.lukyanov at yspu.org (a.lukyanov)
Date: Fri, 04 Jun 2021 08:48:28 +0300
Subject: Suggestion for superscripts
In-Reply-To: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
Message-ID: <60B9BEAC.10204@yspu.org>

22:59, Don Peterson wrote:
> Unfortunately, the current version 13 character set doesn't have 
> enough superscript characters to support common scientific usage.  
> From the ucd.nounihan.grouped XML file for version 13, these are the 
> superscript and subscript characters I could find:
>
> Superscripts: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
There are more of them:

?????????????????????????
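
One way to check the inventory is to ask the Unicode Character Database
for every character whose compatibility decomposition is tagged <super>
or <sub>. The sketch below uses Python's standard unicodedata module;
the helper name chars_with_tag is made up for illustration, and the
result depends on the Unicode version bundled with the Python build in
use, so the counts are deliberately not stated here.

```python
import sys
import unicodedata


def chars_with_tag(tag):
    """Return all characters whose decomposition starts with the given tag."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.decomposition(ch).startswith(tag):
            found.append(ch)
    return found


supers = chars_with_tag("<super>")
subs = chars_with_tag("<sub>")
print(len(supers), "".join(supers))
print(len(subs), "".join(subs))
```

The <super> list includes the modifier letters as well as the digits and
signs, which is why it is longer than a search for "SUPERSCRIPT" in the
character names would suggest.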

From daniel.buncic at uni-koeln.de  Fri Jun  4 02:45:23 2021
From: daniel.buncic at uni-koeln.de (Daniel Buncic)
Date: Fri, 4 Jun 2021 09:45:23 +0200
Subject: Suggestion for superscripts
In-Reply-To: <CAFyvY7X_7DRnLEb4qaO71r+DSyBLDD8hEK457W7Ui-fLmZMiXA@mail.gmail.com>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
 <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
 <CAFyvY7X_7DRnLEb4qaO71r+DSyBLDD8hEK457W7Ui-fLmZMiXA@mail.gmail.com>
Message-ID: <ca3b816d-516f-1e6a-6f82-097551979a21@uni-koeln.de>

On 03.06.2021 at 22:16, William_J_G Overington via Unicode wrote:
> Interestingly, many years ago Bernard Miller, in his Bytext
> proposal, suggested what he termed "arrow parentheses".

On 04.06.2021 at 01:14, Don Peterson via Unicode wrote:
> Alas, that's not a solution for environments like a text editor, bash
> window, terminal, etc.

Well, an environment where real superscripts can for some reason not be
implemented could display those "arrow parentheses" as control
characters.  Something like Pa?(-1.3)?s⁻² would in fact look better and
be less ambiguous than Pa**-1.3·s⁻² or Pa^-1.3·s⁻² (where one does not
really know whether the s⁻² is part of the exponent or not).

We already have lots of characters that influence the rendering of other
characters, e.g. combining diacritics, variation selectors,
right-to-left marks, zero-width joiner and non-joiner, etc.  These
"arrow parentheses" would be some more such characters, and not very
difficult to implement for most applications, which already have a way
of displaying superscripts and subscripts in rich text.

On 04.06.2021 at 07:48, a.lukyanov via Unicode wrote:
> There are more of them:
>
> ?????????????????????????

Yes, and even more:

capital ???????????????????, Greek ???????, Cyrillic ???, and lots of
IPA and other phonetic transcription characters, all of them named
"modifier letter".

It somehow seems a waste of codepoints (and of time for all the
registration processes) to encode every superscript or subscript
character separately that somewhere turns up as relevant instead of just
getting away with a couple of control characters and being done with the
registration of superscript and subscript characters forever (just as we
need to register no more accented characters because we have combining
diacritics).

And as to the question of whether these are just glyphs that do not
deserve being encoded or actual characters:  s?? ? s?2, and, in the very
same manner, x to the power of 1.3??, which I currently cannot write
without rich-text markup, is not the same as x?1.3??.  This is a crucial
difference that, in my opinion, cannot be left to rich text environments.

Daniel

-- 
Prof. Dr. Daniel Bunčić
===================================================
Slavisches Institut der Universität zu Köln
Weyertal 137, D-50931 Köln
Telefon:       +49 (0)221  470-3355
Telefax:       +49 (0)221  470-5001
Sprechstunden: http://ukoeln.de/12FE3
===================================================
Breslauer Straße 54, D-50321 Brühl
Telefon:       +49 (0)2232  150 42 80
===================================================
E-Mail:        daniel at buncic.de
Homepage:      http://daniel.buncic.de/
Threema:       https://threema.id/8M375R5K
Skype:         danielbuncic
Academia:      http://uni-koeln.academia.edu/buncic
===================================================

From haberg-1 at telia.com  Fri Jun  4 03:27:54 2021
From: haberg-1 at telia.com (Hans Åberg)
Date: Fri, 4 Jun 2021 10:27:54 +0200
Subject: Suggestion for superscripts
In-Reply-To: <ca3b816d-516f-1e6a-6f82-097551979a21@uni-koeln.de>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
 <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
 <CAFyvY7X_7DRnLEb4qaO71r+DSyBLDD8hEK457W7Ui-fLmZMiXA@mail.gmail.com>
 <ca3b816d-516f-1e6a-6f82-097551979a21@uni-koeln.de>
Message-ID: <C04A3413-5AA6-454A-9EC4-779FC5FADF21@telia.com>


> On 4 Jun 2021, at 09:45, Daniel Buncic via Unicode <unicode at corp.unicode.org> wrote:
> 
> On 04.06.2021 at 01:14, Don Peterson via Unicode wrote:
>> Alas, that's not a solution for environments like a text editor, bash
>> window, terminal, etc.
> 
> Well, an environment where real superscripts can for some reason not be
> implemented could display those "arrow parentheses" as control
> characters.  Something like Pa?(-1.3)?s⁻² would in fact look better and
> be more unambiguous than Pa**-1.3·s⁻² or Pa^-1.3·s⁻² (where one does not
> really know whether the s⁻² is part of the exponent or not).

For plain text input in a program, I use superscript and subscript parentheses, which look good and are easy to read. For example:
  ??????(-1.3)?????

I ditched the arrows approach, which I used first, inspired by programs like TeX, finding the rendering less appealing.


From raymond at almanach.co.uk  Fri Jun  4 03:30:14 2021
From: raymond at almanach.co.uk (raymond mercier)
Date: Fri, 4 Jun 2021 09:30:14 +0100
Subject: Suggestion for superscripts
In-Reply-To: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
Message-ID: <C268E3CBD5434D80B72C042297744D3A@UserPC>

Mathematicians use TeX for superscripts. It can be extended to include Unicode, making XeTeX.
https://www.overleaf.com/learn/latex/Articles/What's_in_a_Name:_A_Guide_to_the_Many_Flavours_of_TeX
Isn't that the way to go?
Raymond Mercier

From doug at ewellic.org  Fri Jun  4 13:45:54 2021
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 4 Jun 2021 12:45:54 -0600
Subject: Suggestion for superscripts
In-Reply-To: <CAFyvY7X_7DRnLEb4qaO71r+DSyBLDD8hEK457W7Ui-fLmZMiXA@mail.gmail.com>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
 <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
 <CAFyvY7X_7DRnLEb4qaO71r+DSyBLDD8hEK457W7Ui-fLmZMiXA@mail.gmail.com>
Message-ID: <000201d75971$d98c3520$8ca49f60$@ewellic.org>

"UnicodeMath," described in https://www.unicode.org/notes/tn28/ , might be an interesting solution. The plain-text representation is fairly readable, and something that is not ad hoc, but actually discussed and agreed upon by a panel of mathematicians. It could be copied and pasted from the terminal or text editor to a UnicodeMath-enabled processor (if you can find one) for real formatting.

I pointed to an FAQ entry because Unicode normally posts those for questions that are, well, frequently asked (like this one) and for which the answer is unlikely to change over time. We know from experience that over time, some such decisions have been modified, or even completely reversed. Most are not.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From haberg-1 at telia.com  Fri Jun  4 14:49:17 2021
From: haberg-1 at telia.com (Hans Åberg)
Date: Fri, 4 Jun 2021 21:49:17 +0200
Subject: Suggestion for superscripts
In-Reply-To: <C268E3CBD5434D80B72C042297744D3A@UserPC>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
 <C268E3CBD5434D80B72C042297744D3A@UserPC>
Message-ID: <E324D26B-4A59-4781-9BE7-696E6D53B7E3@telia.com>


> On 4 Jun 2021, at 10:30, raymond mercier via Unicode <unicode at corp.unicode.org> wrote:
> 
> Mathematicians use TeX for superscripts. It can be extended it to include Unicode, making XeTeX.
> https://www.overleaf.com/learn/latex/Articles/What's_in_a_Name:_A_Guide_to_the_Many_Flavours_of_TeX
> Isn't that the way to go ?

There is a difference between merely wanting a rendered math output and wanting an input that is legible and processable for purposes other than display. If you have a program processing math, then using TeX variants is hard enough to be a distraction, and the formulas are not copiable: in original TeX, not even Unicode input is, because it gets translated.

ConTeXt [1] is Unicode-friendly, uses UTF-8 as its default input encoding, and aims at unifying all those older variants. It is available in the TeX Live distribution [2]: Just typeset using 'context …'.

With a good text only input, a program like ConTeXt can produce a PDF with reasonably copiable formulas. That is, as long as one does not use superscript and subscripts other than what is already available in Unicode.

1. https://wiki.contextgarden.net/Main_Page
2. https://tug.org/texlive/


From eliz at gnu.org  Sat Jun  5 03:30:07 2021
From: eliz at gnu.org (Eli Zaretskii)
Date: Sat, 05 Jun 2021 11:30:07 +0300
Subject: Suggestion for superscripts
In-Reply-To: <000201d75971$d98c3520$8ca49f60$@ewellic.org> (message from Doug
 Ewell via Unicode on Fri, 4 Jun 2021 12:45:54 -0600)
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
 <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
 <CAFyvY7X_7DRnLEb4qaO71r+DSyBLDD8hEK457W7Ui-fLmZMiXA@mail.gmail.com>
 <000201d75971$d98c3520$8ca49f60$@ewellic.org>
Message-ID: <8335twlnn4.fsf@gnu.org>

> Date: Fri, 4 Jun 2021 12:45:54 -0600
> From: Doug Ewell via Unicode <unicode at corp.unicode.org>
> 
> "UnicodeMath," described in https://www.unicode.org/notes/tn28/ , might be an interesting solution. The plain-text representation is fairly readable, and something that is not ad hoc, but actually discussed and agreed upon by a panel of mathematicians. It could be copied and pasted from the terminal or text editor to a UnicodeMath-enabled processor (if you can find one) for real formatting.

Do you know any editor that implements that TN?

From doug at ewellic.org  Sat Jun  5 11:17:09 2021
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 5 Jun 2021 10:17:09 -0600
Subject: Suggestion for superscripts
In-Reply-To: <8335twlnn4.fsf@gnu.org>
References: <CAFyvY7V_XUHrrfr26=-OzB27K6=myWsENW9nKpnb601Jt290cg@mail.gmail.com>
 <002e01d758b6$cc0bb350$642319f0$@ewellic.org>
 <CAFyvY7X_7DRnLEb4qaO71r+DSyBLDD8hEK457W7Ui-fLmZMiXA@mail.gmail.com>
 <000201d75971$d98c3520$8ca49f60$@ewellic.org> <8335twlnn4.fsf@gnu.org>
Message-ID: <003301d75a26$3c5bb860$b5132920$@ewellic.org>

Eli Zaretskii wrote:

> Do you know any editor that implements that TN [#28]?

I do not, but then I haven't looked for one. I might try checking with Murray Sargent to see if he knows any.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From mark at markdawson.io  Sun Jun  6 22:48:22 2021
From: mark at markdawson.io (Mark Dawson)
Date: Sun, 6 Jun 2021 20:48:22 -0700
Subject: Confusables.txt might be too sensitive
Message-ID: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>

Dear Unicode Mailing List,

I am a user of the metamask <https://metamask.io/> browser extension (which
is a cryptocurrency wallet). My name always gets flagged as a potential
scam simply because it contains the small Latin letter "m" (codepoint 006D).
Someone contributing to the metamask project had the idea
<https://github.com/MetaMask/metamask-extension/pull/9187> to give a
warning message if someone is using a name that might contain suspicious
characters. Seems like a good idea to me.

The contributor decided to use TR39's confusable.txt
<https://www.unicode.org/Public/security/14.0.0/confusablesSummary.txt>
file to flag suspicious characters. On line 3344 of the confusables.txt, it
lists the small Latin letter "m" (codepoint 006D) as a source character for
a confusable. Is this intentional?

No other small Latin letter is flagged as a confusable. (Not even the
letter "o"). Would Unicode consider removing the small Latin letter "m" as
a source on the confusable.txt?

Thanks,

Mark

[image: image.png]
-- 
Mark Dawson
mark at markdawson.io
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 25412 bytes
Desc: not available
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20210606/2496af65/attachment.png>

From asmusf at ix.netcom.com  Mon Jun  7 00:43:37 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 6 Jun 2021 22:43:37 -0700
Subject: Confusables.txt might be too sensitive
In-Reply-To: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
References: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
Message-ID: <8d2f2ee2-adf4-39d4-4b3d-03ad6d0a1571@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20210606/eddaf9d1/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 25412 bytes
Desc: not available
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20210606/eddaf9d1/attachment.png>

From marius.spix at web.de  Mon Jun  7 07:16:25 2021
From: marius.spix at web.de (Marius Spix)
Date: Mon, 7 Jun 2021 14:16:25 +0200
Subject: Confusables.txt might be too sensitive
In-Reply-To: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
References: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
Message-ID: <20210607141625.533a3328@spixxi>

I guess, the problem is that m looks similar to rn. For example,
the domain "pomhub dot com" is easily confusable with a well-known
website. But that also would work the other way around, e. g.
"rnicrosoft dot com". Italic m also looks identical to italic т
(Cyrillic t). But I agree that m should not be considered to be a mock
letter at all, especially in cases where only identifiers [a-zA-Z_] are
allowed for user names. But this will be a task of the
individual implementation, not for Unicode. 


On Sun, 6 Jun 2021 20:48:22 -0700
Mark Dawson via Unicode <unicode at corp.unicode.org> wrote:

> Dear Unicode Mailing List,
> 
> I am a user of the metamask <https://metamask.io/> browser extension
> (which is a cryptocurrency wallet). My name always gets flagged as a
> potential scam simply because it contains the small Latin letter "m"
> (codepoint 006D). Someone contributing to the metamask project had
> the idea <https://github.com/MetaMask/metamask-extension/pull/9187>
> to give a warning message if someone is using a name that might
> contain suspicious characters. Seems like a good idea to me.
> 
> The contributor decided to use TR39's confusable.txt
> <https://www.unicode.org/Public/security/14.0.0/confusablesSummary.txt>
> file to flag suspicious characters. On line 3344 of the
> confusables.txt, it lists the small Latin letter "m" (codepoint 006D)
> as a source character for a confusable. Is this intentional?
> 
> No other small Latin letter is flagged as a confusable. (Not even the
> letter "o"). Would Unicode consider removing the small Latin letter
> "m" as a source on the confusable.txt?
> 
> Thanks,
> 
> Mark
> 
> [image: image.png]


From sosipiuk at gmail.com  Mon Jun  7 12:05:48 2021
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Mon, 7 Jun 2021 13:05:48 -0400
Subject: Confusables.txt might be too sensitive
In-Reply-To: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
References: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
Message-ID: <CAM+ijLg7J64h4MQ3aNKiLYqNEw84105DzOR9VwFyOu4fB0N_7Q@mail.gmail.com>

On Mon, Jun 7, 2021 at 1:16 AM Mark Dawson via Unicode <
unicode at corp.unicode.org> wrote:

>
> No other small Latin letter is flagged as a confusable. (Not even the
> letter "o").
>

All the other latin letters ARE listed as confusable. I'm curious how the
implementation decides which ones to flag. The only thing unique about "m",
versus the rest of the latin alphabet, seems to be that it's confusable
with a two-character sequence. But surely the implementation doesn't
restrict itself to only such cases, so what is happening here? Why is "m"
causing a problem, but "o" is not, when both are confusable with other
characters? Does it have to do with the input being restricted to ASCII (or
some other limited set) and so other characters are removed as
possibilities, leaving the latin set as non-confusable (aside from "m")?

Sławomir Osipiuk

From doug at ewellic.org  Mon Jun  7 13:00:28 2021
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 7 Jun 2021 12:00:28 -0600
Subject: Confusables.txt might be too sensitive
In-Reply-To: <CAM+ijLg7J64h4MQ3aNKiLYqNEw84105DzOR9VwFyOu4fB0N_7Q@mail.gmail.com>
References: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
 <CAM+ijLg7J64h4MQ3aNKiLYqNEw84105DzOR9VwFyOu4fB0N_7Q@mail.gmail.com>
Message-ID: <000001d75bc7$0003d670$000b8350$@ewellic.org>

Sławomir Osipiuk wrote:

>> No other small Latin letter is flagged as a confusable. (Not even the
>> letter "o").
>
> All the other latin letters ARE listed as confusable.

But not in confusables.txt. It's entirely likely, as Mark Dawson surmised, that the MetaMask people simply grabbed that one file and used it as their entire security strategy.

It would hardly be the first time that someone took a small component of the Unicode (or other) standard and used it as their implementation, instead of actually reading and understanding the standard. Look what happens when someone browses the Unicode code charts and declares that language X isn't fully supported because the contextual forms aren't there. (The same happens in BCP 47 when people look only at the Language Subtag Registry and don't read the document.)

> I'm curious how the implementation decides which ones to flag. The
> only thing unique about "m", versus the rest of the latin alphabet,
> seems to be that it's confusable with a two-character sequence. But
> surely the implementation doesn't restrict itself to only such cases,
> so what is happening here?

Actually, that is probably exactly what is happening: the implementation is taking confusables.txt out of context and using it as a sledgehammer.

> Why is "m" causing a problem, but "o" is not, when both are confusable
> with other characters? Does it have to do with the input being
> restricted to ASCII (or some other limited set) and so other
> characters are removed as possibilities, leaving the latin set as non-
> confusable (aside from "m")?

I think an interesting experiment would be to try other types of confusable scenarios, such as an ENS name wholly or partially in another script such as Greek or Cyrillic, to see if MetaMask allows those while flagging 'm'.

In any case, if MetaMask flags all ENS names that contain an 'm' (or '1' or 'I'), then a whole lot of users besides Mark are sure to run into the same problem. Gosh, even the example name at ens.domains ("Yourname.eth") would generate the warning.
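
To make the difference concrete, here is a minimal sketch contrasting the "sledgehammer" (flag any name containing a source character) with the pairwise skeleton comparison that UTS #39 describes. The PROTOTYPES table is a tiny hand-picked excerpt that I am assuming matches the corresponding confusables.txt entries, not the real file, and the function names are hypothetical; none of this is MetaMask's actual code.

```python
import unicodedata

# Assumed excerpt: source character -> prototype sequence.
PROTOTYPES = {
    "m": "rn",
    "1": "l",
    "I": "l",
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def naive_flag(name):
    """Sledgehammer: warn if any character is a source in the table."""
    return any(ch in PROTOTYPES for ch in name)


def skeleton(name):
    """Skeleton in the UTS #39 sense: NFD, map to prototypes, NFD again."""
    nfd = unicodedata.normalize("NFD", name)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)


def confusable(a, b):
    """Two distinct names are confusable if their skeletons are equal."""
    return a != b and skeleton(a) == skeleton(b)


print(naive_flag("mark.eth"))                 # flagged on 'm' alone
print(confusable("math.eth", "rnath.eth"))    # a real confusable pair
print(confusable("mark.eth", "doug.eth"))     # unrelated names
```

The skeleton is meant for comparing one name against another (for example, against names already registered), which is a different question from whether a single name happens to contain 'm'.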

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From asmusf at ix.netcom.com  Mon Jun  7 13:11:51 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 7 Jun 2021 11:11:51 -0700
Subject: Confusables.txt might be too sensitive
In-Reply-To: <CAM+ijLg7J64h4MQ3aNKiLYqNEw84105DzOR9VwFyOu4fB0N_7Q@mail.gmail.com>
References: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
 <CAM+ijLg7J64h4MQ3aNKiLYqNEw84105DzOR9VwFyOu4fB0N_7Q@mail.gmail.com>
Message-ID: <4bc565ea-000a-8225-ea94-d029393d09a7@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20210607/9fcc2477/attachment.htm>

From sosipiuk at gmail.com  Mon Jun  7 13:22:19 2021
From: sosipiuk at gmail.com (Sławomir Osipiuk)
Date: Mon, 7 Jun 2021 14:22:19 -0400
Subject: Confusables.txt might be too sensitive
In-Reply-To: <000001d75bc7$0003d670$000b8350$@ewellic.org>
References: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
 <CAM+ijLg7J64h4MQ3aNKiLYqNEw84105DzOR9VwFyOu4fB0N_7Q@mail.gmail.com>
 <000001d75bc7$0003d670$000b8350$@ewellic.org>
Message-ID: <CAM+ijLgpFoPd15nE4=i20hngDgKp9g40ErfQ8U=6RcNKuOBnnA@mail.gmail.com>

On Mon, Jun 7, 2021 at 2:02 PM Doug Ewell via Unicode
<unicode at corp.unicode.org> wrote:
> >
> > All the other latin letters ARE listed as confusable.
>
> But not in confusables.txt. It's entirely likely, as Mark Dawson surmised, that the MetaMask people simply grabbed that one file and used it as their entire security strategy.

D'oh! I was looking at
http://www.unicode.org/Public/security/latest/confusablesSummary.txt
(linked in Mark Dawson's original message) rather than at
http://www.unicode.org/Public/security/latest/confusables.txt

It's definitely more understandable what is happening with the latter.

From doug at ewellic.org  Mon Jun  7 14:41:15 2021
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 7 Jun 2021 13:41:15 -0600
Subject: Confusables.txt might be too sensitive
In-Reply-To: <CAM+ijLgpFoPd15nE4=i20hngDgKp9g40ErfQ8U=6RcNKuOBnnA@mail.gmail.com>
References: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
 <CAM+ijLg7J64h4MQ3aNKiLYqNEw84105DzOR9VwFyOu4fB0N_7Q@mail.gmail.com>
 <000001d75bc7$0003d670$000b8350$@ewellic.org>
 <CAM+ijLgpFoPd15nE4=i20hngDgKp9g40ErfQ8U=6RcNKuOBnnA@mail.gmail.com>
Message-ID: <000501d75bd5$15c4f530$414edf90$@ewellic.org>

Upon reading the MetaMask PRs and problem statement more closely, it seems they were mainly focused on mixed-script spoofing (e.g. using Greek 'ο' or Cyrillic 'о' in place of Latin 'o') and randomly inserted, invisible control characters like ZWNJ.

The author of the original PR (9129, not 9187) seemed to understand the underlying problem, and even suggested an existing library, but instead of using this presumably nuanced and tested solution, someone else applied the confusables.txt sledgehammer. That contributor even commented that his solution "might even be a little too strict because it warns on 'math.eth' being so similar to 'rnath.eth'," but nobody else complained, and so here we are.
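
For the two problems the PR was actually aimed at, a much narrower check might look like the sketch below. It is only a rough approximation under stated assumptions: Python's standard library has no Script property, so the first word of the character name stands in for the script, and "invisible" is taken to mean general category Cf. The function names are hypothetical, and this is not the library suggested in the PR.

```python
import unicodedata


def rough_script(ch):
    """Approximate the script by the first word of the character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_of_letters(name):
    """Collect the approximate scripts of all alphabetic characters."""
    return {rough_script(ch) for ch in name if ch.isalpha()}


def has_invisible(name):
    """Detect format characters (category Cf) such as ZWNJ."""
    return any(unicodedata.category(ch) == "Cf" for ch in name)


def suspicious(name):
    """Flag mixed-script names and names with invisible characters."""
    return len(scripts_of_letters(name)) > 1 or has_invisible(name)


print(suspicious("mark.eth"))         # all Latin, nothing hidden
print(suspicious("d\u043eug.eth"))    # Cyrillic 'о' among Latin letters
print(suspicious("mark\u200c.eth"))   # ZWNJ inserted
```

A check along these lines leaves an all-Latin name containing 'm' alone, while still catching the Greek or Cyrillic substitutions and the inserted ZWNJ.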

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org


From mark at markdawson.io  Tue Jun  8 13:51:33 2021
From: mark at markdawson.io (Mark Dawson)
Date: Tue, 8 Jun 2021 11:51:33 -0700
Subject: Confusables.txt might be too sensitive
In-Reply-To: <000501d75bd5$15c4f530$414edf90$@ewellic.org>
References: <CALs6ruLfOpMC1LxRGXGnBJjexY6aexKHROr8M+hMTLR6=JOUSg@mail.gmail.com>
 <CAM+ijLg7J64h4MQ3aNKiLYqNEw84105DzOR9VwFyOu4fB0N_7Q@mail.gmail.com>
 <000001d75bc7$0003d670$000b8350$@ewellic.org>
 <CAM+ijLgpFoPd15nE4=i20hngDgKp9g40ErfQ8U=6RcNKuOBnnA@mail.gmail.com>
 <000501d75bd5$15c4f530$414edf90$@ewellic.org>
Message-ID: <CALs6ruJxUWK4AZmnNyOxOpke8ykyxadbY7ieaV=bf1o5vKu7kQ@mail.gmail.com>

Thanks everyone for the feedback on this, and thank you Doug for looking
into the MetaMask implementation. It sounds like this is a problem that I
should talk to the MetaMask maintainers about.

On Mon, Jun 7, 2021 at 12:45 PM Doug Ewell via Unicode <
unicode at corp.unicode.org> wrote:

> Upon reading the MetaMask PRs and problem statement more closely, it seems
> they were mainly focused on mixed-script spoofing (e.g. using Greek 'ο' or
> Cyrillic 'о' in place of Latin 'o') and randomly inserted, invisible
> control characters like ZWNJ.
>
> The author of the original PR (9129, not 9187) seemed to understand the
> underlying problem, and even suggested an existing library, but instead of
> using this presumably nuanced and tested solution, someone else applied the
> confusables.txt sledgehammer. That contributor even commented that his
> solution "might even be a little too strict because it warns on 'math.eth'
> being so similar to 'rnath.eth'," but nobody else complained, and so here
> we are.
>
> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
>
>
>

-- 
Mark Dawson
mark at markdawson.io

From wjgo_10009 at btinternet.com  Tue Jun  8 13:06:46 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 8 Jun 2021 19:06:46 +0100 (BST)
Subject: The rules about tag sequences
Message-ID: <1ccd60c6.19bd0.179ecce2f3a.Webtop.100@btinternet.com>

In seeking to draft a reply to the comments in the

https://www.unicode.org/L2/L2021/21099-qid-feedback.pdf

document, I have wondered about the following questions.

Can we discuss them please?

Is it within the rules for tag sequences to have as a base a sequence of 
a character followed by one or more combining characters before the 
sequence of tag characters starts? The idea is that the particular base 
sequence chosen, a character followed by one or more combining 
characters, would be one very unlikely to be used otherwise, thus 
avoiding confusion.

Can any entity, whether a company or an individual, publish a tag 
sequence and it be valid, or does it need approval by the Unicode 
Technical Committee?

William Overington

Tuesday 8 June 2021


From wjgo_10009 at btinternet.com  Tue Jun 15 09:54:29 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 15 Jun 2021 15:54:29 +0100 (BST)
Subject: An artistic setting of a poem that uses language-independent glyphs
Message-ID: <569c8002.25034.17a102aaa99.Webtop.86@btinternet.com>

Recently I started the following thread in the "Share your work" section 
of the Serif Affinity forums.

https://forum.affinity.serif.com/index.php?/topic/143812-informal-design-workshop-idea/

I had been wondering what design to devise for the workshop thread. 
Although it was not originally intended, what emerged was a design for 
a greetings card with an artistic setting of a poem written using 
language-independent glyphs. Thus the finished artwork expresses a poem 
in glyphs of what is possibly, in effect, a type of pivot language.

Some readers might enjoy reading the thread, though the poem using 
language-independent glyphs is not introduced until the second page of 
the thread.

William Overington

Tuesday 15 June 2021


From public at khwilliamson.com  Thu Jun 17 09:28:39 2021
From: public at khwilliamson.com (Karl Williamson)
Date: Thu, 17 Jun 2021 08:28:39 -0600
Subject: Broken links to Unicode pages in current UCD files
Message-ID: <8df1f78b-9070-a6ac-ef68-7ed9bd277887@khwilliamson.com>

Someone discovered and reported various broken links.

An example is

# BidiMirroring-13.0.0.txt
# Date: 2019-09-09, 19:34:00 GMT [KW, LI, RP]
# © 2019 Unicode®, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html

On line number 43, there is a link to

http://www.unicode.org/unicode/reports/tr9/

Following that link leads to a 404.

A link that works is http://www.unicode.org/reports/tr9/

Presumably the link in the file used to work.  One of the first rules of 
web design is to never ever break a published link.  It's fine to 
reorganize your site; just be sure that the original links are available 
as synonyms to the modern version.

In this case, the current UCD contains broken links.
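
A quick way to list such links in a UCD file is sketched below in Python. This is only an illustration: the function names are made up, and the rewrite rule (dropping the "/unicode" path segment before "/reports/") is inferred from the single tr9 example above, so it is an assumption that other links follow the same pattern.

```python
import re

URL_RE = re.compile(r"https?://[^\s<>]+")
LEGACY_PREFIX = "/unicode/reports/"   # path form seen in the broken link
CURRENT_PREFIX = "/reports/"          # path form of the working link

def find_links(text):
    """Return every http(s) URL found in the text, in order."""
    return URL_RE.findall(text)

def modernize(url):
    """Rewrite the legacy path prefix to the current one (assumed mapping)."""
    return url.replace(LEGACY_PREFIX, CURRENT_PREFIX, 1)

def legacy_links(text):
    """Return (old, proposed_new) pairs for links using the legacy prefix."""
    return [(u, modernize(u)) for u in find_links(text) if LEGACY_PREFIX in u]

sample = (
    "# For terms of use, see http://www.unicode.org/terms_of_use.html\n"
    "# see http://www.unicode.org/unicode/reports/tr9/\n"
)
for old, new in legacy_links(sample):
    print(old, "->", new)
```

The sketch does no network access; whether each proposed replacement actually resolves would still need to be checked by hand or with an HTTP client.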


From wjgo_10009 at btinternet.com  Thu Jun 17 08:29:00 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 17 Jun 2021 14:29:00 +0100 (BST)
Subject: Some language-independent glyphs for museum shops
Message-ID: <1e46d904.2a500.17a1a291e78.Webtop.100@btinternet.com>


Many museums and art galleries have online shops these days, and some 
will send items internationally, the customer paying for the items 
online by card.

Yet there is the language barrier.

In the 1970s, before the web, before cards that could be used 
internationally, I purchased some colour slides of paintings from the 
Uffizi in Florence and from the Louvre in Paris by mail order by writing 
letters in Italian and French respectively. I do not know what was the 
quality of my writing in those languages yet I did communicate 
effectively as I received replies in Italian and French respectively and 
I received the colour slides.

So what if one has symbols, precise emoji, language-independent glyphs, 
for the fields needed to make a card payment?

These symbols could be used either stand-alone or together with text in 
the language of the country in which the museum is located.

Some museums have guides and signage in several languages. Yet not in 
every language.

So language-independent glyphs could be a mini-pivot language to assist 
communication through the language barrier for a card purchase 
transaction.

I have produced ten glyph designs. Maybe a few more are needed, or maybe 
the colour scheme needs changing; please discuss. Presently I have, for 
each of the ten language-independent glyphs, a colourful version and a 
graceful fallback monochrome version.

There is an experimental colour font available, free to use.

Should such glyphs be encoded in regular Unicode, or as ligatures of 
a sequence of characters, or as QID emoji?

There is a thread where the font is being applied in examples.

https://forum.affinity.serif.com/index.php?/topic/144236-some-language-independent-glyphs-for-museum-shops/

William Overington

Thursday 17 June 2021


From asmusf at ix.netcom.com  Thu Jun 17 11:36:03 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 17 Jun 2021 09:36:03 -0700
Subject: Broken links to Unicode pages in current UCD files
In-Reply-To: <8df1f78b-9070-a6ac-ef68-7ed9bd277887@khwilliamson.com>
References: <8df1f78b-9070-a6ac-ef68-7ed9bd277887@khwilliamson.com>
Message-ID: <ab3c7ee6-d965-91b1-47da-dec00c37bf78@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20210617/1da16c04/attachment.htm>

From richard.wordingham at ntlworld.com  Sun Jun 20 23:43:37 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 21 Jun 2021 05:43:37 +0100
Subject: Question on combining character order
In-Reply-To: <457959e2-7564-d1ea-9cbe-324e8628954e@ix.netcom.com>
References: <CAGHjPPKhD8H8Ehnae1S+8S6pfj7Mbc=4SNg6BzhmMyptWtxB3A@mail.gmail.com>
 <457959e2-7564-d1ea-9cbe-324e8628954e@ix.netcom.com>
Message-ID: <20210621054337.0df9d97a@JRWUBU2>

On Sun, 20 Jun 2021 02:10:34 -0700
Asmus Freytag via Unicode <unicode at corp.unicode.org> wrote:

> The short answer is "no".
> 
> A longer answer is that typing order, display order and
> phonetic/semantic order do not have to agree.
> 
> On 6/20/2021 1:54 AM, Phake Nick via Unicode wrote:
> > Currently, in Unicode, combining characters like U+20DD or U+20DE
> > are to be placed after the main character to be combined.
> > But sometimes, linguistically, it makes sense for a combining mark to
> > come in front.
> > 
> > For example, the famous instant ramen brand, Maruchan, was
> > originally called "Maruto" in Japanese, as a spoken form of its
> > initial trade mark with the Japanese hiragana character "To" (standing
> > for the company's official name, Toyo Suisan) being placed inside a
> > circle ("Maru"). To replicate the sign using modern Unicode, users
> > would need to first input the Japanese Hiragana character "To",
> > then input the combining circle mark U+20DD as the maru, which
> > results in the reverse of the linguistic order in which such marks
> > are pronounced in Japanese.
> > 
> > Another example, in Cantonese, it is customary to create new
> > Chinese characters to express a Cantonese phoneme that doesn't have
> > an obvious connection with commonly known Chinese characters, by
> > attaching the component of a mouth (U+2F1D) onto other
> > similar-sounding Chinese characters with different meanings. For
> > example, Unicode character U+975A, meaning "beautiful", can have
> > the component of mouth attached to it, and become U+210C1, meaning
> > beautiful. Although in this particular example, the modified
> > character has also been encoded, on some platforms it might not be
> > supported by input method modifier or is otherwise difficult to
> > enter and thus people would input the deconstructed form. But due
> > to the lack of a small mouth component for combination, and because
> > combination of characters through Ideographic Description Sequences
> > is also not supported on most platforms, it is common for
> > people to use Latin small letter o, U+006F, to represent the
> > component. As the component is customarily written on the left side
> > of Chinese characters, and it is customary for Chinese characters to
> > be written from left to right, it would be usual for the additional
> > component to be keyed in before entering the character itself. As
> > such, if a combining character featuring the component of mouth is
> > to be introduced, it would make the most sense if the combining
> > mouth component is to be typed in before the character to be
> > modified, instead of the other way round.
> > 
> > Is there mechanism in Unicode that can support such type of
> > combining characters?

(Resending)

Yes, in various degrees.

1. Coeng characters (i.e. most invisible stackers) convert the
input-logically following consonant into a consonant character.
Category Mn.

2. 'Buoyant' consonants that sit on the hanging baseline above the rest
of the consonant stack, such as U+0D4E MALAYALAM LETTER DOT REPH
(category Lo) ...

3. ... and eastern U+1A58 TAI THAM SIGN MAI KANG LAI (category
Mn).
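
To check the properties of the characters mentioned in this thread, a small Python sketch such as the following can help. The helper name is made up for illustration, and the results depend on the Unicode version bundled with the Python in use.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, combining class, name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.combining(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# HIRAGANA LETTER TO followed by COMBINING ENCLOSING CIRCLE: the mark is
# stored after its base, which is the order the original question objects to.
maruto = "\u3068\u20DD"

# Add the two characters cited in items 2 and 3 for comparison.
for row in describe(maruto + "\u0D4E\u1A58"):
    print(row)
```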

There are problems with most of these:

1. Coeng characters are given a non-zero canonical combining class,
which can cause them to be separated from combining marks applied to
the previous base character.  That happens in Tai Tham, and would
happen in Kharoshthi if nuktas were applied to the initial characters
of conjoined characters.

3. The properties of U+1A58 are based on western usage, where it
functions as a final consonant, so grapheme clustering unites it with
the previous consonant.  Manipulating an isolated orthographic syllable
starting with it is awkward at best.
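
The effect of a non-zero canonical combining class in problem 1 can be observed with a short Python sketch. The function name is hypothetical. The Latin example is the textbook case; the Tai Tham sequence (consonant, tone mark, U+1A60 SAKOT, consonant) is my own choice of illustration, and the printed combining classes show whether any reordering applies in the Unicode data your Python ships with.

```python
import unicodedata

def canonical_order(text):
    """NFD-normalize text and return (code point, combining class) pairs."""
    nfd = unicodedata.normalize("NFD", text)
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in nfd]

# Familiar Latin case: acute (class 230) typed before dot below (class 220).
# Canonical ordering sorts adjacent marks by class, so the two swap places.
print(canonical_order("a\u0301\u0323"))

# Tai Tham: a tone mark on the first consonant, then SAKOT and a second
# consonant. If SAKOT's class is lower than the tone mark's, normalization
# moves SAKOT in front of the mark that belongs to the first consonant.
print(canonical_order("\u1A20\u1A75\u1A60\u1A20"))
```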

What you want are formally format characters (Cf), like the IDS
controls, but with a mandatory graphic effect, more like the control
characters for Egyptian hieroglyphs.  However, for Chinese character
composition, what you want might be better served by an Lo with
appropriate clustering and line-breaking operations.  It's time to move
on to 'every script is complex'.
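
For comparison, this is what the existing IDS approach looks like in Python. The sequence below (left-to-right operator, mouth U+53E3, then U+975A) is assumed from the description in the question. An IDS only describes a character, so it is not canonically equivalent to the encoded U+210C1, and nothing obliges a renderer to draw it as one glyph.

```python
import unicodedata

IDS = "\u2FF0\u53E3\u975A"   # left-to-right operator, mouth, U+975A
ENCODED = "\U000210C1"       # the encoded character cited in the question

def is_canonically_equal(a, b):
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The operator is an ordinary visible character, not a combining mark,
# and it comes before the components it describes.
print(unicodedata.name(IDS[0]), unicodedata.category(IDS[0]))
print(is_canonically_equal(IDS, ENCODED))
```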

Richard.
