From charupdate at orange.fr  Mon Jan  2 12:19:04 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 2 Jan 2017 19:19:04 +0100 (CET)
Subject: Marking up hexadecimal numbers (was: Re: a character for an unknown
 character)
In-Reply-To: <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
Message-ID: <2114913271.21841.1483381144279.JavaMail.www@wwinf1p19>


From charupdate at orange.fr  Mon Jan  2 14:57:46 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 2 Jan 2017 21:57:46 +0100 (CET)
Subject: Marking up hexadecimal numbers (was: Re: a character for an unknown
 character)
In-Reply-To: <2114913271.21841.1483381144279.JavaMail.www@wwinf1p19>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <2114913271.21841.1483381144279.JavaMail.www@wwinf1p19>
Message-ID: <383840487.25884.1483390666160.JavaMail.www@wwinf1p19>

I’ve messed up my previous e-mail by not converting HTML to text. Please disregard it.
The webmail I use applies HTML tags and deletes all unknown ones. Sorry.
On Sat, 31 Dec 2016 22:04:02 +0100 (CET), I wrote:
> On Sat, 31 Dec 2016 11:01:16 +0100, Christoph Päper wrote:
> > 
> > Richard Wordingham :
> > > 
> > >> Perhaps the letters for hexadecimal digits should have been encoded
> > >> separately?
> > > 
> > > The idea has been rejected several times.
> > 
> > It has indeed. That’s why two different technologies have to be used to get 
> > typographically harmonic hexadecimal numbers, e.g. in CSS:
> > 
> > .hex {font-variant-numeric: oldstyle-nums; text-transform: lowercase;}
> > .hex {font-variant-numeric: lining-nums; text-transform: uppercase;}
> > 
> > This works well enough for “01ef” or “01EF”, but will fail for conventions like 
> > “0x01ef” and “01EFh”. Hence:
> > 
> > .hex::before {content: "0x"; text-transform: none;}
> > .hex::after {content: "h"; text-transform: none;}
> > .hex::after {content: "?";}
> > .hex::after {content: "16"; vertical-align: sub; font-size: smaller; line-height: normal;}
> > .hex::after {content: "16"; font-variant-position: sub;}
> > .hex::after {content: "₁₆";}
> 
> Thank you for the code. I didn’t know this, so I’ve tried and found that 
> the automatic prefixes/suffixes cannot be copied from the web page. 
> That seems to me a disadvantage.
> 
> Among the possibilities, you include Unicode subscripts. Is this current 
> practice? That seems to me very interesting to follow up, as it documents 
> that the stable representation scheme is already adopted. I’m curious to 
> what extent it is so. 
> 
[…]
> 
> I note that the "U+" prefix is missing in the list, obviously because it 
> denotes more than just a hexadecimal number, and is to be hard-coded. 
[…]

Alternatively, the CSS style derived from the above could be:

.unicode {font-variant-numeric: lining-nums; text-transform: uppercase;}
.unicode::before {content: "U+"; text-transform: none;}

But again, when readers copy such a scalar value, they get it without 'U+'.
Hence the idea that '<span class="unicode">[[H]H]HHHH</span>' could be 
parsed to add the prefix after the open tag, making the 
second line above unnecessary. 
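
The parsing step suggested here could look roughly like the following sketch. 
Only the class name "unicode" comes from the CSS above; the function name and 
the regular expression are assumptions made for illustration, not an existing tool.

```python
import re

# Hypothetical pattern: a span with class "unicode" holding 4 to 6 hex digits.
SPAN = re.compile(r'(<span class="unicode">)([0-9A-Fa-f]{4,6})(</span>)')


def hardcode_prefix(html: str) -> str:
    """Insert a literal 'U+' after the open tag and uppercase the hex digits."""
    return SPAN.sub(
        lambda m: f"{m.group(1)}U+{m.group(2).upper()}{m.group(3)}",
        html,
    )


if __name__ == "__main__":
    print(hardcode_prefix('<span class="unicode">1f600</span>'))
```

Because the prefix becomes part of the text content, it is copied along with the 
digits, and a second pass leaves already prefixed spans untouched.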

Similarly, '<span class="hex">HHHH</span>' can be complemented with '₁₆', 
or with '0x' or '\x' or whatever, as hard-coded additions by a parser. 
This has IMO two advantages:

1) When the user copies hex numbers from the browser, they stay prefixed 
or suffixed as such.

2) When the user pastes hex numbers into a text editor, they’re not messed up 
(applies to the '₁₆' suffix, vs. the '_{16}' suffix). Otherwise, a hex number like 
'1A19₁₆' is turned into '1A1916'.
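
To illustrate the second point, here is a minimal sketch of a hard-coded subscript 
base suffix that survives plain text and can be read back. The helper names are 
hypothetical; the subscript digits are U+2080..U+2089.

```python
SUBSCRIPT_CHARS = "₀₁₂₃₄₅₆₇₈₉"  # U+2080..U+2089
TO_SUBSCRIPT = str.maketrans("0123456789", SUBSCRIPT_CHARS)
FROM_SUBSCRIPT = str.maketrans(SUBSCRIPT_CHARS, "0123456789")


def with_base_suffix(digits: str, base: int = 16) -> str:
    """Append the base as Unicode subscript digits (hypothetical helper)."""
    return digits + str(base).translate(TO_SUBSCRIPT)


def parse_with_base_suffix(text: str) -> int:
    """Split off the trailing subscript digits and use them as the base."""
    body = text.rstrip(SUBSCRIPT_CHARS)
    suffix = text[len(body):]
    if not suffix:
        raise ValueError("no subscript base suffix found")
    return int(body, int(suffix.translate(FROM_SUBSCRIPT)))


if __name__ == "__main__":
    marked = with_base_suffix("1A19")
    print(marked, parse_with_base_suffix(marked))
```

With styled ASCII digits instead, the pasted text would read '1A1916' and the 
boundary between number and base could no longer be recovered.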

The current policy is certainly based on the classification of hexadecimal numbers 
(and numbers in other non-decimal numeral systems) as mathematical notation, 
rather than technical notation. On a broad reading of TUS, all measurement units 
are granted the use of superscript digits '²' and '³'. Could this policy be 
extended to include subscript '₁' and '₆'? This may seem an odd question, and 
answering it in the affirmative might throw the door open to wider use of 
Latin superscripts, first in historical data ('Vᵉ s.'), then in more general data.

As the upside I see content stability and streamlined input (provided that the 
input interface is up to date). Disparity in display may be considered a downside, 
since only fonts that have reduced capitals (Consolas, Lucida Console, Courier) 
render modifier letters exactly like superscripts / ordinal indicators. I’ve 
got into the habit of using modifier letters in abbreviations, and I find 
they look good in other fonts too.

Right now, it is just a matter of putting them on the keyboard and telling the user: “Please use
them if you are comfortable with them; their original encoding for phonetics does not 
preclude re-use and diversification of usage conventions.” Some 
explanation needs to be delivered, because people who know something about Unicode 
typically raise the sometimes passionate refrain that these characters are 
for use in phonetics only.

Indeed, through the current wording of the relevant parts of the Unicode Standard, 
Unicode is fueling its own misperception.

Some hints to the contrary, ideally in TUS 10.0 to be published this year (2017), 
would (in my opinion) be highly appreciated, though of course that is not enough to 
make people really happy.


Marcel


From christoph.paeper at crissov.de  Tue Jan  3 02:31:42 2017
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Tue, 3 Jan 2017 09:31:42 +0100
Subject: a character for an unknown character
In-Reply-To: <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
Message-ID: <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>

Marcel Schneider <charupdate at orange.fr>:
> On Sat, 31 Dec 2016 11:01:16 +0100, Christoph Päper wrote:
>> 
>> It has indeed. That’s why two different technologies have to be used to get 
>> typographically harmonic hexadecimal numbers, e.g. in CSS: …
> 
> Thank you for the code. I didn’t know this,

Well, case-insensitivity was intended as *an* argument in favor of encoding digits A–F/a–f, although I know that there are also good arguments against it. (There are certainly also arguments in favor of encoding 0–9 another time just for hexadecimal numbers.)

> so I’ve tried and found that 
> the automatic prefixes/suffixes cannot be copied from the web page. 

Browsers still disagree about that, but yes, since the affix is content generated by CSS, it is considered style and is likely not to get pasted into plain-text environments. One could also argue that CSS should be able to render numbers in different styles and bases, but that’s currently neither supported nor planned.

> Among the possibilities, you include Unicode subscripts.

Just for the sake of completeness.

> The font-variant-numeric: oldstyle-nums seems not to work with any font. 

Browser and font support is required, and it is limited, though less so than a few years ago.

> I note that the "U+" prefix is missing in the list, obviously because it 
> denotes more than just a hexadecimal number, and is to be hard-coded.

Yes, I was talking about hexadecimal numbers in general, not limited to Unicode code points.

From drott at google.com  Tue Jan  3 07:14:26 2017
From: drott at google.com (=?UTF-8?Q?Dominik_R=C3=B6ttsches?=)
Date: Tue, 3 Jan 2017 15:14:26 +0200
Subject: Leading ZWJ in Emoji sequences page
Message-ID: <CAN6muBujzyZXOCmCe4OLs5Ow5jY7SwQyKZaSaXTPenWLQBTnoQ@mail.gmail.com>

Hi Mark, others,

in http://unicode.org/emoji/charts/emoji-zwj-sequences.html as well as in
the beta 5.0 version of this page, some of the "Browser" fields have a
leading ZWJ.

Compare copying the full cell contents to the URL bar after "codepoints.net/"
for example and it shows the leading ZWJ.

I suggest removing those, as this can lead to unexpected text selection
behavior in browsers, for example.
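
As a rough illustration of the check described above, the following sketch flags
and strips a leading ZWJ (U+200D) from a cell value while leaving the joiners
inside the sequence alone. The function names and the sample sequence are
assumptions for illustration only, not taken from the chart's tooling.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def has_leading_zwj(cell: str) -> bool:
    """True if the cell text starts with a ZWJ."""
    return cell.startswith(ZWJ)


def strip_leading_zwj(cell: str) -> str:
    """Remove ZWJs at the start only; joiners inside the sequence stay."""
    return cell.lstrip(ZWJ)


if __name__ == "__main__":
    # Hypothetical cell: a family sequence with a stray leading joiner.
    family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"
    cell = ZWJ + family
    print(has_leading_zwj(cell), strip_leading_zwj(cell) == family)
```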

Regards,

Dominik

From mark at macchiato.com  Tue Jan  3 07:25:52 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 3 Jan 2017 14:25:52 +0100
Subject: Leading ZWJ in Emoji sequences page
In-Reply-To: <CAN6muBujzyZXOCmCe4OLs5Ow5jY7SwQyKZaSaXTPenWLQBTnoQ@mail.gmail.com>
References: <CAN6muBujzyZXOCmCe4OLs5Ow5jY7SwQyKZaSaXTPenWLQBTnoQ@mail.gmail.com>
Message-ID: <CAJ2xs_FMrDm0ZvQBTGvHyHMEi1jjmQR6UmBqXKi-ggurj91TOw@mail.gmail.com>

Thanks for catching this!

Mark

On Tue, Jan 3, 2017 at 2:14 PM, Dominik Röttsches <drott at google.com> wrote:

> Hi Mark, others,
>
> in http://unicode.org/emoji/charts/emoji-zwj-sequences.html as well as in
> the beta 5.0 version of this page, some of the "Browser" fields have a
> leading ZWJ.
>
> Compare copying the full cell contents to the URL bar after "
> codepoints.net/" for example and it shows the leading ZWJ.
>
> I suggest removing those, as this can lead to unexpected text selection
> behavior in browsers, for example.
>
> Regards,
>
> Dominik
>
>

From charupdate at orange.fr  Tue Jan  3 18:24:52 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 4 Jan 2017 01:24:52 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use (was: Re: a
 character for an unknown character)
In-Reply-To: <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
Message-ID: <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>


On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:

> > Among the possibilities, you include Unicode subscripts. 
>
> Just for the sake of completeness. 

This suggests that preformatted subscripts really are an option here. 
The TUS snippets [1][2] and common practice show that whatever characters are 
on the keyboard get used or re-used as superscripts, such as the degree sign 
as superscript o, and the feminine ordinal indicator as superscript a. Layouts 
are bafflingly inconsistent across countries: the Belgian AZERTY layout has 
SUPERSCRIPT THREE where its French (France) counterpart has an empty shift state, 
while SUPERSCRIPT ONE is missing on both, despite the AltGr shift state being 
only partially used and all three superscript digits being part of Latin-1. Thus, 
awareness of the usefulness of a given character does not always correlate with its 
presence on the keyboard.

In the Unicode era, this may expand into the insight that the availability 
of an almost complete range of superscripts, and of a set of subscripts, including 
Latin letters, calls for adding them to national keyboard layouts to cater 
for the demand of increasingly important user groups and communities. Supporting 
this does not necessarily require the Unicode Standard to be reworded, because 
TUS mainly reflects encoding principles and usage recommendations, without being 
a typography manual. 

TUS 9.0, §22.4, p. 786, explains that the recommendation not to use preformatted 
characters outside phonetics is a mere application of a design principle, 
regardless of the practical usefulness of the scheme. I note that in the snippet
quoted below, the number “DC0016” is already messed up by copy-pasting it to 
plain text. By contrast, copying it from Adobe Reader to Microsoft Word carries over 
the font size difference, but not the vertical alignment, presumably 
because the original specifies a custom subscript style that has no generic 
subscripting information and is not cross-platform compatible. This example 
highlights a serious downside of the markup-based representation scheme.

As demonstrated with the apostrophe, a recommendation may be changed according 
to common practice, and reconsidered in the light of differently weighed rules 
and principles, in line with what Asmus Freytag pointed out on December 28ᵗʰ, 2016, 
in reply to Richard Wordingham:

> > > > Ideal solutions can also be defeated by limited keyboard layouts. As a 
> > > > result, I have no idea whether the singular of "fithp" (one of Larry 
> > > > Niven's alien species) should be spelt with U+02BC or U+2019, though in 
> > > > ASCII I can just write "fi'". 
> > > 
> > > The only place where "uni" doesn't apply in Unicode is that there's never 
> > > just a single principle that applies, but always multiple ones that are 
> > > in tension --- and in the edge cases, the tension can be felt keenly. 
> > > 

As seen in another example in a 2015 thread on plain text custom fractions, 
the English Microsoft Community website is hosting recommendations on how to 
insert fractions made of superscripts, subscripts and the fraction slash U+2044, 
using a list of autocorrections in Word. To test, I’ve added to the autocorrect 
list four items converting '.s.' to 'ˢᵗ', '.n.' to 'ⁿᵈ', '.r.' to 'ʳᵈ', '.t.' to 'ᵗʰ'. 
The result looks fine in Cambria and bad in incomplete fonts mixed with a 
fallback font, while Arial has the superscript 'n' in a non-standard form, 
as a legacy remainder, despite TUS specifying that all those characters 
should be harmonized. 
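
The two plain-text techniques mentioned above can be sketched as follows. This is 
only an illustration: the function names are hypothetical, and the ordinal 
replacements are an assumed reading of the autocorrect list described above 
(modifier letters for st, nd, rd, th).

```python
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")
SUBSCRIPT_DIGITS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")
FRACTION_SLASH = "\u2044"

# Assumed autocorrect entries: typed shorthand -> modifier-letter ordinal suffix.
ORDINAL_AUTOCORRECT = {
    ".s.": "\u02e2\u1d57",  # modifier letters s, t
    ".n.": "\u207f\u1d48",  # superscript n, modifier letter d
    ".r.": "\u02b3\u1d48",  # modifier letters r, d
    ".t.": "\u1d57\u02b0",  # modifier letters t, h
}


def plain_text_fraction(numerator: int, denominator: int) -> str:
    """Superscript numerator, U+2044 FRACTION SLASH, subscript denominator."""
    return (
        str(numerator).translate(SUPERSCRIPT_DIGITS)
        + FRACTION_SLASH
        + str(denominator).translate(SUBSCRIPT_DIGITS)
    )


def apply_autocorrect(text: str) -> str:
    """Replace each shorthand with its preformatted ordinal suffix."""
    for shorthand, replacement in ORDINAL_AUTOCORRECT.items():
        text = text.replace(shorthand, replacement)
    return text


if __name__ == "__main__":
    print(plain_text_fraction(3, 4), apply_autocorrect("28.t."))
```

How well the result displays still depends on font coverage, as noted above.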

It’s up to the user to choose the best-fitting option depending on usage 
and environment. As already discussed, formatting is a working solution 
on the condition that plain text will never be a requirement.

I hope that this lengthy contribution may help to clear the way for 
users to feel free to use superscript and subscript characters the way 
they prefer.


Marcel

[1] TUS 9.0, §22.4, p. 786:
|
| In general, the Unicode Standard does not attempt to describe the positioning 
| of a character above or below the baseline in typographical layout. 
| Therefore, the preferred means to encode superscripted letters or digits, 
| such as “1st” or “DC0016”, is by style or markup in rich text. […]
| In addition, superscript digits are used to indicate tone in transliteration 
| of many languages. The use of superscript two and superscript three is common 
| legacy practice when referring to units of area and volume in general texts.
|
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G42931

[2] TUS 9.0, §7.8, p. 327:
|
| The superscript forms of the i and n letters can be found in the
| Superscripts and Subscripts block (U+2070..U+209F). The fact that the latter 
| two letters contain the word “superscript” in their names instead of “modifier 
| letter” is an historical artifact of original sources for the characters, and 
| is not intended to convey a functional distinction in the use of these 
| characters in the Unicode Standard.
| 
| Superscript modifier letters are intended for cases where the letters carry 
| a specific meaning, as in phonetic transcription systems, and are not 
| a substitute for generic styling mechanisms for superscripting of text, 
| as for footnotes, mathematical and chemical expressions, and the like.
|
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762


From asmusf at ix.netcom.com  Tue Jan  3 21:20:42 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 3 Jan 2017 19:20:42 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
Message-ID: <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>

On 1/3/2017 4:24 PM, Marcel Schneider wrote:
> On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
>
>>> Among the possibilities, you include Unicode subscripts.
>> Just for the sake of completeness.
> This tends to conclude that preformatted subscripts are really an option here.

Not so. You yourself quote this statement:

| Superscript modifier letters are intended for cases where the letters carry
| a specific meaning, as in phonetic transcription systems, and are not
| a substitute for generic styling mechanisms for superscripting of text,
| as for footnotes, mathematical and chemical expressions, and the like.

It is clear that the uses that you advocate go against this intent.

Therefore, your conclusion that this is "an option" is nothing more than 
a very personal opinion on your part (and one that many people here would 
consider misguided if presented as a general recommendation).

A./

From john.w.kennedy at gmail.com  Tue Jan  3 23:36:38 2017
From: john.w.kennedy at gmail.com (John W Kennedy)
Date: Wed, 4 Jan 2017 00:36:38 -0500
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
Message-ID: <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>


> On Jan 3, 2017, at 10:20 PM, Asmus Freytag <asmusf at ix.netcom.com> wrote:
> 
> On 1/3/2017 4:24 PM, Marcel Schneider wrote:
>> On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
>> 
>>>> Among the possibilities, you include Unicode subscripts.
>>> Just for the sake of completeness.
>> This tends to conclude that preformatted subscripts are really an option here.
> 
> Not so. You yourself quote this statement:
> 
> | Superscript modifier letters are intended for cases where the letters carry
> | a specific meaning, as in phonetic transcription systems, and are not
> | a substitute for generic styling mechanisms for superscripting of text,
> | as for footnotes, mathematical and chemical expressions, and the like.
> 
> It is clear that the uses that you advocate go against this intent.
> 
> Therefore, your conclusion that this is "an option" is nothing more than a very personal
> opinion on your part (and one that many people here would consider misguided if
> presented as general recommendation).
> 
> A./

As long as this is being discussed, what about the historic practice of using M‘ (nowadays often seen as M’ instead) in Scottish names (e.g., M‘Donald) as a typographic substitute for M(superscript c)?

-- 
John W Kennedy
Having switched to a Mac in disgust at Microsoft's combination of incompetence and criminality.


From asmusf at ix.netcom.com  Wed Jan  4 00:48:09 2017
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Tue, 3 Jan 2017 22:48:09 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
Message-ID: <e251aaad-15b8-158f-da5c-db7eb73862f5@ix.netcom.com>

On 1/3/2017 9:36 PM, John W Kennedy wrote:
> As long as this is being discussed, what about the historic practice of using M‘ (nowadays often seen as M’ instead) in Scottish names (e.g., M‘Donald) as a typographic substitute for M(superscript c)?

What about it? There are dozens, perhaps hundreds of fallbacks that have 
been used over time, both in hot metal typography as well as with 
typewriters or digital systems. Some practices may have started in ways 
similar to a fallback, but have now evolved into standard practice. 
Other ones remain fallbacks or went out of fashion.

It's an interesting example, but what kind of discussion did you have in 
mind?

A./


From duerst at it.aoyama.ac.jp  Wed Jan  4 02:12:00 2017
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Wed, 4 Jan 2017 17:12:00 +0900
Subject: IdnaTest.txt and RFC 5893
In-Reply-To: <C5D0A6E1-FC91-4443-A90A-63B7382BA410@alastairs-place.net>
References: <C5D0A6E1-FC91-4443-A90A-63B7382BA410@alastairs-place.net>
Message-ID: <fd5b315a-bf21-52a0-7144-c95ea4298307@it.aoyama.ac.jp>

Hello Alastair,

On 2016/12/06 20:51, Alastair Houghton wrote:
> Hi all,
>
> I must be missing something; in IdnaTest.txt, in the BIDI TESTS section, there are examples like (line 74)

Can you tell us where you got IdnaTest.txt from?

>   B;	0à.\u05D0;	;	xn--0-sfa.xn--4db	#	0à.א
>
> which the file alleges are valid, but I cannot for the life of me see why.  First, “0à.א” is clearly a “Bidi domain name” since it has at least one RTL label, “א”.  As such, the Bidi Rule (RFC 5893 section 2) should be applied to its labels, and the label “0à” fails [B1], since the first character has Bidi property EN, not L, R or AL.

On first sight, it looks to me as if you're correct.

For the exact interpretation of RFC 5893, you'd better write to the 
mailing list of the former IDNA(bis) WG at idna-update at alvestrand.no.

Regards,   Martin.

> Similarly (line 93)
>
>   B;	àˇ.\u05D0;	;	xn--0ca88g.xn--4db	#	àˇ.א
>
> Again, “àˇ.א” is clearly a “Bidi domain name”, but “àˇ” fails [B6], because “ˇ” has Bidi property ON, not L, EN or NSM.
>
> Have I misunderstood something fundamental here?  Could someone explain why those examples are valid, in spite of RFC 5893?
>
> Kind regards,
>
> Alastair.
>
> --
> http://alastairs-place.net
>
>
> .
>

-- 
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan
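
For anyone wishing to reproduce the two observations in this thread, a minimal
sketch follows. It decodes the Punycode labels from the quoted test lines and
applies only conditions B1 and B6 of RFC 5893, section 2, as the thread describes
them. It is not a full implementation of the Bidi Rule, the function names are
hypothetical, and the results depend on the Unicode data shipped with the Python
version in use.

```python
import unicodedata


def decode_label(label: str) -> str:
    """Decode an 'xn--' label with Python's built-in punycode codec."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def is_bidi_domain(labels) -> bool:
    """A domain with at least one character of Bidi class R, AL or AN."""
    return any(
        unicodedata.bidirectional(ch) in {"R", "AL", "AN"}
        for label in labels
        for ch in label
    )


def fails_b1(label: str) -> bool:
    """B1: the first character must have Bidi class L, R or AL."""
    return unicodedata.bidirectional(label[0]) not in {"L", "R", "AL"}


def fails_b6(label: str) -> bool:
    """B6, for a label starting with L: it must end in L or EN,
    optionally followed by NSM characters."""
    chars = list(label)
    while chars and unicodedata.bidirectional(chars[-1]) == "NSM":
        chars.pop()
    return not chars or unicodedata.bidirectional(chars[-1]) not in {"L", "EN"}


if __name__ == "__main__":
    first = [decode_label(part) for part in "xn--0-sfa.xn--4db".split(".")]
    second = [decode_label(part) for part in "xn--0ca88g.xn--4db".split(".")]
    print(is_bidi_domain(first), fails_b1(first[0]))
    print(is_bidi_domain(second), fails_b6(second[0]))
```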

From verdy_p at wanadoo.fr  Wed Jan  4 02:12:43 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 4 Jan 2017 09:12:43 +0100
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
Message-ID: <CAGa7JC2PrB1WskV-MuXGfShAtPvRP7cGwXu42bGVWwhYMZxwKw@mail.gmail.com>

This is the traditional use of the apostrophe to mark an elision
at the end of words. Nothing new.

2017-01-04 6:36 GMT+01:00 John W Kennedy <john.w.kennedy at gmail.com>:

>
> > On Jan 3, 2017, at 10:20 PM, Asmus Freytag <asmusf at ix.netcom.com> wrote:
> >
> > On 1/3/2017 4:24 PM, Marcel Schneider wrote:
> >> On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
> >>
> >>>> Among the possibilities, you include Unicode subscripts.
> >>> Just for the sake of completeness.
> >> This tends to conclude that preformatted subscripts are really an option here.
> >
> > Not so. You yourself quote this statement:
> >
> > | Superscript modifier letters are intended for cases where the letters carry
> > | a specific meaning, as in phonetic transcription systems, and are not
> > | a substitute for generic styling mechanisms for superscripting of text,
> > | as for footnotes, mathematical and chemical expressions, and the like.
> >
> > It is clear that the uses that you advocate go against this intent.
> >
> > Therefore, your conclusion that this is "an option" is nothing more than a very personal
> > opinion on your part (and one that many people here would consider misguided if
> > presented as general recommendation).
> >
> > A./
>
> As long as this is being discussed, what about the historic practice of
> using M‘ (nowadays often seen as M’ instead) in Scottish names (e.g.,
> M‘Donald) as a typographic substitute for M(superscript c)?
>
> --
> John W Kennedy
> Having switched to a Mac in disgust at Microsoft's combination of
> incompetence and criminality.

From alastair at alastairs-place.net  Wed Jan  4 04:28:38 2017
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Wed, 4 Jan 2017 10:28:38 +0000
Subject: IdnaTest.txt and RFC 5893
In-Reply-To: <fd5b315a-bf21-52a0-7144-c95ea4298307@it.aoyama.ac.jp>
References: <C5D0A6E1-FC91-4443-A90A-63B7382BA410@alastairs-place.net>
 <fd5b315a-bf21-52a0-7144-c95ea4298307@it.aoyama.ac.jp>
Message-ID: <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>

On 4 Jan 2017, at 08:12, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
> 
> Hello Alastair,
> 
> On 2016/12/06 20:51, Alastair Houghton wrote:
>> Hi all,
>> 
>> I must be missing something; in IdnaTest.txt, in the BIDI TESTS section, there are examples like (line 74)
> 
> Can you tell us where you got IdnaTest.txt from?

Yes, sorry, I should have included that information.  It’s here, with the IDNA mapping table

  http://www.unicode.org/Public/idna/9.0.0/

which I arrived at from UTS #46 (<http://www.unicode.org/reports/tr46>).

>>  B;	0à.\u05D0;	;	xn--0-sfa.xn--4db	#	0à.א
>> 
>> which the file alleges are valid, but I cannot for the life of me see why.  First, “0à.א” is clearly a “Bidi domain name” since it has at least one RTL label, “א”.  As such, the Bidi Rule (RFC 5893 section 2) should be applied to its labels, and the label “0à” fails [B1], since the first character has Bidi property EN, not L, R or AL.
> 
> On first sight, it looks to me as if you're correct.
> 
> For the exact interpretation of RFC 5893, you'd better write to the mailing list of the former IDNA(bis) WG at idna-update at alvestrand.no.

RFC 5893 seems pretty clear to me, and the problem really is that the test vectors (which come from unicode.org) seem (to me) to be incorrect.  I think the Unicode list is, therefore, the right place to raise this issue, but you’re right that it might attract attention from the right people if I also fire off a mail to the IDNA WG list.

>> Similarly (line 93)
>> 
>>  B;	àˇ.\u05D0;	;	xn--0ca88g.xn--4db	#	àˇ.א
>> 
>> Again, “àˇ.א” is clearly a “Bidi domain name”, but “àˇ” fails [B6], because “ˇ” has Bidi property ON, not L, EN or NSM.
>> 
>> Have I misunderstood something fundamental here?  Could someone explain why those examples are valid, in spite of RFC 5893?

As an additional data point, ICU’s IDNA demo web page appears to think these names are OK.
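
The two checks discussed above can be reproduced with a short Python sketch. It is only an illustration, not a full RFC 5893 implementation: it covers conditions B1 and B6 alone, the helper names (`is_bidi_domain`, `passes_b1`, `passes_b6`) are made up for this example, and the Bidi classes come from whatever Unicode version the local Python ships with.

```python
import unicodedata


def bidi(ch):
    """Return the Bidi_Class of a single character, e.g. 'L', 'R', 'EN', 'ON'."""
    return unicodedata.bidirectional(ch)


def is_bidi_domain(labels):
    """A name is a Bidi domain name if any label contains an R, AL or AN character."""
    return any(bidi(ch) in {"R", "AL", "AN"} for label in labels for ch in label)


def passes_b1(label):
    """B1: the first character must have Bidi class L, R or AL."""
    return bidi(label[0]) in {"L", "R", "AL"}


def passes_b6(label):
    """B6 (LTR labels): the label must end in L or EN, followed by zero or more NSM."""
    classes = [bidi(ch) for ch in label]
    while classes and classes[-1] == "NSM":
        classes.pop()
    return bool(classes) and classes[-1] in {"L", "EN"}


# The two test vectors from IdnaTest.txt, split into labels.
first = ["0\u00e0", "\u05d0"]        # 0à . א
second = ["\u00e0\u02c7", "\u05d0"]  # àˇ . א

for labels in (first, second):
    print([label.encode("punycode") for label in labels],
          "bidi domain:", is_bidi_domain(labels),
          "B1:", passes_b1(labels[0]),
          "B6:", passes_b6(labels[0]))
```

Under these assumptions the first label of the first vector fails B1 (it starts with an EN digit) and the first label of the second vector fails B6 (it ends in an ON character), which matches the reading of RFC 5893 given above.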

Kind regards,

Alastair.

--
http://alastairs-place.net


From john.w.kennedy at gmail.com  Wed Jan  4 05:44:14 2017
From: john.w.kennedy at gmail.com (John W Kennedy)
Date: Wed, 4 Jan 2017 06:44:14 -0500
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <CAGa7JC2PrB1WskV-MuXGfShAtPvRP7cGwXu42bGVWwhYMZxwKw@mail.gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
 <CAGa7JC2PrB1WskV-MuXGfShAtPvRP7cGwXu42bGVWwhYMZxwKw@mail.gmail.com>
Message-ID: <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>

No, it isn’t. It isn’t an apostrophe; it’s a left single quote, although some modern printers mistakenly suppose it to be an apostrophe, and substitute one. And it isn’t an elision; it’s meant as a substitute glyph for a superscript c. (I confess that, not being from Scotland, I thought it to be an elision for over fifty years, but when I was preparing a transcription of William Dunlap’s “André: a Tragedy in Five Acts” [New York, 1798], in which a character named “M‘Donald” plays a major role, I looked into the matter, and was surprised to learn the truth.)



From verdy_p at wanadoo.fr  Wed Jan  4 06:43:50 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 4 Jan 2017 13:43:50 +0100
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
 <CAGa7JC2PrB1WskV-MuXGfShAtPvRP7cGwXu42bGVWwhYMZxwKw@mail.gmail.com>
 <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
Message-ID: <CAGa7JC0eA=RjMP9OwtUiNe15VBUA-_KbWAuagBBzBApc9FyiNQ@mail.gmail.com>

Linguistically, it is an apostrophe, even if it's represented by a single
quote (same as in French), because the "letter apostrophe" is not used
(that letter apostrophe was encoded in Unicode very late and there's no
desire to change the mappings in French or Scottish). If you think it is a
substitute only because of its very superficial resemblance to a superscript
c, I think it is just a hack used by some old printer that did not have
that letter in their case box. In 1798 printing a book was expensive and
metal fonts were also costly, and writers accepted some minor transformations
of their manuscript by the printer (and frequent typos as well). Later
re-editions frequently correct these typos.

Note that in French the right single quote is normally not used at all as a
quotation mark, and when it appears between two letters it is unambiguously
an apostrophe. I think the letter apostrophe was added later in Unicode
only for English, to allow distinctions, but I've rarely seen it used. Later
it was used as a substitute for a glottal stop in some
Polynesian/Melanesian languages, but the actual character was encoded and is
preferable (its glyph is distinctive).



From moyogo at gmail.com  Wed Jan  4 07:30:12 2017
From: moyogo at gmail.com (Denis Jacquerye)
Date: Wed, 04 Jan 2017 13:30:12 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <CAGa7JC0eA=RjMP9OwtUiNe15VBUA-_KbWAuagBBzBApc9FyiNQ@mail.gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
 <CAGa7JC2PrB1WskV-MuXGfShAtPvRP7cGwXu42bGVWwhYMZxwKw@mail.gmail.com>
 <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
 <CAGa7JC0eA=RjMP9OwtUiNe15VBUA-_KbWAuagBBzBApc9FyiNQ@mail.gmail.com>
Message-ID: <CAJKta0xt92eA9ux9TaK5yTqCC1xUi6H2W=A1_bNKGekrcps_WQ@mail.gmail.com>

Philippe, you are talking about 0027 APOSTROPHE, 2019 RIGHT SINGLE
QUOTATION MARK and 02BC MODIFIER LETTER APOSTROPHE.
John is clearly talking about 2018 LEFT SINGLE QUOTATION MARK (or if you
want to stretch it 02BB MODIFIER LETTER TURNED COMMA) being used as a
substitute for superscript c.
They all look alike at small size or in some fonts, which explains your
misunderstanding even if John was explicit about it being a left single
quote.
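
For readers who want to check which character a given text actually contains, the five code points named above can be listed with Python's standard `unicodedata` module. This is just a lookup sketch; the `describe` helper is a name invented for this example.

```python
import unicodedata

# The look-alike characters mentioned in this thread.
CANDIDATES = ["\u0027", "\u2019", "\u02BC", "\u2018", "\u02BB"]


def describe(ch):
    """Format a character as 'U+XXXX NAME (General_Category)'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


for ch in CANDIDATES:
    # Show each candidate in the M?Donald position for comparison.
    print(describe(ch), "->", f"M{ch}Donald")
```

The general category makes the distinction visible even when the glyphs do not: the two modifier letters are letters (Lm), while the quotation marks and U+0027 are punctuation.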


From charupdate at orange.fr  Wed Jan  4 08:20:40 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 4 Jan 2017 15:20:40 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
Message-ID: <106567611.8574.1483539640797.JavaMail.www@wwinf1k14>

On Wed, 4 Jan 2017 00:36:38 -0500, Asmus Freytag wrote:
> 
> On 1/3/2017 4:24 PM, Marcel Schneider wrote:
> > On Tue, 3 Jan 2017 09:31:42 +0100, Christoph Päper wrote:
> >
> >>> Among the possibilities, you include Unicode subscripts.
> >> Just for the sake of completeness.
> > This tends to conclude that preformatted subscripts are really an option here.
> 
> Not so. You yourself quote this statement:
> 
> | Superscript modifier letters are intended for cases where the letters carry
> | a specific meaning, as in phonetic transcription systems, and are not
> | a substitute for generic styling mechanisms for superscripting of text,
> | as for footnotes, mathematical and chemical expressions, and the like.
> 
> It is clear that the uses that you advocate go against this intent.

This is because, even when complemented with UAXes and TRs, the Core Specification 
cannot cover the whole of practice. It seems that, to stay within reasonable limits, 
a significant number of use cases have been left out; e.g. the mentioned use of 
plain text for styled custom vulgar fractions is a recognized practice, but stays 
persistently excluded from TUS. However, since including it could amount to 
adding three lines to the text, there is more to it. For technical as well 
as ethical reasons, Unicode is unable to promote the usages discussed, yet it 
does not strongly discourage them. The snippet above [1] would be less harsh at 
the expense of some redundancy:

| Superscript modifier letters are intended for cases where the letters carry
| a specific meaning, as in phonetic transcription systems, and are not INTENDED 
| AS a substitute for generic styling mechanisms for superscripting of text,
| as for footnotes, mathematical and chemical expressions, and the like.

This resolves to the meaning that super-/subscripting in more or less ordinary 
text is outside the design principles of the Unicode Standard, because the 
boundary between the feasible and the unfeasible would be hard to draw, as shown 
with the recent example of the plain text database for chemical formulas. So to 
protect itself against the temptation of drawing that boundary (drawing it at risk 
of being subsequently compelled to move it further), Unicode *declares* those 
characters as being *intended for* special contexts, according to their very 
encoding history.

Trying to understand to what extent this principle is applicable, I note that 
the three cited examples currently imply much more formatting than superscripting. 
This is the case of structural formulae in _chemistry_, complex _mathematical_ 
expressions, and _footnote_ management and layout. By contrast, when it’s only 
about super- or subscripting a few digits or Latin letters, markup and use 
of rich text may be considered overkill. And in the case of content that the 
reader may wish to copy-paste, things like the “16” affix of hex numbers should 
remain distinct. Hence, styling is only “the preferred means”, not the mandatory 
way to represent superscript letters or digits.[2] And this is tied to a /design/ 
principle of the Standard. I believe that /usage/ principles may diverge.

> 
> Therefore, your conclusion that this is "an option" is nothing more than 
> a very personal opinion on your part (and one that many people here would 
> consider misguided if presented as general recommendation).

Presenting this as a general recommendation was indeed what I intended when starting 
the first thread of this discussion. Thanks to your and other subscribers’ replies, 
I’ve come to the insight that this cannot be recommended throughout, not in a 
general way. However, why this should not be "an option" still remains very unclear to me. 
As a result of prior discussions, we know that other list participants do use e.g. 
superscript characters in a more extensive way. 

I think there are two levels of action: 

(1) to encode new preformatted characters;
(2) to encourage re-use of already existing ones. 

I understand that Unicode is consistently reluctant in both, while ISO/IEC is able 
to do more in (1) given that they sometimes add (or remove) characters to(/from) 
the new repertoire, and National Bodies are in a position to do (2) through usage 
recommendations of their own. Let alone all the other people who may use or not 
use available preformatted characters for any purpose, eventually sharing the hint 
and, in the best case, the means to input them efficiently. 

Or am I missing something?

Given that the WG of the French standard keyboard is actually interested in getting 
encoded a new ordinal indicator (kind of '?'), I feel the more urged to stay tuned, 
and to comment on subsequent e-mails, too.

Marcel

[1] TUS 9.0, §7.8, p. 327.
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762

[2] TUS 9.0, §22.4, p. 786.
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G42931


From alastair at alastairs-place.net  Wed Jan  4 09:13:36 2017
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Wed, 4 Jan 2017 15:13:36 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <106567611.8574.1483539640797.JavaMail.www@wwinf1k14>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <106567611.8574.1483539640797.JavaMail.www@wwinf1k14>
Message-ID: <0A3C029B-3C76-43E8-B6A5-4EF96093B044@alastairs-place.net>

On 4 Jan 2017, at 14:20, Marcel Schneider <charupdate at orange.fr> wrote:
> As a result of prior discussions, we know that other list participants do use e.g. 
> superscript characters in a more extensive way. 
> 
> I think there are two levels of action: 
> 
> (1) to encode new preformatted characters;
> (2) to encourage re-use of already existing ones. 
> 
> I understand that Unicode is consistently reluctant in both, while ISO/IEC is able 
> to do more in (1) given that they sometimes add (or remove) characters to(/from) 
> the new repertoire, and National Bodies are in a position to do (2) through usage 
> recommendations of their own. Let alone all the other people who may use or not 
> use available preformatted characters for any purpose, eventually sharing the hint 
> and, in the best case, the means to input them efficiently. 
> 
> Or am I missing something?
> 
> Given that the WG of the French standard keyboard is actually interested in getting 
> encoded a new ordinal indicator (kind of '?'), I feel the more urged to stay tuned, 
> and to comment on subsequent e-mails, too.

I can understand the desire to encode the new ordinal indicator.

Perhaps another option worth contemplating might be to standardise some control code points, to provide a mechanism for “plain text” to include the necessary minimum of formatting information without additional markup.  The advantage of this approach is that it would make it explicitly obvious that Unicode wasn’t going to include further super or subscript forms, while providing everyone that wants them with access to a full set of super or subscripts subject to system (or font) support.

A simple form of this might be to encode the new zero-width modifier code points SUBSCRIPT and SUPERSCRIPT that work somewhat like the variation selectors, so e.g.

  U+0032  DIGIT TWO
  U+????  SUPERSCRIPT
  U+0033  DIGIT THREE
  U+????  SUBSCRIPT

would display as ²₃ on fonts that supported the new modifiers.  The advantage of taking this very simplistic approach is that it can be dealt with in the OpenType (or AAT) tables in modern fonts, rather than necessitating changes to rendering code.  It is also obviously not an attempt to replace markup, but will cope with most common “plain text” uses.
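
To make the idea concrete, here is a rough Python sketch of how such a sequence could be interpreted. Everything in it is hypothetical: no SUPERSCRIPT or SUBSCRIPT modifier code points exist, so two Private Use Area code points (U+E000 and U+E001) stand in for them, and the "rendering" simply substitutes the preformatted digits that Unicode already has. A real implementation would live in font tables, as described above.

```python
# Hypothetical stand-ins for the proposed modifiers (Private Use Area).
SUPERSCRIPT = "\uE000"
SUBSCRIPT = "\uE001"

# Existing preformatted digits used to simulate the display.
TO_SUPER = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
TO_SUB = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)


def render(text):
    """Apply each modifier to the single character that precedes it."""
    out = []
    for ch in text:
        if ch == SUPERSCRIPT and out:
            out[-1] = out[-1].translate(TO_SUPER)
        elif ch == SUBSCRIPT and out:
            out[-1] = out[-1].translate(TO_SUB)
        else:
            out.append(ch)
    return "".join(out)


# DIGIT TWO + SUPERSCRIPT, DIGIT THREE + SUBSCRIPT
sequence = "2" + SUPERSCRIPT + "3" + SUBSCRIPT
print(render(sequence))
```

Characters with no entry in the substitution tables pass through unchanged, which roughly mirrors the fallback on a font that lacks support for the modifiers.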

Kind regards,

Alastair.

--
http://alastairs-place.net


From doug at ewellic.org  Wed Jan  4 13:20:14 2017
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 04 Jan 2017 12:20:14 -0700
Subject: Superscript and Subscript Characters in General Use
Message-ID: <20170104122014.665a7a7059d7ee80bb4d670165c8327d.20efb7fc52.wbe@email03.godaddy.com>

Marcel Schneider wrote:

> This is because even complemented with UAXes and TRs, the Core
> Specifications cannot cover the whole practice. It seems that to stay
> inside reasonable limits, a significant number of usage cases have
> been left out, e.g. the mentioned use of plain text for styled custom
> vulgar fractions is a recognized practice, but stays persistently
> excluded from TUS.

I don't understand the relevance to vulgar fractions.

Much of this thread has dealt with Basic Latin characters that have no
superscript or subscript clones, and how their absence prevents certain
passages from being representable in plain text. This is your basic
debate over what constitutes plain text.

As explained in the July 2015 thread about vulgar fractions, TUS
sections 6.2 and 22.3 thoroughly explain the use of U+2044 FRACTION
SLASH with normal "Nd" digits. If I want to write "ninety-nine and
forty-four one-hundredths," with the non-precomposed vulgar fraction, I
can write "99 44⁄100" and be fully compliant with the Standard. This
has nothing to do with what is and isn't plain text.

The fact that many current rendering systems can't render this correctly
is an implementation matter, though a hard-to-fix one. (Note that the
fallback display is perfectly readable and correct, unless you see a box
for U+2009.)
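
As a small illustration of the point, the string in question can be assembled and inspected in Python. The `mixed_fraction` helper is a name invented for this sketch; the separator between the whole number and the fraction is assumed to be U+2009 THIN SPACE, as the note about fallback display suggests.

```python
import unicodedata

THIN_SPACE = "\u2009"
FRACTION_SLASH = "\u2044"


def mixed_fraction(whole, numerator, denominator):
    """Build a mixed number from ordinary digits plus U+2044 FRACTION SLASH."""
    return f"{whole}{THIN_SPACE}{numerator}{FRACTION_SLASH}{denominator}"


text = mixed_fraction(99, 44, 100)
for ch in text:
    # Every digit is an ordinary 'Nd' character; only the slash is special.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

No superscript or subscript digits are involved; whether the result is drawn as a built-up fraction is left to the rendering system.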

The fact that TUS doesn't sanction the use of U+2044 with superscript
and subscript digits, which I imagine Marcel was alluding to, is
irrelevant. TUS is a character encoding standard, not a glyph encoding
standard.

If Marcel is talking about distinguishing between horizontal and
diagonal slashes in vulgar fractions, this is still not a question of
plain text. However, in the emoji era, this type of presentation
variation has become something that Unicode cares about, and so it might
be handled in some way in the future, such as with a variation selector.
I suspect this mechanism has been "excluded from TUS" because it doesn't
yet exist.


--
Doug Ewell | Thornton, CO, US | ewellic.org


From nobody_uses at outlook.com  Wed Jan  4 15:18:32 2017
From: nobody_uses at outlook.com (eduardo marin)
Date: Wed, 4 Jan 2017 21:18:32 +0000
Subject: Soyombo empty letter frame
Message-ID: <MWHPR2001MB10531808B8387D2B0465A17A82F80@MWHPR2001MB1053.namprd20.prod.outlook.com>

The Soyombo proposal is beautiful, but it is missing a very important character in my opinion: http://www.unicode.org/L2/L2015/15004-soyombo.pdf

Encoding an empty letter frame will allow for its proper description in plain text (as is clear from the proposal itself); it could be used as a stylized cursor in text processors; and we could also define ZWJ sequences so that combining it with consonants makes it render only the nucleus.

From charupdate at orange.fr  Wed Jan  4 15:48:29 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 4 Jan 2017 22:48:29 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <0A3C029B-3C76-43E8-B6A5-4EF96093B044@alastairs-place.net>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <106567611.8574.1483539640797.JavaMail.www@wwinf1k14>
 <0A3C029B-3C76-43E8-B6A5-4EF96093B044@alastairs-place.net>
Message-ID: <57352104.17641.1483566509921.JavaMail.www@wwinf1k14>

On Wed, 4 Jan 2017 15:13:36 +0000, Alastair Houghton wrote:
> 
> > Given that the WG of the French standard keyboard is actually interested in getting
> > encoded a new ordinal indicator (kind of '?'), I feel the more urged to stay tuned,
> > and to comment on subsequent e-mails, too.
> 
> I can understand the desire to encode the new ordinal indicator.
> 
> Perhaps another option worth contemplating might be to standardise some control 
> code points, to provide a mechanism for “plain text” to include the necessary 
> minimum of formatting information without additional markup. The advantage of 
> this approach is that it would make it explicitly obvious that Unicode wasn’t 
> going to include further super or subscript forms, while providing everyone that 
> wants them with access to a full set of super or subscripts subject to system 
> (or font) support.
> 
> A simple form of this might be to encode the new zero-width modifier code points 
> SUBSCRIPT and SUPERSCRIPT that work somewhat like the variation selectors, so e.g.
> 
> U+0032 DIGIT TWO
> U+???? SUPERSCRIPT
> U+0033 DIGIT THREE
> U+???? SUBSCRIPT
> 
> would display as ²₃ on fonts that supported the new modifiers. The advantage of 
> taking this very simplistic approach is that it can be dealt with in the OpenType 
> (or AAT) tables in modern fonts, rather than necessitating changes to rendering 
> code. It is also obviously not an attempt to replace markup, but will cope with 
> most common “plain text” uses. 

This would indeed make for stable plain text representations that convey the 
necessary vertical alignment. However, encoding it would imply that the design 
principle of "not attempt[ing] to describe the positioning of a character 
above or below the baseline in typographical layout" is superseded in this 
particular case, which provides a universal mechanism for a basic formatting 
parameter. For consistency, this would call for some extensions catering for 
other formatting parameters. The expense in code points would be very low, the 
scheme would meet user expectations, and the Standard would become even more 
capable and thus even more attractive by enhancing the plain text environment. 
Eventually, the display of text editors, which is currently directed internally 
(for syntax highlighting), would become text-guided. This is not far from 
rich text.

It all leads to the conclusion on which the French demand is based: 
modifier letters that are superscript forms are not real superscripts, and 
they don't fit people's expectations regarding superscripts and abbreviations. 
I already expressed my point of view in this discussion. But the real concern 
could be to emulate the Spanish ordinal indicators, arguing that their being 
a part of Unicode justifies similar facilities for other languages. Here the 
Unicode position is that the Spanish ordinal indicators are backward-compatibility 
code points for round-trip compatibility with ISO/IEC 8859-1. This is clear 
from the Code Charts at U+00AA, U+00BA. There was a deadline, and diligence 
got them in ahead of it. Not to mention that a complete set of ordinal 
indicators for French requires four letters, which probably exceeds what fits 
in the 8-bit charsets shared by several countries. 

As far as the discussion has gone until now, I feel that French must live with 
the existing infrastructure. Hence the idea of re-using four modifier letters 
for that purpose.

If I'm wrong with this idea, that could be good or bad news. Good news if the 
generic SUPERSCRIPT and SUBSCRIPT variant selectors (or alternatively, new 
ordinal indicators) are actually encoded. Bad news if that, as well as 
the re-use of modifier letters, is discarded. In between, I see the 
out-of-the-box modifier letter solution as a kind of second-best choice. Better 
than nothing at all. In certain circumstances, better than markup and formatting.

Kind regards,

Marcel


From richard.wordingham at ntlworld.com  Wed Jan  4 16:12:00 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 4 Jan 2017 22:12:00 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <CAGa7JC0eA=RjMP9OwtUiNe15VBUA-_KbWAuagBBzBApc9FyiNQ@mail.gmail.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
 <CAGa7JC2PrB1WskV-MuXGfShAtPvRP7cGwXu42bGVWwhYMZxwKw@mail.gmail.com>
 <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
 <CAGa7JC0eA=RjMP9OwtUiNe15VBUA-_KbWAuagBBzBApc9FyiNQ@mail.gmail.com>
Message-ID: <20170104221200.2a04ba12@JRWUBU2>

On Wed, 4 Jan 2017 13:43:50 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> Note that in French the right single quote is normally not used at
> all as a quotation mark, and when it appears between two letters it
> is unambiguously an apostrophe. I think the letter apostrophe was
> added later in Unicode only for English to allow distinctions. But
> I've rarely seen it used. Later it was used as a substitute for a
> glottal stop in some Polynesian/Melanesian languages but the actual
> character was encoded and is preferable (its glyph is distinctive).

As consonants, what we have are spacing clones (U+02BC and U+02BE) of
the smooth breathing, usually used for glottal stops, and spacing
clones of the rough breathing (U+02BD and U+02BF).  We also have the
modifier modifications of the IPA letters U+02C0 and U+02C1. These
usages only fit English well when representing the glottalisation (or
even total loss) of /t/ after vowels.  
 
> 2017-01-04 12:44 GMT+01:00 John W Kennedy <john.w.kennedy at gmail.com>:
> 
> > No it isn't. It isn't an apostrophe; it's a left single quote,
> > although some modern printers mistakenly suppose it to be an
> > apostrophe, and substitute one. And it isn't an elision; it's meant
> > as a substitute glyph for a superscript c.

For which I would suggest U+02BF MODIFIER LETTER LEFT HALF RING would
be the best modern representative of the substitute character!

Of course, that would further increase confusion of those who initially
read U+02BF as a superscript 'c', and only later, if ever, realise that
it's actually a rough breathing carefully distinguished from the
similar punctuation marks.

Richard.


From charupdate at orange.fr  Wed Jan  4 17:36:49 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 00:36:49 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170104122014.665a7a7059d7ee80bb4d670165c8327d.20efb7fc52.wbe@email03.godaddy.com>
References: <20170104122014.665a7a7059d7ee80bb4d670165c8327d.20efb7fc52.wbe@email03.godaddy.com>
Message-ID: <1956590416.18185.1483573009321.JavaMail.www@wwinf1k14>

On Wed, 04 Jan 2017 12:20:14 -0700, Doug Ewell wrote:
> 
> Marcel Schneider wrote:
> 
> > This is because even complemented with UAXes and TRs, the Core
> > Specifications cannot cover the whole practice. It seems that to stay
> > inside reasonable limits, a significant number of usage cases have
> > been left out, e.g. the mentioned use of plain text for styled custom
> > vulgar fractions is a recognized practice, but stays persistently
> > excluded from TUS.
> 
> I don't understand the relevance to vulgar fractions.

Vulgar fractions represented using super- and subscript digits around the 
FRACTION SLASH U+2044, which kerns, are one example illustrating superscript 
and subscript characters in general use. It is cited because it is the subject 
of a Microsoft Community wiki that is well referenced on the web:

https://answers.microsoft.com/en-us/msoffice/wiki/msoffice_word-mso_other/styled-fractions-in-windows/4a07d5fa-2484-4e39-b1f3-70bb3eb0c332

Let me recall that when I launched the related 2015 thread, I was unaware of 
this page until close to the end of the thread, when I found and shared the 
link. I say vulgar fractions rather than mathematical fractions because of the 
slant of the fraction slash (though the so-called VULGAR FRACTIONs can be 
displayed with a horizontal bar, as TUS and Doug state below).

> 
> Much of this thread has dealt with Basic Latin characters that have no
> superscript or subscript clones, and how their absence prevents certain
> passages from being representable in plain text. This is your basic
> debate over what constitutes plain text.

There was indeed a concern about what capabilities to recognize in plain text. 
But that had been settled to the extent that Unicode does not support attempts 
to fully represent styled mathematical expressions, but that a set of preformatted 
alphabets should be completed: superscripts lowercase (q) and uppercase, 
subscripts lowercase, and small caps (that take the place of subscript capitals).

Now I'm advocating the recognition of the re-use of existing modifier letters 
instead of new or newly modified superscripts, as well as the demand for ordinal 
indicators in French.

> 
> As explained in the July 2015 thread about vulgar fractions, TUS
> sections 6.2 and 22.3 thoroughly explain the use of U+2044 FRACTION
> SLASH with normal "Nd" digits. If I want to write "ninety-nine and
> forty-four one-hundredths," with the non-precomposed vulgar fraction, I
> can write "99 44⁄100" and be fully compliant with the Standard. This
> has nothing to do with what is and isn't plain text.

This and the spelling with SOLIDUS are referred to as fallback. What I 
complain of as not mentioned in the Standard is that U+2044 can be used 
with superscript and subscript digits, rather than ASCII digits. The kerning 
of the FRACTION SLASH makes it fit for this use case, and in certain high-end 
fonts, especially Arial Unicode MS, the result is fully identical to precomposed 
fractions. This all is plain text. What isn't is the use of U+2044 as a format 
control, as specified in that part of the Standard. High-end software is meant 
to automatically apply fraction styling when U+2044 is detected between digits.

> 
> The fact that many current rendering systems can't render this correctly
> is an implementation matter, though a hard-to-fix one. (Note that the
> fallback display is perfectly readable and correct, unless you see a box
> for U+2009.)

Agreed. Here the use of superscript and subscript digits is not indispensable 
for readability. In this case, their availability constitutes a facility 
for better representation, even in plain text.

> 
> The fact that TUS doesn't sanction the use of U+2044 with superscript
> and subscript digits, which I imagine Marcel was alluding to, is
> irrelevant. TUS is a character encoding standard, not a glyph encoding
> standard.

The distinction between baseline digits and superscript/subscript digits 
is in my opinion not a glyphic issue, since in Unicode they all are available 
as distinct characters.

> 
> If Marcel is talking about distinguishing between horizontal and
> diagonal slashes in vulgar fractions, this is still not a question of
> plain text. However, in the emoji era, this type of presentation
> variation has become something that Unicode cares about, and so it might
> be handled in some way in the future, such as with a variation selector.
> I suspect this mechanism has been "excluded from TUS" because it doesn't
> yet exist.

I'm not talking about this, and I don't miss it in Unicode. Some fonts might 
have horizontal fraction bars. However, such a variation selector could be 
handy.

The plain text custom fractions are IMO a good example of the re-use of 
superscript and subscript characters. Moreover, I thought that the fraction 
slash had been encoded to work with them, until I learned in TUS that this 
was not intended. The 2015 thread brought up that the observed synergy is 
due to an initiative of the font designer(s). The fact that this happened 
in a font that claims conformity to the Standard, seems to me non-trivial.

Marcel


From markus.icu at gmail.com  Wed Jan  4 17:40:15 2017
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 4 Jan 2017 15:40:15 -0800
Subject: IdnaTest.txt and RFC 5893
In-Reply-To: <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>
References: <C5D0A6E1-FC91-4443-A90A-63B7382BA410@alastairs-place.net>
 <fd5b315a-bf21-52a0-7144-c95ea4298307@it.aoyama.ac.jp>
 <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>
Message-ID: <CAN49p6obEGyqEupr3L_j-8=qFEi+7ofErX3rCRznGGCwTHxLYQ@mail.gmail.com>

On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton <
alastair at alastairs-place.net> wrote:

> RFC 5893 seems pretty clear to me, and the problem really is that the test
> vectors (which come from unicode.org) seem (to me) to be incorrect.


https://tools.ietf.org/html/rfc5893#section-2 says "*The following rule*,
consisting of six conditions, *applies to labels* in Bidi domain names."

That's what the ICU code does -- applying the rule to each label -- and I
assume that's the basis for the test data.

The latter part of this RFC section says that *if* certain conditions are
met *for all labels, then* the domain name as a whole displays well.

ICU does not currently check for multi-label bidi combinations.

FYI the ICU checkLabelBiDi() code is currently here
<http://bugs.icu-project.org/trac/browser/trunk/icu4j/main/classes/core/src/com/ibm/icu/impl/UTS46.java#L541>
(Java version).

markus

From doug at ewellic.org  Wed Jan  4 18:33:06 2017
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 04 Jan 2017 17:33:06 -0700
Subject: Superscript and Subscript Characters in General Use
Message-ID: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>

Marcel Schneider wrote:

>> I don't understand the relevance to vulgar fractions.
>
> Vulgar fractions represented using super- and subscript digits around
> the FRACTION SLASH U+2044

Don't do that.

The fact that someone, even a Microsoft MVP, posted an article about
this glyph hack does not make it a good idea. It's kind of like making a
grinning frog or caterpillar out of Telugu letters.

> What I complain of as not mentioned in the Standard, is that U+2044
> can be used with superscript and subscript digits, rather than ASCII
> digits.

Almost any character(s) in Unicode "can be" used with almost any other.
You can surround U+2044 with emoji if you like. That doesn't mean you
should.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From mark at kli.org  Wed Jan  4 19:54:17 2017
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 4 Jan 2017 20:54:17 -0500
Subject: Soyombo empty letter frame
In-Reply-To: <MWHPR2001MB10531808B8387D2B0465A17A82F80@MWHPR2001MB1053.namprd20.prod.outlook.com>
References: <MWHPR2001MB10531808B8387D2B0465A17A82F80@MWHPR2001MB1053.namprd20.prod.outlook.com>
Message-ID: <c3d7ed6c-04da-043d-aa9d-8c6f6e92ad0d@kli.org>

On 01/04/2017 04:18 PM, eduardo marin wrote:
>
> The Soyombo proposal is beautiful, but it is missing a very important 
> character in my opinion: 
> http://www.unicode.org/L2/L2015/15004-soyombo.pdf
>
> Encoding an empty letter frame will allow for its proper description 
> in plain text (as it is clear in the proposal itself), it could be 
> used as a stylized cursor in text processors and also we could make 
> zwj sequences such that combining with consonants makes it only render 
> the nucleus.
>

According to the proposal:

    In the proposed encoding a combination of frame and nucleus is
    considered an atomic letter.... This approach enhances the
    conceptualization and identification of letters in the script; for
    instance, the letter "ka" refers inherently to the fully-formed (X)
    and not to the nucleus (X).

In other words, they are explicitly rejecting the model considering the 
"frame" as an item in its own right.  I realize that you are not calling 
for redefining all the letters in terms of frame+nucleus, but encoding 
the frame seems to be something the proposers deliberately decided 
against doing.  In calling for encoding the frame (and why just one 
frame?  Wouldn't you want both the "closed" and "open" ones?), I think 
you really are going against what seems to be a design principle of the 
proposers.  Which of course you are completely entitled to do: just that 
you probably are better off talking it over with the proposers directly, 
to learn their thinking and so they can learn yours.

~mark

From pandey at umich.edu  Wed Jan  4 20:31:11 2017
From: pandey at umich.edu (Anshuman Pandey)
Date: Wed, 4 Jan 2017 21:31:11 -0500
Subject: Soyombo empty letter frame
In-Reply-To: <c3d7ed6c-04da-043d-aa9d-8c6f6e92ad0d@kli.org>
References: <MWHPR2001MB10531808B8387D2B0465A17A82F80@MWHPR2001MB1053.namprd20.prod.outlook.com>
 <c3d7ed6c-04da-043d-aa9d-8c6f6e92ad0d@kli.org>
Message-ID: <8D4E2FA2-3FCC-4B0B-AE38-5F20EC6A3DAE@umich.edu>


> On Jan 4, 2017, at 8:54 PM, Mark E. Shoulson <mark at kli.org> wrote:
> 
>> On 01/04/2017 04:18 PM, eduardo marin wrote:
>> The Soyombo proposal is beautiful, but it is missing a very important character in my opinion: http://www.unicode.org/L2/L2015/15004-soyombo.pdf
>> 
>> Encoding an empty letter frame will allow for its proper description in plain text (as it is clear in the proposal itself), it could be used as a stylized cursor in text processors and also we could make zwj sequences such that combining with consonants makes it only render the nucleus.
> 
> According to the proposal:
> 
> In the proposed encoding a combination of frame and nucleus is considered an atomic letter.... This approach enhances the conceptualization and identification of letters in the script; for instance, the letter "ka" refers inherently to the fully-formed (X) and not to the nucleus (X).
> In other words, they are explicitly rejecting the model considering the "frame" as an item in its own right.  I realize that you are not calling for redefining all the letters in terms of frame+nucleus, but encoding the frame seems to be something the proposers deliberately decided against doing.  In calling for encoding the frame (and why just one frame?  Wouldn't you want both the "closed" and "open" ones?), I think you really are going against what seems to be a design principle of the proposers.  Which of course you are completely entitled to do: just that you probably are better off talking it over with the proposers directly, to learn their thinking and so they can learn yours.
> 
> ~mark

As the author of the Soyombo proposal, I should like to say that I did indeed consider proposing the two frames for encoding as "pedagogical" characters. I did not mention the possibility of such in the proposal, but the present discussion persuades me to reinvestigate the issue. I'd be happy to hear the opinion of others.

All the best,
Anshuman

From charupdate at orange.fr  Wed Jan  4 23:56:58 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 06:56:58 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
Message-ID: <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>

On Wed, 04 Jan 2017 12:20:14 -0700, Doug Ewell wrote:
> 
> Marcel Schneider wrote:
> 
> >> I don't understand the relevance to vulgar fractions.
> >
> > Vulgar fractions represented using super- and subscript digits around
> > the FRACTION SLASH U+2044
> 
> Don't do that.
> 
> The fact that someone, even a Microsoft MVP, posted an article about
> this glyph hack does not make it a good idea.

I found it a good idea long before I found and read the article.[1] It is very 
coherent, and seemed to me the best way to make sense of the fraction slash in 
a character encoding standard that does things seriously. Since I've read the 
article, I'm glad that a Microsoft MVP worked out solutions to help people who 
have incomplete keyboard layouts. Several readers were so kind as to comment on 
the usefulness of the article and the shared data.

> It's kind of like making a
> grinning frog or caterpillar out of Telugu letters.

I don't think that Telugu art and ASCII art can be compared to writing 
numbers with fractions made of superscripts and subscripts. Perhaps there is 
a difference between Telugu art and ASCII art in that ASCII is more common, 
but the availability of super-/subscript Western Arabic digits should not be 
compared to the availability of a rather uncommon script. 

> 
> > What I complain of as not mentioned in the Standard, is that U+2044
> > can be used with superscript and subscript digits, rather than ASCII
> > digits.
> 
> Almost any character(s) in Unicode "can be" used with almost any other.
> You can surround U+2044 with emoji if you like. That doesn't mean you
> should.

Not to represent vulgar fractions in a legible way. Superscript and subscript 
digits are particular in that they have compatibility mappings to ASCII digits, 
so that they are not only human readable, but machine readable. See TUS §22.4 [2].
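
The machine readability mentioned above can be illustrated with a minimal 
sketch, assuming Python's standard unicodedata module and a hypothetical sample 
string (superscript 44, FRACTION SLASH, subscript 100). It shows that the 
superscript and subscript digits are not gc=Nd, yet fold to ASCII digits under 
NFKC, while U+2044 itself is left untouched. It is an illustration only, not 
something the Standard prescribes.

```python
import unicodedata

# Hypothetical sample: SUPERSCRIPT FOUR x2, FRACTION SLASH, SUBSCRIPT ONE ZERO ZERO.
styled = "\u2074\u2074\u2044\u2081\u2080\u2080"


def describe(text):
    """Return (character name, General_Category) for each character."""
    return [(unicodedata.name(ch), unicodedata.category(ch)) for ch in text]


def fold_to_ascii_digits(text):
    """Apply NFKC, which replaces compatibility super-/subscript digits
    with their ASCII counterparts via their compatibility mappings."""
    return unicodedata.normalize("NFKC", text)


folded = fold_to_ascii_digits(styled)

# U+2044 has no decomposition, so it survives NFKC and can serve as separator.
numerator, denominator = folded.split("\u2044")

for name, category in describe(styled):
    print(category, name)
print(int(numerator), int(denominator))
```

Note that the folding is lossy: after NFKC the text no longer records that the 
digits were superscripts or subscripts.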

As for "readability for the human reader" (NamesList, header), vulgar fractions 
represented using superscripts-FRACTION SLASH-subscripts also have the advantage 
of being stable across environments, unless some characters are not supported, 
in which case they can be parsed and replaced with formatted ASCII-based fractions, 
e.g. before the text is pasted into an ANSI-encoded form (that replaces with '?').
And they meet user expectations. Preformatted fractions are so much in demand that 
the most frequent of them were encoded in early standards and included in national 
keyboard layouts. They entered Unicode for round-trip compatibility [3]. That means 
this is not the specific Unicode way of representing fractions, obviously because 
of the limited number of those fractions. Now, the common denominator of 
the Unicode scheme and the user expectations is to represent vulgar fractions using 
preformatted super-/subscripts along with the (accurately kerning) FRACTION SLASH.
That (again) is why this has been implemented in fonts like Arial Unicode MS.

The stability of this representation scheme prevents content corruption (see 
the counter-examples in TUS below, where the PDF tool used arbitrary characters 
mapped to special fonts; though that is another, already discussed, issue [3]).

I suggest that the specification of the fraction slash in TUS [4] be updated. 
It has remained roughly unchanged since version 2.0 (the other one that I've checked). 
First, U+2044 should be used where applicable (currently there is still U+002F).
There should be *two* "standard form[s] of a fraction built using the fraction 
slash". Further, we read that "the displaying software is [...] mapping the fraction 
to a unit". Does that mean that the preformatted fraction is substituted if 
available? Or should it read "_formatting_ the fraction _as_ a unit"?
I note, too, that typically the software waits for the digit-slash-digit sequence 
to be selected and fraction formatting to be applied on request, so this 
could eventually be mentioned, given that the fraction slash is even more 
uncommon on keyboards than the complete range of super- and subscript digits.
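
To make the current definition concrete, here is a minimal sketch of a checker 
for the single "standard form" quoted in [4]: one or more gc=Nd digits, U+2044, 
one or more gc=Nd digits. The function names are hypothetical, and the sketch 
assumes Python's unicodedata module; it only illustrates why a fraction written 
with super- and subscript digits (gc=No) falls outside that definition today.

```python
import unicodedata

FRACTION_SLASH = "\u2044"


def is_decimal_digit(ch):
    """True if the character has General_Category Nd."""
    return unicodedata.category(ch) == "Nd"


def is_standard_fraction(text):
    """Check the form 'Nd+ U+2044 Nd+' described in TUS section 6.2."""
    numerator, separator, denominator = text.partition(FRACTION_SLASH)
    return (
        separator == FRACTION_SLASH
        and len(numerator) > 0
        and all(is_decimal_digit(ch) for ch in numerator)
        and len(denominator) > 0
        and all(is_decimal_digit(ch) for ch in denominator)
    )


print(is_standard_fraction("3\u20444"))            # ASCII digits around U+2044
print(is_standard_fraction("3/4"))                 # SOLIDUS instead of U+2044
print(is_standard_fraction("\u00b3\u2044\u2084"))  # superscript 3, subscript 4
```

Only the first call matches; the SOLIDUS spelling and the superscript/subscript 
spelling are both rejected by this reading of the definition.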

Regards,
Marcel

[1] Styled Fractions in Windows, Created by Jeeped, July 18, 2013, MVP, Wiki Author:
https://answers.microsoft.com/en-us/msoffice/wiki/msoffice_word-mso_other/styled-fractions-in-windows/4a07d5fa-2484-4e39-b1f3-70bb3eb0c332

[2] TUS 9.0, §22.4, p. 786:
| 
| Parsing of Superscript and Subscript Digits. In the Unicode Character Database, superscript
| and subscript digits have not been given the General_Category property value
| Decimal_Number (gc=Nd), so as to prevent expressions like 2³ from being interpreted like
| 23 by simplistic parsers. This should not be construed as preventing more sophisticated
| numeric parsers, such as general mathematical expression parsers, from correctly identifying
| these compatibility superscript and subscript characters as digits and interpreting them
| appropriately. See also the discussion of digits in Section 22.3, Numerals.
| 
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G46374

[3] TUS 9.0, §22.3, p. 784:
| 
| Fractions
| 
| The Number Forms block (U+2150..U+218F) contains a series of vulgar fraction characters,
| encoded for compatibility with legacy character encoding standards. These characters
| are intended to represent both of the common forms of vulgar fractions: forms with a
| right-slanted division slash, such as G, as shown in the code charts, and forms with a horizontal
| division line, such as H, which are considered to be alternative glyphs for the same
| fractions, as shown in Figure 22-8. A few other vulgar fraction characters are located in the
| Latin-1 block in the range U+00BC..U+00BE.
| 
| Figure 22-8. Alternate Forms of Vulgar Fractions
| 
| G H
| 
| The unusual fraction character, U+2189 vulgar fraction zero thirds, [...]
| 
| The vulgar fraction characters are given compatibility decompositions using U+2044 ?/?
| fraction slash. Use of the fraction slash is the more generic way to represent fractions in
| text; it can be used to construct fractional number forms that are not included in the collections
| of vulgar fraction characters. For more information on the fraction slash, see "Other
| Punctuation" in Section 6.2, General Punctuation.
| 
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G46039


[4] TUS 9.0, §6.2, p. 277:
| 
| Fraction Slash. U+2044 fraction slash is used between digits to form numeric fractions,
| such as 2/3 and 3/9. The standard form of a fraction built using the fraction slash is defined
| as follows: any sequence of one or more decimal digits (General Category = Nd), followed
| by the fraction slash, followed by any sequence of one or more decimal digits. Such a fraction
| should be displayed as a unit, such as ? or !. The precise choice of display can depend
| on additional formatting information.
| 
| If the displaying software is incapable of mapping the fraction to a unit, then it can also be
| displayed as a simple linear sequence as a fallback (for example, 3/4). If the fraction is to be
| separated from a previous number, then a space can be used, choosing the appropriate
| width (normal, thin, zero width, and so on). For example, 1 + thin space + 3 + fraction
| slash + 4 is displayed as 1?.
| 
http://www.unicode.org/versions/Unicode9.0.0/ch06.pdf#G2000


From moyogo at gmail.com  Thu Jan  5 01:22:39 2017
From: moyogo at gmail.com (Denis Jacquerye)
Date: Thu, 05 Jan 2017 07:22:39 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
Message-ID: <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>

On Thu, 5 Jan 2017 at 06:03 Marcel Schneider <charupdate at orange.fr> wrote:

> On Wed, 04 Jan 2017 12:20:14 -0700, Doug Ewell wrote:
> >
> > Marcel Schneider wrote:
> >
> > >> I don't understand the relevance to vulgar fractions.
> > >
> > > Vulgar fractions represented using super- and subscript digits around
> > > the FRACTION SLASH U+2044
> >
> > Don't do that.
> >
> > The fact that someone, even a Microsoft MVP, posted an article about
> > this glyph hack does not make it a good idea.
>
> I found it a good idea long before I found and read the article.
>
>
It is not such a good idea, if at all. Superscript and subscript are not
the same thing as denominator and numerators. Many fonts make the
difference and ½ or 1⁄2 or 1/2 will not look like ¹/₂ or ¹⁄₂ in many
cases.

From alastair at alastairs-place.net  Thu Jan  5 03:46:53 2017
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Thu, 5 Jan 2017 09:46:53 +0000
Subject: IdnaTest.txt and RFC 5893
In-Reply-To: <CAN49p6obEGyqEupr3L_j-8=qFEi+7ofErX3rCRznGGCwTHxLYQ@mail.gmail.com>
References: <C5D0A6E1-FC91-4443-A90A-63B7382BA410@alastairs-place.net>
 <fd5b315a-bf21-52a0-7144-c95ea4298307@it.aoyama.ac.jp>
 <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>
 <CAN49p6obEGyqEupr3L_j-8=qFEi+7ofErX3rCRznGGCwTHxLYQ@mail.gmail.com>
Message-ID: <DF9AA094-277E-4E65-A2B9-CFFBF2AB1A8D@alastairs-place.net>

On 4 Jan 2017, at 23:40, Markus Scherer <markus.icu at gmail.com> wrote:
> 
> On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton <alastair at alastairs-place.net> wrote:
> RFC 5893 seems pretty clear to me, and the problem really is that the test vectors (which come from unicode.org) seem (to me) to be incorrect.
> 
> https://tools.ietf.org/html/rfc5893#section-2 says "The following rule, consisting of six conditions, applies to labels in Bidi domain names."
> 
> That's what the ICU code does -- applying the rule to each label -- and I assume that's the basis for the test data.

Absolutely.  But the crucial part is "in Bidi domain names".  That is, it applies to *all* labels that are part of a Bidi domain name, not just RTL labels.  It did not say "applies to RTL labels in Bidi domain names" and in fact even explicitly states that (in the first bullet point at the end of section 2):

  ...Note that even LTR labels and pure ASCII labels have to be tested.

Not to mention the fact that parts 5 and 6 of the rule apply specifically to LTR labels.

So it's quite clear that given the domain name "0?.?", both "?" *and* "0?" need to be checked using the Bidi Rule.  Unless someone can explain why "0?" does not fail the test, surely we all agree that line 74 is incorrect:

> B;	0?.\u05D0;	;	xn--0-sfa.xn--4db	#	0?.?

and similarly with line 93:

> B;	??.\u05D0;	;	xn--0ca88g.xn--4db	#	??.?

> ICU does not currently check for multi-label bidi combinations.

I was a bit puzzled by this, because the code clearly does (both in the C++ and Java versions) and yet the online demo doesn't appear to object to the above test cases.  So I wrote a quick test program against the C++ version of ICU 58.2 and fed it both test cases, and, sure enough, ICU agrees that there is a BiDi error in both of the above cases.
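
For readers who want to follow the argument, the six conditions of the RFC 5893 Bidi Rule, applied to every label of a Bidi domain name, can be sketched as below. This is a rough illustration using Python's unicodedata module, not the ICU code and not the test program mentioned above. The function names are hypothetical, the labels are assumed to be already decoded to Unicode, and "0a" is a stand-in label chosen only because it starts with a digit (Bidi class EN) like the first label in the test line.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def passes_bidi_rule(label):
    """Apply the six conditions of RFC 5893 section 2 to a single label."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False
    # Condition 1: the first character must be L (LTR label) or R/AL (RTL label).
    first = classes[0]
    if first in ("R", "AL"):
        is_rtl = True
    elif first == "L":
        is_rtl = False
    else:
        return False
    # Find the last character that is not a trailing NSM.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]
    if is_rtl:
        # Conditions 2, 3 and 4.
        if not set(classes) <= RTL_ALLOWED:
            return False
        if last not in ("R", "AL", "EN", "AN"):
            return False
        return not ("EN" in classes and "AN" in classes)
    # Conditions 5 and 6.
    return set(classes) <= LTR_ALLOWED and last in ("L", "EN")


def is_bidi_domain(labels):
    """A Bidi domain name contains at least one character of class R, AL or AN."""
    return any(
        unicodedata.bidirectional(ch) in ("R", "AL", "AN")
        for label in labels
        for ch in label
    )


def domain_passes(name):
    """Check every label, including LTR ones, when the name is a Bidi domain name."""
    labels = name.split(".")
    if not is_bidi_domain(labels):
        return True
    return all(passes_bidi_rule(label) for label in labels)


print(passes_bidi_rule("\u05d0"))    # single Hebrew letter
print(passes_bidi_rule("0a"))        # starts with EN, so condition 1 fails
print(domain_passes("0a.\u05d0"))    # Bidi domain name with a failing LTR label
```

Under this reading, a label that begins with a digit fails condition 1 as soon as any other label makes the name a Bidi domain name, which is the point being made about lines 74 and 93.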

Kind regards,

Alastair.

--
http://alastairs-place.net


From charupdate at orange.fr  Thu Jan  5 05:33:49 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 12:33:49 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
Message-ID: <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>

On Thu, 05 Jan 2017 07:22:39 +0000, Denis Jacquerye wrote:
> 
> On Thu, 5 Jan 2017 at 06:03 Marcel Schneider  wrote:
> 
> > On Wed, 04 Jan 2017 12:20:14 -0700, Doug Ewell wrote:
> > >
> > > Marcel Schneider wrote:
> > >
> > > >> I don't understand the relevance to vulgar fractions.
> > > >
> > > > Vulgar fractions represented using super- and subscript digits around
> > > > the FRACTION SLASH U+2044
> > >
> > > Don't do that.
> > >
> > > The fact that someone, even a Microsoft MVP, posted an article about
> > > this glyph hack does not make it a good idea.
> >
> > I found it a good idea long before I found and read the article.
> >
> >
> It is not such a good idea, if at all. Superscript and subscript are not
> the same thing as denominator and numerators. Many fonts make the
> difference and ½ or 1⁄2 or 1/2 will not look like ¹/₂ or ¹⁄₂ in many
> cases. 

Indeed I remember that conclusion from the 2015 thread. If the fraction 
formatting facility is available, it should be used. If it isn?t, I?d suggest 
not to leave the ASCII fallbacks, but to use super- and subscripts instead. 
This still seems an overall second-best solution, that may turn into best solution 
depending on the font used. If Arial Unicode MS is used (though it is no longer 
a part of new Windows versions), it really looks exactly like preformatted 
fractions in the same font. But I can understand that denominators are meant 
to align on the baseline, while subscripts are often set slightly below. 

Though sometimes suboptimal, "styled" plain text custom vulgar fractions still 
offer far better readability than their plain ASCII fallbacks. To be consistent, 
fractions could be represented this way throughout a given document, avoiding 
a mix of preformatted fractions with precomposed fractions. 

Marcel


From charupdate at orange.fr  Thu Jan  5 05:38:58 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 12:38:58 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170104221200.2a04ba12@JRWUBU2>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <20161229014759.5a51c747@JRWUBU2>
 <c0152dd7-a548-3a10-0852-ef3845bc3a02@ix.netcom.com>
 <1339076265.17001.1483057435430.JavaMail.www@wwinf1p12>
 <20161230123727.52a00633@JRWUBU2>
 <5CF85CCB-4321-4620-BE57-C7914C84BCC7@crissov.de>
 <551789226.12010.1483218242564.JavaMail.www@wwinf1p10>
 <F3F32D4E-7603-451C-A478-CF3FD6DE1720@crissov.de>
 <842947588.29513.1483489492387.JavaMail.www@wwinf1p19>
 <cb341f3a-f65e-a4d3-dbb0-5167b61e5d51@ix.netcom.com>
 <BB23C452-D03B-441F-B7E7-46267FE358DE@gmail.com>
 <CAGa7JC2PrB1WskV-MuXGfShAtPvRP7cGwXu42bGVWwhYMZxwKw@mail.gmail.com>
 <3ADB5847-528D-45B0-A963-F0CACC7A69E9@gmail.com>
 <CAGa7JC0eA=RjMP9OwtUiNe15VBUA-_KbWAuagBBzBApc9FyiNQ@mail.gmail.com>
 <20170104221200.2a04ba12@JRWUBU2>
Message-ID: <1043493870.5967.1483616338879.JavaMail.www@wwinf1k39>

On Wed, 4 Jan 2017 00:36:38 -0500, John W Kennedy wrote to Asmus Freytag:

> As long as this is being discussed, what about the historic practice of using 
> M? (nowadays often seen as M? instead) in Scottish names?e.g., M?Donald?as a 
> typographic substitute for M(superscript c)? 

My first thought on reading this was that it adds to the examples of character re-use 
for lack of appropriate characters on the keyboard (or in the typecase, as you 
explained later).

On Tue, 3 Jan 2017 22:48:09 -0800, Asmus Freytag (c) replied:

> What about it? There are dozens, perhaps hundreds of fallbacks that have
> been used over time, both in hot metal typography as well as with
> typewriters or digital systems. Some practices may have started in ways
> similar to a fallback, but have now evolved into standard practice.
> Other ones remain fallbacks or went out of fashion.
> 
> It's an interesting example, but what kind of discussion did you have in
> mind? 

This points to the principle that user choices may supersede standard preferences. 
So I gather that the accurate concept is FALLBACK. It probably extends to 
the rule that any character may be re-used as a fallback for any other character 
(unformatted or formatted) if this meets user expectations and preferences. 

Consequently, the entire ranges of modifier letters, punctuation, symbols and other 
characters can be used as fallbacks to write superscript abbreviations. Some of 
them are obviously more appropriate than others: MODIFIER LETTER SMALL C would 
better fit this use case but was unavailable somewhere (typecase, charset, keyboard) 
and thus was not retained, while the second-best (single open-quote, or MODIFIER 
LETTER TURNED COMMA, as Denis Jacquerye suggests) was used.

So we have the confirmation that it's up to the users and their keyboard layout 
providers and font designers to choose the best fitting fallbacks among the 
existing Unicode characters. For lack of anything better, MODIFIER LETTER SMALL E 
is the designated fallback candidate for the hypothetical/on-coming/up-coming 
French ordinal indicator _kind-of-'?'_, and the other three ordinal indicator 
fallbacks '?', '?' and '?' are also readily available. 

Citing another example: To represent the French abbreviation of "number", 
MODIFIER LETTER SMALL O would be a better fit than the widely used DEGREE SIGN, which 
is the only one of the two available on the current keyboard, while the 
RING ABOVE would be too small, MASCULINE ORDINAL INDICATOR (sometimes used as 
another fallback by people who have it on the keyboard) often has an underline 
that is disfavored in French, and SUPERSCRIPT ZERO is somewhat too big:
                 nᵒ - n° - n˚ - nº - n⁰
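
The candidates compared above can be listed side by side with a few lines of 
Python. This is only an illustrative sketch: it looks the characters up by the 
names used in this message via the standard unicodedata module, and the helper 
name abbreviation_candidates is made up for the example.

```python
import unicodedata

# Character names exactly as cited in the message above.
CANDIDATE_NAMES = [
    "MODIFIER LETTER SMALL O",
    "DEGREE SIGN",
    "RING ABOVE",
    "MASCULINE ORDINAL INDICATOR",
    "SUPERSCRIPT ZERO",
]


def abbreviation_candidates(base="n"):
    """Return (code point, name, sample) for each fallback candidate."""
    rows = []
    for name in CANDIDATE_NAMES:
        ch = unicodedata.lookup(name)  # raises KeyError for an unknown name
        rows.append((f"U+{ord(ch):04X}", name, base + ch))
    return rows


if __name__ == "__main__":
    for code_point, name, sample in abbreviation_candidates():
        print(f"{code_point}  {sample}  {name}")
```

How each sample actually looks still depends entirely on the font, which is 
the point made above.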

The fallback scheme applies to custom vulgar fractions as well: their representation 
with super-/subscript digits seemingly has the status of a fallback, as long as 
it is not yet recognized as an alternate standard representation. I mean that 
officially, it is likely to be considered a fallback, while in practice, it has 
already become a working solution.

Further, on Wed, 4 Jan 2017 22:12:00 +0000, Richard Wordingham wrote:
> 
> > 2017-01-04 12:44 GMT+01:00 John W Kennedy :
> >
> > > No it isn't. It isn't an apostrophe; it's a left single quote,
> > > although some modern printers mistakenly suppose it to be an
> > > apostrophe, and substitute one. And it isn't an elision; it's meant
> > > as a substitute glyph for a superscript c.
> 
> For which I would suggest U+02BF MODIFIER LETTER LEFT HALF RING would
> be the best modern representative of the substitute character!

While I'd thought a while about the left half ring (having it on the keyboard, in 
group 3), when trying it I found it too tiny: M?Donald - M?Donald. 
Why a representative of a substitute? Probably because MODIFIER LETTER SMALL C is 
already used as a substitute of superscript small C, despite the Standard 
specifying that the modifier letters are not [intended as] a substitute for this. 
So the left half ring might be considered the best representative, while the best 
modern *solution* for a substitute would really be the modifier letter:
???????????????M?Donald - M?Donald ( - M?Donald).

> 
> Of course, that would further increase confusion of those who initially
> read U+02BF as a superscript 'c', and only later, if ever, realise that
> it's actually a rough breathing carefully distinguished from the
> similar punctuation marks. 

Indeed it would be a pity to stick with alternatives and worst case fallbacks 
if a better solution is readily available. Among MODIFIER LETTERs, TURNED COMMA 
is already a "typographical alternative for" REVERSED COMMA and LEFT HALF RING, 
so it could seem consistent that SMALL C be a typographical alternative for 
superscript small c, knowing that (probably) the only thing that matters about 
a fallback is whether it evolves into standard practice, remains a fallback, 
or goes out of fashion. E.g., the DEGREE SIGN had evolved into standard practice 
as a (representative of the) substitute for superscript small o, but could perhaps 
go out of fashion when a comprehensive set of MODIFIER SMALL letters can be easily 
accessed on standard keyboards, in the best case completed with automatic sequences 
for 'n?' and 'N?', that have the advantage over the degree sign that they can easily 
be complemented with a plural s: 'n??' and 'N??'.

Marcel


From asmusf at ix.netcom.com  Thu Jan  5 05:55:06 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 5 Jan 2017 03:55:06 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
Message-ID: <52a425a1-47b7-f63d-8526-c2799897bccf@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170105/6d0bab1c/attachment.html>

From asmusf at ix.netcom.com  Thu Jan  5 05:56:15 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 5 Jan 2017 03:56:15 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
Message-ID: <95570a60-20f5-6f7e-ccce-788f7b7741c5@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170105/5ee11155/attachment.html>

From mark at macchiato.com  Thu Jan  5 09:55:47 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 5 Jan 2017 16:55:47 +0100
Subject: IdnaTest.txt and RFC 5893
In-Reply-To: <DF9AA094-277E-4E65-A2B9-CFFBF2AB1A8D@alastairs-place.net>
References: <C5D0A6E1-FC91-4443-A90A-63B7382BA410@alastairs-place.net>
 <fd5b315a-bf21-52a0-7144-c95ea4298307@it.aoyama.ac.jp>
 <44B684E5-3EC7-43DE-8BFE-19935FEC8946@alastairs-place.net>
 <CAN49p6obEGyqEupr3L_j-8=qFEi+7ofErX3rCRznGGCwTHxLYQ@mail.gmail.com>
 <DF9AA094-277E-4E65-A2B9-CFFBF2AB1A8D@alastairs-place.net>
Message-ID: <CAJ2xs_EJqwOOLuiShfu6pYt0rh7Miz5c5dgErZLtEZqCPGm2vw@mail.gmail.com>

Alastair, thanks for finding it and bringing it up. I think you're right
that the problem is in that the test generation code doesn't properly apply
the bidi criteria to *all* the labels if *any* of the labels are RTL, but
instead is probably just going on a label-by-label basis. Thankfully, it
looks like ICU does handle it right, by your note. (The test file
generation doesn't use the ICU code.)
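
The difference between a label-by-label check and a whole-domain check can be 
sketched in Python. This is a minimal illustration of the six conditions of 
the RFC 5893 Bidi rule, assuming the labels are already in Unicode form; it is 
neither the ICU implementation nor the test generation code, and all function 
names are hypothetical. The label "0" followed by U+00E0 in the usage note is 
only a stand-in for an LTR label that starts with a digit.

```python
import unicodedata

RTL_LABEL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_LABEL_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def bidi_classes(label):
    """Bidi_Class of every character in the label."""
    return [unicodedata.bidirectional(ch) for ch in label]


def is_bidi_domain_name(labels):
    """A Bidi domain name has at least one label containing R, AL or AN."""
    return any(c in ("R", "AL", "AN")
               for label in labels for c in bidi_classes(label))


def label_satisfies_bidi_rule(label):
    classes = bidi_classes(label)
    if not classes:
        return False
    # Conditions 3 and 6 look at the last character before trailing NSMs.
    trimmed = list(classes)
    while trimmed and trimmed[-1] == "NSM":
        trimmed.pop()
    last = trimmed[-1] if trimmed else None
    first = classes[0]
    if first in ("R", "AL"):  # RTL label: conditions 2, 3 and 4
        return (set(classes) <= RTL_LABEL_ALLOWED
                and last in ("R", "AL", "EN", "AN")
                and not ("EN" in classes and "AN" in classes))
    if first == "L":  # LTR label: conditions 5 and 6
        return set(classes) <= LTR_LABEL_ALLOWED and last in ("L", "EN")
    return False  # condition 1: first character must be L, R or AL


def domain_satisfies_bidi_rule(domain):
    """Apply the rule to *all* labels as soon as *any* label is RTL."""
    labels = [label for label in domain.split(".") if label]
    if not is_bidi_domain_name(labels):
        return True  # the Bidi rule only applies to Bidi domain names
    return all(label_satisfies_bidi_rule(label) for label in labels)
```

With this sketch, domain_satisfies_bidi_rule("0\u00e0.\u05d0") is False, 
because the first label starts with a European number, while the same label 
in a domain without any RTL label is not subject to the rule at all.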

Could you please report this via http://www.unicode.org/reporting.html so
that we make sure that it is tracked and brought up to the UTC?

Mark



On Thu, Jan 5, 2017 at 10:46 AM, Alastair Houghton <
alastair at alastairs-place.net> wrote:

> On 4 Jan 2017, at 23:40, Markus Scherer <markus.icu at gmail.com> wrote:
> >
> > On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton <
> alastair at alastairs-place.net> wrote:
> > RFC 5893 seems pretty clear to me, and the problem really is that the
> test vectors (which come from unicode.org) seem (to me) to be incorrect.
> >
> > https://tools.ietf.org/html/rfc5893#section-2 says "The following rule,
> consisting of six conditions, applies to labels in Bidi domain names."
> >
> > That's what the ICU code does -- applying the rule to each label -- and
> I assume that's the basis for the test data.
>
> Absolutely.  But the crucial part is "in Bidi domain names".  That is, it
> applies to *all* labels that are part of a Bidi domain name, not just RTL
> labels.  It did not say "applies to RTL labels in Bidi domain names" and in
> fact even explicitly states that (in the first bullet point at the end of
> section 2):
>
>   ...Note that even LTR labels and pure ASCII labels have to be tested.
>
> Not to mention the fact that parts 5 and 6 of the rule apply specifically
> to LTR labels.
>
> So it?s quite clear that given the domain name ?0?.??, both ??? *and* ?0??
> need to be checked using the Bidi Rule.  Unless someone can explain why
> ?0?? does not fail the test, surely we all agree that line 74 is incorrect:
>
> > B;    0?.\u05D0;      ;       xn--0-sfa.xn--4db       #       0?.?
>
> and similarly with line 93:
>
> > B;    ??.\u05D0;      ;       xn--0ca88g.xn--4db      #       ??.?
>
> > ICU does not currently check for multi-label bidi combinations.
>
> I was a bit puzzled by this, because the code clearly does (both in the
> C++ and Java versions) and yet the online demo doesn't appear to object to
> the above test cases.  So I wrote a quick test program against the C++
> version of ICU 58.2 and fed it both test cases, and, sure enough, ICU
> agrees that there is a BiDi error in both of the above cases.
>
> Kind regards,
>
> Alastair.
>
> --
> http://alastairs-place.net
>
>
>

From charupdate at orange.fr  Thu Jan  5 12:43:32 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 5 Jan 2017 19:43:32 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <95570a60-20f5-6f7e-ccce-788f7b7741c5@ix.netcom.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <95570a60-20f5-6f7e-ccce-788f7b7741c5@ix.netcom.com>
Message-ID: <1818698400.22051.1483641812457.JavaMail.www@wwinf1p21>

On Thu, 5 Jan 2017 03:56:15 -0800, Asmus Freytag wrote:
> 
> On 1/5/2017 3:33 AM, Marcel Schneider wrote:
> >
> > If Arial Unicode MS is used (though it is no longer 
> > a part of new Windows versions), it really looks exactly like preformatted 
> > fractions in the same font. But I can understand that denominators are meant 
> > to align on the baseline, while subscripts are often set slightly below. 
> 
> That's just the kind of issue that you will run into with undisciplined hacks.
> 
> Just... don't.

So that cannot be recommended for general use, even outside of publishing software. 
The remaining question is about the readability of drafts and so on. From now on, when 
I have to choose between fractions this way: '2/7', and this way: '²⁄₇', should I 
always use ASCII only? I'm thinking of an e-mail, like this one. I'm still unable 
to understand why the unformatted fraction should be better than the preformatted 
presentation (even when the latter is suboptimal). 

I still believe that keyboard layout developers owe it to users to provide each and 
every character of a given script, along with the related sets of numerals, generic 
punctuation and symbols, in order to enable the end-user to choose whatever effect 
he intends to produce. Since keyboards shape the practice, people are probably 
best served when the layout allows everybody to adapt to all use cases. 

Earlier on Thu, 5 Jan 2017 03:55:06 -0800, Asmus Freytag wrote:
> 
> On 1/4/2017 4:33 PM, Doug Ewell wrote:
> > 
> > > What I complain of as not mentioned in the Standard, is that U+2044
> > > can be used with superscript and subscript digits, rather than ASCII
> > > digits.
> > 
> > Almost any character(s) in Unicode "can be" used with almost any other.
> > You can surround U+2044 with emoji if you like. That doesn't mean you
> > should.
> 
> This is a key point.
>
> You can use many code points to get some "effect", but that doesn't mean 
> it represents good practice or should be recommended.

This is particularly true for the French use of DEGREE SIGN for superscript o, 
which 99 % of users are said to type to get the 'n°' abbreviation, or 'r°', 
'v°', 'f°'. It doesn't look really bad, is stable, and easy to input. The downside 
shows up at the latest when a plural s has to be appended. And even before that, it's poor 
typography, because depending on the font, the degree sign may look very different 
from a real superscript o. In this respect, the modifier letter o is far better.

> There are no "traffic cops" out there that will flag you down for having made 
> a poor decision, but that's not a reason enough to endorse random suggestions.
>
> This goes particularly for practices that need support in systems and/or fonts to work
> correctly. If some implementer supports the recommended normal size digits for 2044
> why should they do the additional work of making sure it works for super/sub script.

If implementers really do support the fraction slash U+2044 as triggering 
authentic fraction formatting, then they may spare the extra work. But this feature 
is uncommon enough to warrant thinking seriously about the fallback options. And if, 
despite being discouraged by all recommendations (including mine), the use of super/sub 
scripts thrives, it would be a good idea to support them along with normal 
size digits, all the more as this does not require a lot of supplemental code (just 
twenty equivalence classes, I guess).
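
The "twenty equivalence classes" could look roughly like the following Python 
sketch, which folds the ten superscript and ten subscript digits onto ASCII 
digits so that a fraction written around U+2044 is treated like the recommended 
form. The function names are hypothetical, and this only illustrates the 
mapping, not how any renderer actually handles the fraction slash.

```python
FRACTION_SLASH = "\u2044"
ASCII_DIGITS = "0123456789"
# SUPERSCRIPT ZERO..NINE (1, 2 and 3 live in Latin-1) and SUBSCRIPT ZERO..NINE.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPT_DIGITS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"

# Twenty one-to-one equivalences: each styled digit maps to its ASCII digit.
_TO_ASCII = str.maketrans(SUPERSCRIPT_DIGITS + SUBSCRIPT_DIGITS, ASCII_DIGITS * 2)
_TO_SUPER = str.maketrans(ASCII_DIGITS, SUPERSCRIPT_DIGITS)
_TO_SUB = str.maketrans(ASCII_DIGITS, SUBSCRIPT_DIGITS)


def fold_fraction_digits(text):
    """Replace super-/subscript digits by ASCII digits; U+2044 is kept."""
    return text.translate(_TO_ASCII)


def styled_fraction(numerator, denominator):
    """Build the plain text fallback: superscripts, U+2044, subscripts."""
    return (str(numerator).translate(_TO_SUPER)
            + FRACTION_SLASH
            + str(denominator).translate(_TO_SUB))
```

For example, styled_fraction(2, 7) gives the three-character sequence 
U+00B2 U+2044 U+2087, and fold_fraction_digits() turns it back into the 
digits 2 and 7 around the fraction slash.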

In the meantime, what options are available as a fallback? The recommendation [1] 
is unrealistic: A system (OS + program + font) that is unable to map digits to 
numerators/denominators cannot be expected to map U+2044 to U+002F either, as 
specified. Therefore, the fraction slash is left between the digits. Since in most 
proportional fonts it is kerned so tightly that it overlaps baseline digits when 
displayed in between, this can hardly be used as a recommended fallback. It looks good 
in a few fonts only. Obviously, this use 
case is not intended. 

So perhaps all users might be given the unrecommended possibility to choose an 
unrecommended second-best solution. This would require making sure that everybody 
understands the risk of running into issues. In any case, U+2044 ought 
to be on the keyboard, according to the Standard (in order to input the specified 
sequences). As for super/sub scripts, I think it would be a pity to keep them away. 

The rest could probably be considered as being up to the user.

In any case, fashion is unforeseeable.

Marcel

[1] TUS 9.0, §6.2, p. 277:
| 
| If the displaying software is incapable of mapping the fraction to a unit, then it can also be
| displayed as a simple linear sequence as a fallback (for example, 3/4). [?]
| 
http://www.unicode.org/versions/Unicode9.0.0/ch06.pdf#G2000


From petercon at microsoft.com  Thu Jan  5 16:35:29 2017
From: petercon at microsoft.com (Peter Constable)
Date: Thu, 5 Jan 2017 22:35:29 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
Message-ID: <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marcel Schneider
Sent: Thursday, January 5, 2017 3:34 AM

> If Arial Unicode MS is used (though it is no longer a part of new Windows versions)

The Arial Unicode MS font was never included in any version of Windows. It was only ever included in Microsoft Office.


Peter


From charupdate at orange.fr  Thu Jan  5 23:42:14 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 Jan 2017 06:42:14 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
Message-ID: <538089927.246.1483681334517.JavaMail.www@wwinf1p15>

On Thu, 5 Jan 2017 22:35:29 +0000, Peter Constable wrote:
> 
> From: Unicode [mailto:unicode-bounces_at_unicode.org] On Behalf Of Marcel Schneider 
> Sent: Thursday, January 5, 2017 3:34 AM 
> 
> > If Arial Unicode MS is used (though it is no longer a part of new Windows versions) 
> 
> The Arial Unicode MS font was never included in any version of Windows. It was only 
> ever included in Microsoft Office. 

I'm very sorry, thank you for the correction. I've mixed up OS and applications. 
Now it displays: "Arial Unicode MS is unavailable on this machine. Do you want to 
use it nevertheless?" (translated from French). For a few days now I've known that Arial 
Unicode MS is a part of the system fonts on macOS Snow Leopard. I'm now unable to 
use it on my netbook. Even existing documents don't display well any more; they're 
littered with .notdef boxes. When trying to get preformatted custom fractions the 
old way, it switches to MS Gothic for the subscripts. In 2015, I was about 
to buy Office 2010. Office 2013 requires too much RAM and I don't like it. I'm 
aware that many people are in roughly the same position, at least regarding the 
Arial Unicode MS font. So perhaps representing fractions with super/sub scripts 
ought to be removed from my recommendations, at least for anything more than drafts or 
informal papers. However, it seems to match the expectations of many people.

But that's the least part of the topic. The main concern in this thread is the use 
of modifier letters as a fallback instead of ordinal indicators and for superscript 
in abbreviations. I agree that inside the document, formatting is much more powerful, 
as it doesn't require complete fonts (and makes style fine-tuning easy). Nevertheless, 
the user might prioritize the stability of the document when it comes to plain text, 
and he could be interested in a better-looking display of letters that elsewhere 
should be superscripted. Here, the modifier letters could be a ready-to-use fallback. 
Converting them to formatted baseline letters could be achieved with a macro in VBA.
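
As a rough idea of what such a conversion involves, here is a minimal sketch 
in Python rather than VBA, assuming HTML <sup> markup as the target format. It 
relies on the <super> compatibility decompositions in the Unicode Character 
Database, as exposed by the standard unicodedata module; the function name is 
hypothetical, and no HTML escaping is done.

```python
import unicodedata


def modifiers_to_sup_markup(text):
    """Wrap runs of <super>-decomposable characters in <sup>...</sup>."""
    out = []
    run = []  # baseline letters collected for the current superscript run

    def flush():
        if run:
            out.append("<sup>" + "".join(run) + "</sup>")
            run.clear()

    for ch in text:
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith("<super>"):
            # e.g. "<super> 006F" for MODIFIER LETTER SMALL O
            code_points = decomposition.split()[1:]
            run.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)
```

Note that superscript digits and the ordinal indicators carry the same 
<super> decomposition, so they would be converted as well; a real macro would 
probably want an explicit list of the characters to touch.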

Couldn't this be included in the next Office version as an out-of-the-box feature?

Marcel


From asmusf at ix.netcom.com  Fri Jan  6 02:21:29 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 6 Jan 2017 00:21:29 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
Message-ID: <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170106/ceeb6388/attachment.html>

From charupdate at orange.fr  Fri Jan  6 08:30:19 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 Jan 2017 15:30:19 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
Message-ID: <514716041.11559.1483713019939.JavaMail.www@wwinf1p15>

On Fri, 6 Jan 2017 00:21:29 -0800, Asmus Freytag wrote:
> 
> On 1/5/2017 9:42 PM, Marcel Schneider wrote:
> > 
> > Nevertheless, 
> > the user might prioritize the stability of the document when it comes to plain text, 
> > and he could be interested in a better-looking display of letters that elsewhere 
> > should be superscripted. Here, the modifier letters could be a ready-to-use fallback
> 
> The use of such hacks is destabilizing to any efforts to systematically format superscripts 
> across a document.

That supposes a rich text environment. The orthographical correctness of some 
languages, among them French, traditionally requires either a rich text environment 
or some in-line markup like TeX (at the expense of direct usability, i.e. without 
a LaTeX converter). That borders on non-conformance with the design principles of Unicode. 
As I understand them, Unicode provides all characters that are needed to correctly 
spell any language. This goal remains unreached as long as the orthography of some 
languages cannot be entirely achieved without relying on formatting markup. (I'm 
aware that complex scripts require hinted fonts for glyph reordering and glyph 
substitution, but this still is plain text.) 

The superscripting of abbreviation endings belongs to another level of correctness 
than the arbitrary stress as expressed with italics, bold, underline (obsolete in 
this use), extra letter spacing (German, rather old-style), capitalization, or 
extra acute accents as in Dutch. 

This is why Karl Pentzlin [1] cited "Biblio^{que}" vs "Biblioque", where the latter 
is "no valid French word." 

From this it now becomes clear that Alastair Houghton's suggestion [2] of encoding 
a superscript variant selector would meet this requirement and is therefore not 
to be confused with the first step towards making Unicode support rich text.

To say it out loud: The fact that French and a few other languages cannot be written 
in a correct orthography when the environment is plain text seems to me hard to 
accept.

> Text fonts may not support them, because for "ordinary" text, by Unicode's
> recommendation, one would use ordinary letters / digits with superscript markup.

A text font that does not support all modifier letters is less a text font than 
a title font. Ornamental fonts are produced in such a variety that completing 
them is/was economically unfeasible. I'm considering this statement rather in the 
past tense, because diacriticized letters are already (on request) automatically 
generated and added to the font at creation. If automatic superscripting isn't 
implemented already, it will be soon, I suppose. So more and more (new and 
updated) fonts will support them. But wherever they don't, a _Convert modifier 
letters to superscript_ feature (or an equivalent macro command) ought to be able 
to make the text conformant to legacy handling.

> So, by using these hacks, anytime a document is re-formatted with a different font style,
> you are in danger of either losing these to boxes, or to be faced with random font styles.

Yes, people should always be aware that the use of modifier letters has its downside, 
as has the use of superscripted baseline letters. I currently write e-mails (like 
this one) in a text editor (Notepad++). Several features I use here are IMO missing 
in all e-mail clients, such as column editing, line reordering, and so on. So I appreciate 
being able to spell correctly in plain text, without sloppy fallbacks (i.e. baseline 
fallbacks for superscript). It's a matter of making the most of the existing charset.
I believe that modifier letter fallbacks are very functional. When I paste them into 
an HTML mail form, the display is always correct and there is no need to add superscript 
by hand throughout the mail. Furthermore, I can even use superscript in the subject.

> If you don't think that is a real problem: some (many) character pickers will insert font+code point into
> an application. These font bindings often survive and suddenly your text, when read on a different
> computer looks like a ransom note, just because the new machine has a new "default" font, and 
> that is applied to all letters that don't have a specific font binding.

Basically this is a good scheme, because character pickers typically are used for 
symbols. There are also two kinds: local, and online. I sometimes pick in the 
full-size PDF of the Code Charts. They're the best character picker IMO.

> Some font pickers are "stupid" enough to do this for simple accented code points that would have
> been in the currently selected font anyway.

That's really bad. I know that some people are writing documents by picking accented 
letters in the special characters dialog. I can imagine that some other people 
may use an online picker instead, partly because the word processor they're using 
may be a web-app. Anyhow, this is very inefficient. The reason may be that one 
often thinks either that a keyboard cannot be completed, or that completing a 
keyboard would make it unusable, or hard to use, or full of stickers. This is one 
main challenge of keyboard layout development.

> Your suggestions will just add to these problems.
> If editing in a rich text environment, work in rich text. And then lean on implementers to get
> export correct to other rich text formats....

I used to work nearly all the time in a rich text environment, and I added plenty 
of autocorrections to speed up writing. Today, I work most of the time in plain 
text. I don't use LaTeX, but I know that it is easily exported to many other 
formats. PDF is a main target format. Most of the drawbacks start when the reader 
wishes to copy-paste some lines of a (basically searchable) PDF either to rich text 
or to plain text, but that is not the issue here.

I hope that my future recommendations will solve more problems than they'll create!

Marcel

[1] Karl Pentzlin's MODIFIER LETTER SMALL Q proposal: 
http://www.unicode.org/L2/L2010/10230-modifier-q.pdf

[2] Alastair Houghton's SUPERSCRIPT/SUBSCRIPT variant selectors suggestion:
http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0016.html


From charupdate at orange.fr  Fri Jan  6 13:02:25 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 Jan 2017 20:02:25 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
Message-ID: <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>

Another important point for the modifier letter fallbacks to work (if supported), 
would be that fonts support diacritics combined with modifier small letters. 
In 2014 I requested the superscript small '?' (not noticing that the intended 
abbreviation is incorrect), but encoding new characters like this one would be 
useless because it is decomposable, and out of date since the deadline is long past.
But the superscript 'é' that I've recently mentioned is still used (in 'S^{té}' for 
'Société' [Corporation], different from 'Sᵗᵉ' which is the abbreviation of 'Sainte' 
[Saint, feminine]); and in Spanish, superscript '?' is used, as Denis Jacquerye noted 
while pointing out the need to work with, and enhance support of, higher level 
protocols. [4]

Higher level protocols will remain recommended as the standard high-end solution, 
while the use of modifier letters could get the status of an alternate fallback. 
Once it has that status, modifier letter small q could be encoded and the whole set updated 
at font level for support of combining diacritics, while software may add two commands 
for round-trip conversion between modifier letters and superscript baseline letters, 
and probably between preformatted fractions and formatted fractions; I'm quite sure 
that all this is possible right now in VBA.

I've added some more references to my previous mail with respect to last year's 
discussion of formatting variation selectors. As there was a typo and missing line 
breaks (symptomatic of not using any spell checker and of editing the layout by 
hand in a text editor), I feel the need to append the corrected version 
below. 

Best regards,

Marcel

On Fri, 6 Jan 2017 00:21:29 -0800, Asmus Freytag wrote:
> 
> On 1/5/2017 9:42 PM, Marcel Schneider wrote:
> > 
> > Nevertheless, 
> > the user might prioritize the stability of the document when it comes to plain text, 
> > and he could be interested in a better-looking display of letters that elsewhere 
> > should be superscripted. Here, the modifier letters could be a ready-to-use fallback
> 
> The use of such hacks is destabilizing to any efforts to systematically format superscripts 
> across a document.

That presupposes a rich text environment. The orthographical correctness of some 
languages, among them French, traditionally requires either a rich text environment 
or some in-line markup like TeX (at the expense of direct usability, i.e. without 
a LaTeX converter). That is borderline non-conformant to the design principles of Unicode. 
As I understand them, Unicode provides all characters that are needed to correctly 
spell any language. This goal remains unreached as long as the orthography of some 
languages cannot be entirely achieved without relying on formatting markup. (I’m 
aware that complex scripts require hinted fonts for glyph reordering and glyph 
substitution, but this still is plain text.) 

The superscripting of abbreviation endings belongs to another level of correctness 
than the arbitrary stress as expressed with italics, bold, underline (obsolete in 
this use), extra letter spacing (German, rather old-style), capitalization, or 
extra acute accents as in Dutch. 

This is why Karl Pentzlin [1] cited “Biblio^{que}” vs “Biblioque”, where the latter 
is “no valid French word.” 

From this it now becomes clear that Alastair Houghton’s [2] suggestion of encoding 
a superscript variant selector would meet this requirement and is therefore not 
to be confused with a first step towards making Unicode support rich text. This 
was indeed the traditional argument raised against previous similar suggestions. [3]

Under the current scheme, French and a few other languages cannot be written 
in a correct orthography when the environment is plain text. That seems to me 
hard to accept.

> Text fonts may not support them, because for "ordinary" text, by Unicode's
> recommendation, one would use ordinary letters / digits with superscript markup.

A text font that does not support all modifier letters is less a text font than 
a title font. Ornamental fonts are produced in such a variety that completing 
them is/was economically unfeasible. I’m considering this statement rather in the 
past tense, because diacriticized letters are already (on request) automatically 
generated and added to the font at creation. If automatic superscripting isn’t 
already implemented, it will be soon, I suppose. So more and more (new and 
updated) fonts will support them. But wherever they don’t, a _Convert modifier 
letters to superscript_ feature (or an equivalent macro command) ought to be able 
to make the text conformant to legacy handling.

> So, by using these hacks, anytime a document is re-formatted with a different font style,
> you are in danger of either losing these to boxes, or to be faced with random font styles.

Yes, people should always be aware that the use of modifier letters has its downside, 
as has the use of superscripted baseline letters. I currently write e-mails (like 
this one) in a text editor (Notepad++). Several features I use here are IMO missing 
in all e-mail clients, such as column editing, line reordering, and so on. So I appreciate 
being able to spell correctly in plain text, without sloppy fallbacks (i.e. baseline 
fallbacks for superscript). It’s a matter of making the most of the existing charset.
I believe that modifier letter fallbacks are very functional. When I paste them into 
an HTML mail form, the display is always correct and I don’t need to add superscript 
by hand throughout the mail. Furthermore, I can even use superscript in the subject.

> If you don't think that is a real problem: some (many) character pickers will insert font+code point into
> an application. These font bindings often survive and suddenly your text, when read on a different
> computer looks like a ransom note, just because the new machine has a new "default" font, and 
> that is applied to all letters that don't have a specific font binding.

Basically this is a good scheme, because character pickers typically are used for 
symbols. There are also two kinds: local, and online. I sometimes pick in the 
full-size PDF of the Code Charts. They’re the best character picker IMO.

> Some font pickers are "stupid" enough to do this for simple accented code points that would have
> been in the currently selected font anyway.

That’s really bad. I know that some people are writing documents by picking accented 
letters in the special characters dialog. I can imagine that some other people 
may use an online picker instead, partly because the word processor they’re using 
may be a web app. Anyhow, this is very inefficient. The reason may be that people 
often think either that a keyboard cannot be completed, or that completing a 
keyboard would make it unusable, or hard to use, or full of stickers. Here’s one 
main challenge of keyboard layout development.

> Your suggestions will just add to these problems.
> If editing in a rich text environment, work in rich text. And then lean on implementers to get
> export correct to other rich text formats....

I did work nearly all the time in a rich text environment, and I added plenty 
of autocorrections to speed up writing. Today, I work most of the time in plain 
text. I don’t use LaTeX, but I know that it is easily exported to many other 
formats. PDF is a main target format. Most of the drawbacks start when the reader 
wishes to copy-paste some lines of a (basically searchable) PDF either to rich text 
or to plain text, but that is not the issue here.

I hope that my future recommendations will solve more problems than they’ll create!

Marcel

[1] Karl Pentzlin’s MODIFIER LETTER SMALL Q proposal: 
http://www.unicode.org/L2/L2010/10230-modifier-q.pdf

[2] Alastair Houghton’s SUPERSCRIPT/SUBSCRIPT variant selectors suggestion:
http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0016.html

[3] Re: Why incomplete subscript/superscript alphabet ? a.lukyanov
http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0001.html
Re: Why incomplete subscript/superscript alphabet ? Leonardo Boiko 
http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0013.html
Re: Why incomplete subscript/superscript alphabet ? Jukka K. Korpela 
http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0014.html
Re: Why incomplete subscript/superscript alphabet ? Steve Swales 
http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0015.html
Re: Why incomplete subscript/superscript alphabet ? Neil Harris 
http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0017.html

[4] Re: Why incomplete subscript/superscript alphabet ? Denis Jacquerye 
http://www.unicode.org/mail-arch/unicode-ml/y2016-m10/0037.html


From christoph.paeper at crissov.de  Fri Jan  6 17:21:37 2017
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Sat, 7 Jan 2017 00:21:37 +0100
Subject: WAP Pictogram Specification as Emoji Source
Message-ID: <F8262A7A-6B4A-4284-95E7-22143F93A23F@crissov.de>

I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic), last published in April 2001 and updated in November 2001.

- http://www.wapforum.org/what/technical.htm (requires OMA credentials)
- http://www.openmobilealliance.org/tech/affiliates/wap/wap-213-wapinterpic-20010406-a.pdf

It describes a way to reference locally stored graphics using the `pict` URL scheme in WML or XHTML:

    <img
     localsrc="pict:///core/arrow/right"
          src="http://www.pict.com/xx/rightArrow.wbmp"
          alt="-&gt;"
     />
    
    <object data="pict:///time/season/winter">
      <object data="pict:///weather/snow">
        <img src="http://www.pict.com/xx/snowman.wbmp"
             alt="snowman"/>
      </object>
    </object>
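
As an illustration only (it is not taken from the WAP specification), a `pict` URL like the ones above could be split into its category path with a few lines of Python. The helper name `parse_pict` is hypothetical; the sketch relies solely on the standard `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit


def parse_pict(url):
    """Split a pict: URL into its path components (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "pict":
        raise ValueError("not a pict URL: " + url)
    # "pict:///core/arrow/right" has an empty authority and the path
    # "/core/arrow/right"; drop the empty segment before the first slash.
    return [segment for segment in parts.path.split("/") if segment]


print(parse_pict("pict:///core/arrow/right"))
print(parse_pict("pict:///time/season/winter"))
```

A renderer could then look the resulting components up in its local pictogram store and fall back to `src` or `alt` when nothing matches.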

Reading through section 7 Pictogram Set, it’s obvious that WAP pictograms have been unified with Japanese (i-mode) emojis upon their encoding in Unicode 6+. However, the mapping is not obvious in all cases and I think there are some pictograms that have been omitted / forgotten or could have better annotation, e.g.:

- /emotion/{trapped,tutting,shine,smell,pullFace,shakenHeart}
- /human/body/foot
- /map/{policeStation,spa,zoo}
- /sport/{sport,scuba}
- /time/event{anniversary,holiday,newYearsDay}

I can imagine a crudely equivalent Unicode emoji for almost all of them, but definitely not for Scuba Diving. I haven’t seen (or at least not recognized) scuba gear, a flipper, a snorkel or a diver in documentation of Japanese vendor sets.

Is there a mapping file available at the Unicode website that I’ve missed?

I haven’t found any reference or vendor-specific images, by the way, and if it wasn’t just used as an example domain anyway, pict.com seems now defunct.

From duerst at it.aoyama.ac.jp  Fri Jan  6 22:12:10 2017
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Sat, 7 Jan 2017 13:12:10 +0900
Subject: WAP Pictogram Specification as Emoji Source
In-Reply-To: <F8262A7A-6B4A-4284-95E7-22143F93A23F@crissov.de>
References: <F8262A7A-6B4A-4284-95E7-22143F93A23F@crissov.de>
Message-ID: <28146f12-21df-f7d0-2b15-aa39e1c78eaf@it.aoyama.ac.jp>

On 2017/01/07 08:21, Christoph Päper wrote:
> I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic), last published in April 2001 and updated in November 2001.

> I haven’t found any reference or vendor-specific images, by the way, and if it wasn’t just used as an example domain anyway, pict.com seems now defunct.

Isn't WAP overall pretty much defunct these days?

(Well, many including me predicted as much pretty much when it first 
showed up.)

Regards,   Martin.

From verdy_p at wanadoo.fr  Sat Jan  7 05:43:03 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 7 Jan 2017 12:43:03 +0100
Subject: WAP Pictogram Specification as Emoji Source
In-Reply-To: <28146f12-21df-f7d0-2b15-aa39e1c78eaf@it.aoyama.ac.jp>
References: <F8262A7A-6B4A-4284-95E7-22143F93A23F@crissov.de>
 <28146f12-21df-f7d0-2b15-aa39e1c78eaf@it.aoyama.ac.jp>
Message-ID: <CAGa7JC2hw2eY5Hi6D3JGSbkGNvcadgiXWNpzs23ow6=4wAceVQ@mail.gmail.com>

Technically it is operational within operators. Old mobile phones still
have an advantage that has been completely forgotten with smartphones:
their very long battery life. There are still mobile phones sold
today that are NOT smartphones, have NO Internet connectivity (only
GSM/EDGE and SMS) and that will stay charged for about 2 weeks, whereas my
smartphone runs out of charge in less than 24 hours (or several times a
day).
So no complex layered networking protocol stacks, no advanced typography
and a minimalist display. WAP is still supported on the EDGE/GPRS interface
(used also with the Internet protocol under 2G networks which works almost
everywhere when 3G/4G/5G signals cannot be received).
However don't expect to use this for feature-rich interaction, including for
sending cute "WAP pictograms" that these devices will not be able to
decipher and render anyway. I bet that WAP pictograms were an early test
specification that was in fact never needed, because the target audience
was better served by Internet protocols and encoding standards, but
also because no one really wanted to administer a registry for the names
(see the death of pict.com: no one paying for it, specification redundant
with classic URIs on the web for referencing images), or to standardize the
glyphs.

Existing standards with normalized glyphs and semantics do exist, however,
notably for traffic signs (on streets/roads, railways, rivers/canals,
seas...), or in various industry standards (including for food, chemical
products, or cleaning instructions for textiles, or additional glyphs for
recycling, hazards or pollution). We are far from being complete in Unicode
there, even if the supporting standards are effective, sometimes even
mandatory, and widely used. The problem for them is that these standards are
not necessarily international, and are incompatible with each other but still
regulated and required, and you cannot unify the glyphs specified by one of
these standards with those from a competing standard (or with those glyphs
already implemented in the UCS). And for now Unicode has resisted the idea
of standardizing sets of symbols for specific standards, notably if the
glyphs are too strictly defined (not allowing variations/derivations
without breaking the intended regulated semantics).


2017-01-07 5:12 GMT+01:00 Martin J. Dürst <duerst at it.aoyama.ac.jp>:

> On 2017/01/07 08:21, Christoph Päper wrote:
>
>> I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic),
>> last published in April 2001 and updated in November 2001.
>>
>
> I haven’t found any reference or vendor-specific images, by the way, and
>> if it wasn’t just used as an example domain anyway, pict.com seems now
>> defunct.
>>
>
> Isn't WAP overall pretty much defunct these days?
>
> (Well, many including me predicted as much pretty much when it first
> showed up.)
>
> Regards,   Martin.
>

From charupdate at orange.fr  Sat Jan  7 21:48:04 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 8 Jan 2017 04:48:04 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
Message-ID: <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>

I’m quickly bringing two updates to what has been previously said, while there’s
no other ongoing discussion:

# Combining diacritics on modifier letters

I’m surprised to see that combining diacritics are already supported with 
modifier letters. When in my past mails I believed that they weren’t, I was remembering 
some example in a thread from last year that didn’t look good, nor does the rendering 
in my drafts.

Now I’ve used the string U+0053 U+1D57 U+1D49 U+0301 ('Sᵗᵉ́') in the subject and 
in the body of an e-mail, sent and received it and printed it out to a PDF file. 
It renders fairly well everywhere except in the subject while writing (too high) and 
in the subject in the inbox and the displayed mail (too far). This is clearly a 
font issue (equally in Chrome and in Firefox, using a webmail).
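
For anyone wishing to inspect the sequence U+0053 U+1D57 U+1D49 U+0301, the following Python sketch may help. It is only an illustration using the standard `unicodedata` module; the helper name `describe` is hypothetical. It lists each code point with its name and combining class, then compares canonical (NFC) and compatibility (NFKC) normalization of the string.

```python
import unicodedata

# The sequence from the test e-mail: S, modifier t, modifier e, combining acute.
SAMPLE = "\u0053\u1d57\u1d49\u0301"


def describe(text):
    """Return (code point, name, canonical combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


for row in describe(SAMPLE):
    print(row)

# No precomposed "modifier e with acute" exists, so NFC keeps the sequence as is.
print(unicodedata.normalize("NFC", SAMPLE) == SAMPLE)
# Compatibility normalization folds the modifier letters to baseline letters.
print(unicodedata.normalize("NFKC", SAMPLE))
```

The acute accent therefore always travels as a separate combining mark after the modifier letter, which is why its placement depends entirely on the font.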

When used in business mail, this could be appealing on one hand (at least once the 
buggy fonts have been updated) and convey a connotation of respectfulness, while 
on the other hand it could still raise the suspicion of inefficiency and wasted 
time, as long as people aren’t aware how easy it is to input and think of 
character pickers. I actually hold two modifiers down while typing 'te', and hit 
the acute dead key (in the base shift state) and the space bar. To some degree 
it’s the same situation as with the letter 'œ', that many people here still type 
as the ASCII fallback 'oe' (despite having a shortcut in Word) and that is now 
coming to the standard keyboard.

# Expected modifier letter small q

Given that the use of modifier letters in the place of formatted superscript in 
abbreviations in English, French, Italian, Portuguese, Spanish and perhaps some 
other languages is a fallback that for high-end processing is to be replaced with 
formatted baseline letters, one could consider using a fallback character while 
waiting for the *MODIFIER LETTER SMALL Q to be encoded. The best approximation 
seems to be U+1DA3 MODIFIER LETTER SMALL TURNED H. In a draft, the abbreviation 
of 'Bibliothèque' [Library] as in [1] would then be spelled 'Biblioᶣᵘᵉ'. That could 
eventually become a fixed convention supported by the conversion macros (that 
target only abbreviations in natural languages, not phonetics, nor random strings). 

Among the fallbacks discussed so far, this last one could be considered a “hack”. 
This is why it must not be put in the place of *MODIFIER LETTER SMALL Q on the 
keyboard layouts. This allocation should still output a message string such as 
“ ^q_unavailable” or “ ^q_n’existe_pas” (the maximum number of characters is 16, 
conforming to the Windows limitation; on macOS it’s practically 20, because with 
a few more, TextEdit on Snow Leopard shuts down, whatever "maxout" is set to). 
U+1DA3 MODIFIER LETTER SMALL TURNED H can be input by '[superscript]#h', where 
'#' (after compose or another dead key) is the (newly defined) composition 
character for "turned".

Once the macros are written and available (any help is welcome!), should I 
still be flagged down for undisciplined hacks and for endorsing random suggestions?

Kind regards,

Marcel

[1] Karl Pentzlin’s *MODIFIER LETTER SMALL Q proposal: 
http://www.unicode.org/L2/L2010/10230-modifier-q.pdf


From charupdate at orange.fr  Sun Jan  8 05:43:02 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 8 Jan 2017 12:43:02 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
Message-ID: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>

On Sun, 8 Jan 2017 04:48:04 +0100 (CET), I wrote:

> Among the fallbacks discussed so far, this last one could be considered a “hack”. 
> This is why it must not be put in the place of *MODIFIER LETTER SMALL Q on the 
> keyboard layouts. This allocation should still output a message string such as: 
> “ ^q_unavailable”, or “ ^q_n’existe_pas” (the maximum number of characters is 16, 

Please read 'code units' instead of “characters”.
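
To make the distinction concrete, here is a small Python sketch that counts UTF-16 code units rather than characters. The helper name `utf16_code_units` is hypothetical, and the exact apostrophe in the French message (assumed here to be U+2019) is a guess; either apostrophe occupies a single code unit.

```python
def utf16_code_units(text):
    """Count UTF-16 code units: 2 bytes each in UTF-16LE, no BOM."""
    return len(text.encode("utf-16-le")) // 2


# The two message strings, each with its leading space.
MESSAGES = [" ^q_unavailable", " ^q_n\u2019existe_pas"]

for message in MESSAGES:
    print(repr(message), utf16_code_units(message))

# A character outside the BMP takes two code units (a surrogate pair),
# so the limit can be hit with fewer than 16 characters.
print(utf16_code_units("\U0001d57c"))
```

Under these assumptions both strings fit into the 16 code unit limit mentioned for Windows.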

The polysemy of the underscore is another collateral issue. In the strings above, 
its current function as a space replacement gets mixed up with its possible TeX meaning. 
I wonder whether the LaTeX converter is able to parse that correctly. 

Please note, too, that the leading space and the low lines are intended to facilitate 
the deletion of the message by hitting Ctrl + Backspace.

> conforming to the Windows limitation; on macOS it’s practically 20, because with 
> a few more, TextEdit on Snow Leopard shuts down, whatever "maxout" is set to). 

When that happened, it was set to a higher value. Then I shortened the output 
string and set “maxout="20"”.

Anyhow, this is no longer an issue here, and this shortcoming is mentioned only for 
the sake of fairness.

Marcel


From charupdate at orange.fr  Sun Jan  8 09:38:33 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 8 Jan 2017 16:38:33 +0100 (CET)
Subject: WAP Pictogram Specification as Emoji Source
In-Reply-To: <CAGa7JC2hw2eY5Hi6D3JGSbkGNvcadgiXWNpzs23ow6=4wAceVQ@mail.gmail.com>
References: <F8262A7A-6B4A-4284-95E7-22143F93A23F@crissov.de>
 <28146f12-21df-f7d0-2b15-aa39e1c78eaf@it.aoyama.ac.jp>
 <CAGa7JC2hw2eY5Hi6D3JGSbkGNvcadgiXWNpzs23ow6=4wAceVQ@mail.gmail.com>
Message-ID: <502664845.9352.1483889914046.JavaMail.www@wwinf1p19>

On Sat, 7 Jan 2017 00:21:37 +0100, Christoph Päper wrote: 
> 
> I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic), 
> last published in April 2001 and updated in November 2001. 
> […]
> […] it’s obvious that WAP pictograms have been unified with Japanese (i-mode) emojis 
> upon their encoding in Unicode 6+. However, the mapping is not obvious in all cases 
> and I think there are some pictograms that have been omitted / forgotten […]

On Sat, 7 Jan 2017 13:12:10 +0900, Martin J. Dürst replied:
> 
> Isn't WAP overall pretty much defunct these days? 
> 
> (Well, many including me predicted as much pretty much when it first 
> showed up.) 

On Sat, 7 Jan 2017 12:43:03 +0100, Philippe Verdy replied:
> 
> Technically it is is operational within operators. Old mobile phones still 
> have an advantage that has completely been forgotten with smartphone, it is 
> their very long battery lifetime, and there are still mobile phones sold 
> today that are NOT smartphones, have NO Internet connectivity (only 
> GSM/EDGE and SMS

and MMS and WAP; this seems to be what I have.

> ) and that will remain in charge for about 2 weeks, when my 
> smartphone gets out of charge in less than 24 hours (or several times a day). 
> So no complex layered networking protocol stacks, no advanced typography 
> and a minimalist display. WAP is still supported on the EDGE/GPRS interface 
> (used also with the Internet protocol under 2G networks which works almost 
> everywhere when 3G/4G/5G signals cannot be received). 
> However don't expect using this for feature rich interaction including for 
> sending cute "WAP pictograms" that these devices will anyway not be able to 
> decipher and render.

My terminal is able to display colorful pictograms, but I remember that some 
time ago, the screens were mainly monochrome (and even smaller).

> I bet that WAP pictograms was an early specification 
> for test that was in fact never needed, because the target audience goal 
> was better achieved with Internet protocols and encoding standards, but 
> also no one really wanted to administer a registry for the names (see the 
> death of pict.com : no one paying for it, specification redundant with 
> classic URIs on the web for referencing images), or standardizing the glyphs. 
> 
> The existing standard with normalized glyphs and semantics however exist, 
> notably for traffic signs (on streets/roads, railways, rivers/canals, 
> seas...), or in various industry standards (including for food, chemical 
> products, or cleaning instructions for textiles, or additional glyphs for 
> recycling, hazards or pollution).

There must be several standards in various domains.

> We are far from being complete in Unicode 
> there, even if the supporting standards are effective, sometimes even 
> mandatory, and very used. The problem for them is that these standards are 
> not necessarily international, and incompatible with each other but still 
> regulated and required and you cannot unify the glyphs specified by one of 
> these standards with those from a competing standard (or with those glyphs 
> already implemented in the UCS).

Yes. This has been discussed in 2003:

http://unicode.org/mail-arch/unicode-ml/y2003-m06/0274.html

and in 2015:

http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0004.html
http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0013.html

> And for now Unicode has resisted the idea 
> of standardizing sets of symbols for specific standards, and notably if the 
> glyphs are too strictly defined (not allowing variations/derivations 
> without breaking the intended regulated semantics). 

That is the point. Such constraints run counter to the Unicode principles of 
encoding symbols, as Asmus Freytag explained in another context 12 days ago:

http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0115.html


Marcel


From charupdate at orange.fr  Sun Jan  8 15:05:10 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 8 Jan 2017 22:05:10 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
Message-ID: <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>

Is there any other reason not to use superscript characters for 
abbreviations in plain text? (Except those reasons already 
discussed, which boil down to the compatibility issue when it comes 
to rich text.)
(Same question for subscript digits for vulgar fractions, that may 
be considered a kind of abbreviation:
- 'Page two of seven' is represented as 'Page 2/7';
- 'Two sevenths' is (or should be) represented as '²⁄₇'.)
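
As a rough illustration of how such a plain-text fraction could be handed over to a formatting process, the Python sketch below rewrites superscript digits, U+2044 FRACTION SLASH and subscript digits into ordinary digits around the fraction slash, while leaving 'Page 2/7' untouched. The function name `to_fraction_slash_form` is hypothetical, and treating only this exact pattern as a fraction is an assumption of the sketch.

```python
import re

SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPT_DIGITS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
FRACTION_SLASH = "\u2044"

# Translation tables from super-/subscript digits to ASCII digits.
SUPER_TO_ASCII = str.maketrans(SUPERSCRIPT_DIGITS, "0123456789")
SUB_TO_ASCII = str.maketrans(SUBSCRIPT_DIGITS, "0123456789")

FRACTION = re.compile(
    f"([{SUPERSCRIPT_DIGITS}]+){FRACTION_SLASH}([{SUBSCRIPT_DIGITS}]+)"
)


def to_fraction_slash_form(text):
    """Rewrite superscript/subscript fractions as digits around U+2044."""
    def replace(match):
        numerator = match.group(1).translate(SUPER_TO_ASCII)
        denominator = match.group(2).translate(SUB_TO_ASCII)
        return numerator + FRACTION_SLASH + denominator

    return FRACTION.sub(replace, text)


print(to_fraction_slash_form("Page 2/7, \u00b2\u2044\u2087"))
```

The page reference with U+002F SOLIDUS is not matched, so only the vulgar fraction is converted.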

I’m asking this question now, while this thread is not yet closed.

I’m not in the habit of asking many questions, but perhaps this 
one I should have asked.

Marcel


From charupdate at orange.fr  Mon Jan  9 06:42:51 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 9 Jan 2017 13:42:51 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
Message-ID: <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>

On Fri, 6 Jan 2017 06:42:14 +0100 (CET), I wrote:
> […] Here, the modifier letters could be a ready-to-use fallback. 
> Converting them to formatted baseline letters could be achieved with a macro in VBA. 
> 
> Couldn’t this be included in the next Office version as an out-of-the-box feature? 
http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0036.html

A conversion macro for Notepad++ is now ready; it uses regexes and adds TeX markup.
To get around security issues this time, it is attached below. Thanks to Unicode for 
forwarding. The XML file has some explanations in the header and can be manually 
added to the user storage file of the software. Macros for LibreOffice and for 
Office in VBA are planned, but I cannot currently write them.
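
For readers without Notepad++, the same idea can be sketched in Python. This is not the attached macro, whose contents are not reproduced here; it is an independent illustration that relies on the `<super>` compatibility decompositions in the standard `unicodedata` module instead of regexes. The names `superscript_base` and `modifiers_to_tex` are hypothetical.

```python
import unicodedata


def superscript_base(ch):
    """Return the baseline character of a <super> compatibility mapping, else None."""
    fields = unicodedata.decomposition(ch).split()
    if len(fields) == 2 and fields[0] == "<super>":
        return chr(int(fields[1], 16))
    return None


def modifiers_to_tex(text):
    """Wrap runs of superscript-like characters in TeX ^{...} markup."""
    out = []
    run = []  # baseline equivalents collected for the current superscript run

    def flush():
        if run:
            # Compose base letters with their combining marks (e + acute -> é).
            out.append("^{" + unicodedata.normalize("NFC", "".join(run)) + "}")
            run.clear()

    for ch in text:
        base = superscript_base(ch)
        if base is not None:
            run.append(base)
        elif run and unicodedata.combining(ch):
            run.append(ch)  # a combining mark stays attached to the run
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)


print(modifiers_to_tex("S\u1d57\u1d49\u0301"))
```

Note that this sketch converts every character with a `<super>` mapping, including superscript digits and phonetic modifier letters, whereas the macros discussed here are meant to target abbreviations in natural languages only; a real tool would need that restriction.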

Along with this, I feel compelled to submit three detail issues around the topic:

(1) Interpretation of sequences: The fraction slash is specified to be interpreted 
as a sort of format control, and the software is supposed to either format the 
fraction, or to generate a linear fallback that in TUS (9.0, §6.2, p. 277) shows 
up with a character substitution, possibly to emulate a glyph substitution for 
U+2044 by a glyph similar to U+002F. But the software isn’t meant to perform a 
glyph substitution as a fallback for another glyph substitution. 

Is that process conformant to this requirement: “A process shall not assume that it 
is required to interpret any particular coded character sequence.” (TUS 9.0, §3.2, p. 80) 
Probably I’m missing some clues here.


(2) Font conformance: Most fonts seem unhinted so that they cannot substitute 
numerator and denominator glyphs, and the digits remain normal size. Nevertheless, 
U+2044 FRACTION SLASH kerns so much that it overlaps many of the adjacent digits, 
typically 3, 6, 8, 9, 0. Therefore and to get neat fractions, the user may work in 
rich text and use the generic super-/subscript formatting. The Unicode Core 
Specification gives hints for implementers to automate this process in another way, 
while leaving the door open to an unformatted fallback throwout. Some proportional 
fonts however have an unkerning fraction slash. These inconsistencies in support 
and display baffle me.

Is there any place in the Unicode Standard where the kerning is specified? I 
believe that there isn’t. So which design decision should be preferred? I think it 
would be the kerning option.


(3) Variation selectors: Today, many characters are given variation selector 
sequences, so that I believe that the idea could be maintained that letters and 
digits deserve some information about whether they are a part of abbreviations, 
or of a vulgar fraction (and then, whether they are numerators or denominators). 
While the latter can be catered for by the glyph substitution mechanism triggered 
by the presence of U+2044, the former would require an *ABBREVIATION INDICATOR as 
it has already been suggested, an invisible formatting control. This however should 
have been proposed twenty years ago by the mainly concerned communities. Adding and 
implementing it today would perhaps be inefficient, all the more so as the concerned 
sequences are mainly found in Latin script, where thanks to phonetics, superscript 
forms are already available.


After having completed this, I can’t help wondering about the dynamics that show up 
in this and other related threads over the years, and particularly these past days. 
While hardly anybody takes offence at the misuse of the DEGREE SIGN as a kind of 
superscript 'o', many objections are raised whenever people dare to grab the small 
modifier letters on the keyboard and type them in their text editors, e-mail clients 
and webmail forms. What is the matter with this practice? Proper handling of such 
text files turns out to be quite easy and straightforward, and round-trip conversion 
is within reach. For once here is a draft format that in some circumstances can even 
display with a finished look. 

For what reason should that be strongly discouraged and prohibited? This is all the 
more surprising as it would restore the missing equality of the world’s languages 
with respect to plain text. Is this still overstating that principle?


Regards,

Marcel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shortcuts.xml
Type: text/xml
Size: 22885 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20170109/dca4318f/attachment.xml>

From charupdate at orange.fr  Mon Jan  9 15:39:40 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 9 Jan 2017 22:39:40 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
Message-ID: <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>

I'm saddened to have fallen into a monologue. Thus I'll quickly debrief 
the arguments opposed so far, to check whether I'm missing some point:

• English ordinals with baseline endings are incorrect too, so that they need 
formatting, as do French ones:
→ English is in the same position as French and a few other languages that cannot be 
correctly spelt in plain text without superscript ordinal indicators. Here the 
modifier small letters can be a ready-to-use, often good-looking fallback.

• Those modifier letters have poor font support, so that the text is messed up:
→ Most work fonts do support them. For incomplete (ornamental) fonts, conversion 
tools will replace the modifier letters with formatted ordinary characters.

• The modifier letters don't currently match superscripting styles, nor do 
superscript and subscript digits match fraction styling in most fonts:
→ For high-end processing, the text is converted to legacy presentation. 
Fraction styling is missing in current software anyway, and superscript 
and subscript formatting doesn't match true vulgar fractions either, though 
nobody among the implementers seems to care.

• Implementers work hard to provide fraction styling, so they mustn't 
be bothered with alternate characters to support, such as super/subscripts:
→ This additional support is very easy to implement, as it typically needs no 
more than a set of equivalence classes.

• Character pickers add other problems with font bindings, when people use them 
instead of looking for an appropriate keyboard layout:
→ If the goal is to correctly spell all natural languages in plain text, the 
character availability is ideally complemented with updated keyboard layouts for input.
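
As a rough illustration of the "set of equivalence classes" mentioned in the list above, the compatibility decompositions already present in the Unicode Character Database could serve as a starting point. The sketch below uses Python's standard unicodedata module; the function name fold_scripts is made up for this example, and a real converter would need more care than this (accented endings, characters without a tagged decomposition, and so on).

```python
import unicodedata

def fold_scripts(text: str) -> str:
    """Replace characters tagged <super> or <sub> by their base characters."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)  # e.g. "<super> 0065" for U+1D49
        if decomp.startswith(("<super>", "<sub>")):
            # Drop the tag and decode the hexadecimal code points that follow.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)

# French "1er" typed with MODIFIER LETTER SMALL E and MODIFIER LETTER SMALL R.
print(fold_scripts("1\u1d49\u02b3"))
# SUPERSCRIPT ONE + FRACTION SLASH + SUBSCRIPT TWO; the slash is left alone.
print(fold_scripts("\u00b9\u2044\u2082"))
```

Only the tagged characters are folded, so ordinary text passes through unchanged.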

So much from memory. Going through the archives:

• Plain text is often unable to express stress or other aspects of the information 
that is a part of the content:
→ The issue is only about correctly spelling all languages in plain text. 
Superscripting of abbreviation endings (and of numerators) belongs to another level 
of correctness than arbitrary stress and other artificial additions.

• Unicode considers superscripting for the representation of natural languages 
as out of scope:
→ Whenever superscripting is required for the correct spelling and unambiguous 
representation of natural languages, this restriction should be relaxed, as it is 
for a set of technical notations.

• "As long as French is ordinary text, the abbreviations require styled (rich) text."
→ No human language can be relegated to rich text for its orthographically correct 
representation without infringing the design principles of Unicode.

• Baseline fallbacks are unambiguous for all French abbreviations, at least in 
context:
→ True. Some other scripts provide much less written information and leave more 
ambiguities. But this is intrinsic, not due to a lack of character encoding. Wherever 
baseline fallbacks are considered incorrect, or not "pure", superscripts must be 
provided in plain text, at least as an unambiguous fallback.

• Other means are available to unambiguously represent abbreviations, and they can 
be written out:
→ Every traditional spelling must be supported in Unicode.

• Some French and Spanish abbreviation endings include accented letters that aren't 
part of the limited set of modifier letters:
→ Following the Unicode design principles, a complete base letter alphabet suffices 
since combining diacritics can be added. In practice, these diacritics appear to 
sometimes interact well even with superscript base letters. Where they don't yet, 
it's a matter of updating the fonts, or alternatively of falling back to legacy 
processing (after using a macro, a plugin, or another tool to convert the text 
to legacy representation).

• Adding *MODIFIER LETTER SMALL Q for use as a superscript in natural languages 
would bring up the need to provide the same facilities to all other scripts, for 
equality's sake:
→ Latin script seems to be the only one that uses superscripts in running text. 
If some languages using other scripts still cannot be orthographically spelt in 
plain text, what remains is to work out the corresponding proposals to fill the gaps.

• Superscript abbreviations in natural languages must use generic formatting 
features, just as they are used for "footnotes, mathematical and chemical 
expressions, and the like":
→ These three domains are of another level of complexity, and thus cannot be 
compared with ordinary text. On the other hand, the use of formatting for 
orthographic superscripts in ordinary text should be considered a legacy fallback, 
not a standard way of writing natural languages.

• Vulgar fractions made of super- and subscripts are not machine-readable and 
cannot be parsed correctly without some convention that is not yet available, 
somewhat like arbitrary emoji or ASCII art:
→ They have a compatibility mapping to ASCII digits, and Unicode has taken care 
to prevent misparsing.

• Using superscripts and subscripts in abbreviations and fractions is bad practice, a 
sort of random suggestion:
→ It can be tagged as bad (though not really "random") practice because TUS does 
not specify it (while not discouraging it either). To make it good practice, 
referencing it in the standard as an alternate representation would suffice. 
(Cf. above, again:
• Implementers work hard to provide fraction styling, so they mustn't 
be bothered with alternate characters to support, such as super/subscripts:
→ This additional support is very easy to implement, as it typically needs no 
more than a set of equivalence classes.)

• Using those modifier letters and super/subscripts in these contexts is an 
undisciplined hack:
→ The idea that Unicode characters are only to be used with a specific, 
conventional meaning is considered a misperception of the Standard. Flagging 
character re-use as a hack places a severe limitation on the principle of 
character polysemy. This is the starting point from which we need to investigate 
who is enforcing this kind of discipline, who is interested in restricting the 
use of the discussed characters to keep (and even, to draw) people away from 
using them as long as they display well, and establishing new practice-proof usage 
protocols, including gateways to legacy protocols.

Hopefully,

Marcel


From richard.wordingham at ntlworld.com  Mon Jan  9 16:24:14 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 9 Jan 2017 22:24:14 +0000
Subject: Specification of Encoding of Plain Text
Message-ID: <20170109222414.72f83204@JRWUBU2>

Where, if anywhere, is the encoding of plain text specified?  I am
particularly concerned with the arrangement of the code sequences for
non-spacing abstract characters once one has determined an encoding for
the abstract characters.

For example, a naive reading of TUS 9.0 Section 16.4 Subsection
"Ordering of Syllable Components" would lead one to believe that the
word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
SIGN U, U+17C6 KHMER SIGN NIKAHIT>.  However, on further investigation,
I cannot find any text that says that <U+1781, U+17C6, U+17D2, U+1789,
U+17BB> would not be compliant with the Unicode standard.  Have I
missed anything?

One might hope that the subsection about 'logical order' in TUS 9.0
Section 2.2 Unicode Design Principles would help, but:

1) Section 3 'Conformance' says nothing about logical order; and
2) The subsection about 'logical order' seems to assume that there
exists a common practice; it does not actually place any requirement
on this common practice. 

Richard.


From asmusf at ix.netcom.com  Mon Jan  9 16:34:17 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 9 Jan 2017 14:34:17 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
Message-ID: <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170109/13862974/attachment.html>

From asmusf at ix.netcom.com  Tue Jan 10 02:06:05 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 10 Jan 2017 00:06:05 -0800
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170109222414.72f83204@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
Message-ID: <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170110/b0d7db3f/attachment.html>

From mark at macchiato.com  Tue Jan 10 03:11:41 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 10 Jan 2017 10:11:41 +0100
Subject: Specification of Encoding of Plain Text
In-Reply-To: <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
Message-ID: <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>

What I really wish we had would be a machine readable set of regexes for
each complex script (and for each language-script combination that is
different than the default for that script).

Such a regex R could be used for determining the well-formed ordering of
code points within words. The regex need not be for syllables, or grapheme
clusters, or any other formal construct. The *only* requirement it would
need to fulfill is that you could determine well-formed words with:

word := (R)+

That is, if R were (C V C? | V C?) then any of CVC CVCVC VC V CV would pass
the test, but CCV would fail. Ideally R would be as simple as possible (but
no simpler).
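
The toy rule above can be tried out directly. This is only a sketch over the abstract symbols "C" and "V" used in the example (not over real code points), written with Python's standard re module; the names R, WORD and is_well_formed are chosen here for illustration.

```python
import re

# Toy alphabet: "C" stands for any consonant, "V" for any vowel.
R = r"(?:CVC?|VC?)"               # R = (C V C? | V C?)
WORD = re.compile(rf"(?:{R})+")   # word := (R)+

def is_well_formed(word: str) -> bool:
    """True if the whole string can be split into one or more R units."""
    return WORD.fullmatch(word) is not None

for sample in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
    print(sample, is_well_formed(sample))
```

Using fullmatch rather than search matters here: the whole word has to be covered by repetitions of R, which is why CCV is rejected.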


Mark

On Tue, Jan 10, 2017 at 9:06 AM, Asmus Freytag <asmusf at ix.netcom.com> wrote:

> On 1/9/2017 2:24 PM, Richard Wordingham wrote:
>
> Where, if anywhere, is the encoding of plain text specified?  I am
> particularly concerned with the arrangement of the code sequences for
> non-spacing abstract characters once one has determined an encoding for
> the abstract characters.
>
> For example, a naive reading of TUS 9.0 Section 16.4 Subsection
> "Ordering of Syllable Components" would lead one to believe that the
> word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
> U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
> SIGN U, U+17C6 KHMER SIGN NIKAHIT>.
>
> Richard,
>
> the group of Khmer experts that developed the recent label generation
> rules for root zone domain names considers that ordering the only one
> supported, a specification you find here:
> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
>
> That document states:
>
> *7.4 Context of COENG Sign (U+17D2)*
> The sign ្ KHMER SIGN COENG (U+17D2) used for subscripting consonants must
> occur between two consonants. If it occurs between any other categories, it
> is not in a valid context so the label is not well formed. Further, the
> consonant following it must not include ឡ KHMER LETTER LA (U+17A1), ...
>
> So, you are not alone in thinking that the COENG goes between consonants.
>
> Did they just make this up? No, they followed what is laid out in the
> standard:
>
> On page 621 of Unicode 9.0.0
> (http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) you find:
>
> *Subscript Consonants.* Subscript consonant signs differ from independent
> consonant characters and are called coeng (literally, "foot, leg") after
> their subscript position. While a consonant character can constitute an
> orthographic syllable by itself, a subscript consonant sign cannot. Note
> that U+17A1 KHMER LETTER LA does not have a corresponding subscript
> consonant sign in standard Khmer.... Subscript consonant signs are used to
> represent any consonant following the first consonant in an orthographic
> syllable.
>
> and on page 624:
>
> .... each of these [subscript consonant] signs is represented by the
> sequence of two characters: a special control character (U+17D2 KHMER SIGN
> COENG) and a corresponding consonant character.
>
> That text fixes the order MAIN CONSONANT + COENG OPERATOR + SUBSCRIPT
> CONSONANT with sufficient clarity (as do all the examples and tables).
>
>
>  However, on further investigation,
> I cannot find any text that says that <U+1781, U+17C6, U+17D2, U+1789,
> U+17BB> would not be compliant with the Unicode standard.  Have I
> missed anything?
>
>
> In this example, your coeng operator U+17D2 is out of order: while it is
> followed by a consonant, it does not in turn immediately follow the main
> consonant, because a sign NIKAHIT is inserted in your example.
>
> Again, from the Root Zone LGR document we find an explicit rule:
>
> *7.10 Context of NIKAHIT SIGN (U+17C6)*
> The sign ◌ំ KHMER SIGN NIKAHIT (U+17C6) can only be preceded by a
> consonant or a shifter or one of the subset of dependent vowels tagged
> "dependent-vowel-1" in the repertoire table (ា ◌ុ), i.e. vowel signs AA and
> U.
>
> That would allow the NIKAHIT to be placed where you suggest, if it were
> not for the rule on the coeng operator (7.4).
>
> Now, it is a known fact that the label generation rules are slightly more
> restrictive than the rules for general text. (See also section 5 in that
> document).
>
> See the text on p. 622 in TUS 9.0.0 where the following *exception* is
> noted:
>
> "The subscript consonant signs in the Khmer script can be used to denote a
> final consonant, although this practice is uncommon."
>
> The associated example shows MAIN CONSONANT + VOWEL + NIKAHIT + COENG +
> FINAL CONSONANT
>
> Another exception that is noted on p. 623 is the following:
>
> "While these subscript consonant signs are usually attached to a consonant
> character, they can also be attached to an independent vowel character.
> Although this practice is relatively rare, it is used in one very common
> word, meaning 'to give.'"
>
> Taken together, it would appear that, unless your example fits the first
> of these two exceptions, the NIKAHIT in it is out of order.
>
> (The label generation rules disallow both of these exceptions,
> in an attempt to streamline the rules, sacrificing a number of potential
> domain names. Equivalent rule sets for validating text would have to be
> more complete).
>
> One might hope that the subsection about 'logical order' in TUS 9.0
> Section 2.2 Unicode Design Principles would help, but:
>
> 1) Section 3 'Conformance' says nothing about logical order; and
> 2) The subsection about 'logical order' seems to assume that there
> exists a common practice; it does not actually place any requirement
> on this common practice.
>
> Richard.
>
>
>
> I don't think either of these general sections is intended to provide the
> correct or expected ordering of characters for complex scripts. Any
> preferred ordering that doesn't result by happenstance from normalization
> would presumably be described in the text of the script section, such as
> Section 16.4 Khmer, in TUS 9.0.0.
>
> http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf
>
> A./
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170110/c14855d7/attachment.html>

From alastair at alastairs-place.net  Tue Jan 10 05:03:24 2017
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Tue, 10 Jan 2017 11:03:24 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
Message-ID: <A769184C-328B-49E9-A774-A91852E2DDE5@alastairs-place.net>

On 9 Jan 2017, at 22:34, Asmus Freytag <asmusf at ix.netcom.com> wrote:
> 
> On 1/9/2017 1:39 PM, Marcel Schneider wrote:
>> I'm saddened to have fallen into a monologue. Thus I'll quickly debrief 
>> the arguments opposed so far, to check whether I'm missing some point
>> 
> There's a good reason for that. You are advocating something that everyone else
> accepts as going against a settled principle of the standard,

That's part of it, but I think also that the thread is increasingly verbose and hard to follow.

I still think that the idea of adding U+???? SUPERSCRIPT and U+???? SUBSCRIPT might be worth contemplating; it would seem to provide a good answer to both Marcel's and the French standards body's concerns (wrt their proposed new ordinal indicator) while only using up two code points, and it'd be much easier to explain to people that superscripts and subscripts were a presentational matter and that they needed to talk to their font supplier.  Plus it would work with existing platform rendering engines provided a font with an appropriate OpenType GSUB table was available.

Does anyone besides Marcel have any input on that idea?  Is it worth writing a proposal to add SUPERSCRIPT and SUBSCRIPT?  To give some examples:

  S^{té}

  U+0053 LATIN CAPITAL LETTER S
  U+0074 LATIN SMALL LETTER T
  U+???? SUPERSCRIPT
  U+0065 LATIN SMALL LETTER E
  U+???? SUPERSCRIPT
  U+0301 COMBINING ACUTE ACCENT

  i_{j}

  U+0069 LATIN SMALL LETTER I
  U+006A LATIN SMALL LETTER J
  U+???? SUBSCRIPT

Perhaps the code points U+209E and U+209F could be used for SUBSCRIPT and SUPERSCRIPT respectively?
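
To make the examples concrete, here is a small sketch of how a converter might fall back from such marker sequences to characters that are already encoded. Everything in it is an assumption made for illustration: the two marker code points are just the unassigned positions suggested above, the function name is invented, and the fallback tables are deliberately tiny. It is not a proposal for how a rendering engine would actually handle the markers.

```python
# Hypothetical marker code points, taken from the suggestion above; both
# positions are unassigned and are used here purely for illustration.
SUBSCRIPT = "\u209e"
SUPERSCRIPT = "\u209f"

# Deliberately incomplete fallback tables to already-encoded characters.
SUPER_FALLBACK = {"t": "\u1d57", "e": "\u1d49"}  # MODIFIER LETTER SMALL T / E
SUB_FALLBACK = {"j": "\u2c7c"}                   # LATIN SUBSCRIPT SMALL LETTER J

def to_existing_characters(text: str) -> str:
    """Rewrite <base, marker> pairs using existing super/subscript characters."""
    out = []
    for ch in text:
        if ch in (SUPERSCRIPT, SUBSCRIPT) and out:
            table = SUPER_FALLBACK if ch == SUPERSCRIPT else SUB_FALLBACK
            # The marker applies to the preceding character; if no fallback
            # exists, that character is left as it is and the marker dropped.
            out[-1] = table.get(out[-1], out[-1])
        else:
            out.append(ch)
    return "".join(out)

ste = "St" + SUPERSCRIPT + "e" + SUPERSCRIPT + "\u0301"
print(to_existing_characters(ste))               # S, U+1D57, U+1D49, U+0301
print(to_existing_characters("ij" + SUBSCRIPT))  # i, U+2C7C
```

The combining acute accent simply stays attached to whatever precedes it, which matches the ordering shown in the first example.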

Are there other things that should be considered?

These are not supposed to be a replacement for rich text, which after all would allow nesting and indeed non-character data in subscripts and superscripts, but more as a way to avoid requests to add further superscript and subscript characters to Unicode itself and for limited use in "plain text"-only contexts (Twitter, for instance).

Kind regards,

Alastair.

--
http://alastairs-place.net


From frederic.grosshans at gmail.com  Tue Jan 10 08:40:39 2017
From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=)
Date: Tue, 10 Jan 2017 15:40:39 +0100
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <A769184C-328B-49E9-A774-A91852E2DDE5@alastairs-place.net>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <A769184C-328B-49E9-A774-A91852E2DDE5@alastairs-place.net>
Message-ID: <7b699c09-2f89-decc-caa4-9a62e8c9876f@gmail.com>

On 10/01/2017 at 12:03, Alastair Houghton wrote:
> That's part of it, but I think also that the thread is increasingly verbose and hard to follow.
>
> I still think that the idea of adding U+???? SUPERSCRIPT and U+???? SUBSCRIPT might be worth contemplating; it would seem to provide a good answer to both Marcel's and the French standards body's concerns (wrt their proposed new ordinal indicator) while only using up two code points, and it'd be much easier to explain to people that superscripts and subscripts were a presentational matter and that they needed to talk to their font supplier.  Plus it would work with existing platform rendering engines provided a font with an appropriate OpenType GSUB table was available.
>
> Does anyone besides Marcel have any input on that idea?  Is it worth writing a proposal to add SUPERSCRIPT and SUBSCRIPT?

No! Long story short: encoding the {super,sub}script characters one by 
one in Unicode is a choice that was made more than two decades ago, and 
it is much too late to change this, even if it were a good idea.

One of the major problems of such a proposal is that it would be 
incompatible (or ambiguous) with earlier versions of Unicode, since the 
same character, let's say "³", could be encoded in two different 
ways: SUPERSCRIPT + U+0033 DIGIT THREE vs the current U+00B3 
SUPERSCRIPT THREE, and such things are a big no-no. It was problematic 
with accented characters and led to the definition of NFC / NFD 
normalization with strict stability policies enforced since the 1990s.

If you managed to convince the Unicode committee that such an encoding 
would fit the plain-text model (good luck with that), without removing 
all the previously encoded superscript/modifier letters (it's 
forbidden), you would need to define what happens in the various 
normalization forms NFC / NFD, and probably introduce a new one (NFE, 
E for exponent), which would be, to say the least, a huge architectural 
change of the Unicode model, for a modest gain if any.
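
For readers who want to see the difference in practice, the following sketch (Python, standard unicodedata module) contrasts the accented-letter case, where two spellings are canonically equivalent, with U+00B3, which today only has a compatibility mapping. It shows current behaviour only; how a hypothetical SUPERSCRIPT marker would interact with normalization is exactly the open question raised above.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

# Canonical equivalence: NFC and NFD convert between the two spellings.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# U+00B3 SUPERSCRIPT THREE is left alone by the canonical forms ...
sup3 = "\u00b3"
assert unicodedata.normalize("NFC", sup3) == sup3
assert unicodedata.normalize("NFD", sup3) == sup3

# ... and only the compatibility forms map it to DIGIT THREE,
# at which point the superscript distinction is lost.
assert unicodedata.normalize("NFKC", sup3) == "3"
print("all normalization checks passed")
```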


From asmusf at ix.netcom.com  Tue Jan 10 10:59:37 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 10 Jan 2017 08:59:37 -0800
Subject: Specification of Encoding of Plain Text
In-Reply-To: <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
Message-ID: <d0de489d-60a3-d289-7a46-d5fc0a3998f3@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170110/da023f43/attachment.html>

From richard.wordingham at ntlworld.com  Tue Jan 10 13:40:13 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 10 Jan 2017 19:40:13 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
Message-ID: <20170110194013.0476f15f@JRWUBU2>

On Tue, 10 Jan 2017 10:11:41 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> What I really wish we had would be a machine readable set of regexes
> for each complex script (and for each language-script combination
> that is different than the default for that script).

What would the status of these regexes be?  For example, the Khmer
script already has a regex for words sensu stricto, but there doesn't
seem to be any formal requirement to conform to it or, more
immediately usefully to users, attempt to support it if one claims to
support Khmer.

I like the idea, but it seems to have a lot of nits, which I shall now
pick.

The regexes should also be human-comprehensible.

I'm dubious of the concept of each language-script combination
potentially having a regex, or indeed of the script having a *default*
regex.  Would this be used to do the equivalent of saying that English
doesn't have the letter thorn, or, for example, prohibiting most complex
onsets from Lao? 

> Such a regex R could be used for determining the well-formed ordering
> of code points within words. The regex need not be for syllables, or
> grapheme clusters, or any other formal construct. The *only*
> requirement it would need to fulfill is that you could determine
> well-formed words with:

> word := (R)+
 
> That is, if R were (C V C? | V C?) then any of CVC CVCVC VC V CV
> would pass the test, but CCV would fail. Ideally R would be as simple
> as possible (but no simpler).

Several Indian languages only allow independent vowels word initially.
You wouldn't be able to capture that with (R)+.
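
A toy illustration of that point, using invented symbols rather than real code points ("I" for an independent vowel, "C" for a consonant, "v" for a dependent vowel sign): any unit R that admits "I" can be repeated, so (R)+ also accepts "I" in the middle of a word, whereas a pattern with a distinguished first position does not. The pattern names are made up for this sketch.

```python
import re

# Toy symbols: "I" independent vowel, "C" consonant, "v" dependent vowel sign.
REPEATED = re.compile(r"(?:I|Cv?)+")              # word := (R)+ with R = I | C v?
INITIAL_ONLY = re.compile(r"(?:I|Cv?)(?:Cv?)*")   # "I" allowed in first position only

for sample in ["ICv", "CvCv", "CvI"]:
    print(sample,
          REPEATED.fullmatch(sample) is not None,
          INITIAL_ONLY.fullmatch(sample) is not None)
```

The last sample, with the independent vowel after a consonant, is accepted by the repeated form and rejected by the anchored one.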

Would the regexes be on strings or on traces (strings modulo canonical
equivalence)?  The language recognised by the regex for the Universal
Shaping Engine (USE) is notoriously not closed under canonical
equivalence.
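
The strings-versus-traces question can be shown with a Latin example. The sketch below (Python, standard re and unicodedata modules) uses a made-up pattern that insists on one particular mark order; the two inputs are canonically equivalent, yet only one of them matches. It says nothing about the actual USE grammar, which is not reproduced here.

```python
import re
import unicodedata

# A toy "well-formedness" pattern that insists on acute before dot below.
PATTERN = re.compile("a\u0301\u0323")

typed = "a\u0301\u0323"      # a + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
reordered = "a\u0323\u0301"  # the same marks in the opposite order

# The two strings are canonically equivalent (identical NFD form) ...
same_trace = (unicodedata.normalize("NFD", typed)
              == unicodedata.normalize("NFD", reordered))

# ... yet the pattern, applied to raw strings, accepts only one of them.
print(same_trace,
      PATTERN.fullmatch(typed) is not None,
      PATTERN.fullmatch(reordered) is not None)
```

One way around this is to normalize the input before matching and to write the pattern against the normalized order.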

Most non-spacing marks should not occur double - though I think the
most significant trouble with them is with fonts that won't then show
them double.  Barring them could make for a tricky regex.  But, if we
applied that to the Latin script, should we allow f̂̂ (the Fourier
transform of the Fourier transform of f) as a word?  Tibetan allows
some non-spacing marks to occur triple.
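
Outside a regex, spotting an immediately repeated non-spacing mark is straightforward. The helper below is a sketch with an invented name, using the General_Category value Mn from Python's unicodedata module; it only flags the repetition and leaves open whether a given script should allow it.

```python
import unicodedata

def has_repeated_mark(text: str) -> bool:
    """True if the same non-spacing mark (category Mn) occurs twice in a row."""
    return any(
        a == b and unicodedata.category(a) == "Mn"
        for a, b in zip(text, text[1:])
    )

print(has_repeated_mark("f\u0302\u0302"))  # f + two COMBINING CIRCUMFLEX ACCENTs
print(has_repeated_mark("f\u0302"))        # a single circumflex
```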

Richard.


From richard.wordingham at ntlworld.com  Tue Jan 10 14:44:30 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 10 Jan 2017 20:44:30 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
Message-ID: <20170110204430.6e580f72@JRWUBU2>

On Tue, 10 Jan 2017 00:06:05 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:
> On 1/9/2017 2:24 PM, Richard Wordingham wrote:

I'll take your last point first.

>> One might hope that the subsection about 'logical order' in TUS 9.0
>> Section 2.2 Unicode Design Principles would help, but:
 
>> 1) Section 3 'Conformance' says nothing about logical order; and
>> 2) The subsection about 'logical order' seems to assume that there
>> exists a common practice; it does not actually place any requirement
>> on this common practice.
 
> I don't think either of these general sections are intended to
> provide the correct or expected ordering of characters for complex
> scripts. Any preferred ordering that doesn't result by happenstance
> from normalization would presumably be described in the text of the
> script section, such as Section 16.4 Khmer, in TUS 9.0.0.

The key word here is 'preferred'.  Your reply, while not completely
clear, confirms my view that Unicode does not *specify* an overall
character ordering for Khmer, despite the section's having a BNF regexp
for Khmer syllables - B{R|C}{S{R}}*{{Z}V}{O}{S}.

>> For example, a naive reading of TUS 9.0 Section 16.4 Subsection
>> "Ordering of Syllable Components" would lead one to believe that the
>> word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
>> U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
>> SIGN U, U+17C6 KHMER SIGN NIKAHIT>.

> Richard,
> the group of Khmer experts that developed the recent label generation
> rules for root zone domain names considers that ordering the only one
> supported, a specification you find here:
> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf

But as you acknowledge, the specification only covers a strict subset of
legitimate Khmer script text, even of text composed of encoded Khmer
characters. It excludes some text given in TUS Section 16.4.  Indeed,
Section 7.4 of the proposal to ICANN even excludes the new spelling of
the word ឱ្យ (ooy, give) - <U+17B1 KHMER INDEPENDENT VOWEL QOO TYPE ONE,
U+17D2 KHMER SIGN COENG, U+1799 KHMER LETTER YO>, for U+17B1 is not a
consonant!

I have ignored the logical gaps in your reply; nothing in the *Unicode*
standard prohibits or deprecates the sequence <U+1781, U+17C6, U+17D2,
U+1789, U+17BB>, even though it does not satisfy the regexp I quoted
above.
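
For anyone who wants to check this mechanically, here is a sketch that transcribes the quoted pattern B{R|C}{S{R}}*{{Z}V}{O}{S} into a Python regular expression, reading the braces as "optional". The character classes are my own approximation of the letters in that pattern (base, robat, shifter, subscript, joiner, dependent vowel, other sign), built from the Khmer block ranges; they are an assumption for illustration and not a normative definition.

```python
import re

# Approximate character classes (assumptions, see the note above).
B = "[\u1780-\u17B3]"                  # consonants and independent vowels
R = "\u17CC"                           # KHMER SIGN ROBAT
C = "[\u17C9\u17CA]"                   # consonant shifters
S = "\u17D2[\u1780-\u17B3]"            # COENG + consonant or independent vowel
Z = "[\u200C\u200D]"                   # ZWNJ / ZWJ
V = "[\u17B6-\u17C5]"                  # dependent vowel signs
O = "[\u17C6-\u17C8\u17CB\u17CD-\u17D1\u17DD]"  # other signs, incl. NIKAHIT

SYLLABLE = re.compile(
    f"{B}(?:{R}|{C})?(?:{S}{R}?)*(?:{Z}?{V})?{O}?(?:{S})?"
)

def fits_pattern(text: str) -> bool:
    return SYLLABLE.fullmatch(text) is not None

khnyom = "\u1781\u17D2\u1789\u17BB\u17C6"      # KHA, COENG, NYO, U, NIKAHIT
khnyom_alt = "\u1781\u17C6\u17D2\u1789\u17BB"  # KHA, NIKAHIT, COENG, NYO, U
ooy = "\u17B1\u17D2\u1799"                     # QOO TYPE ONE, COENG, YO

for name, text in [("khnyom", khnyom), ("khnyom_alt", khnyom_alt), ("ooy", ooy)]:
    print(name, fits_pattern(text))
```

Under these assumptions the first and third sequences fit the pattern and the second does not, which is the situation described above: failing the pattern is a different thing from being prohibited by the standard.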

>> So, you are not alone in thinking that the COENG goes between
>> consonants.

I do not support the heresy that COENG may only occur between
consonants.

I do wonder if the Khmer Generation Panel opened their Pali grammars.
How would they propose to write the accusative singular of nouns in
-i?  The accusative singular of non-neuter nouns ends in -iṃ, which I
would expect to be written <U+17B7 KHMER VOWEL SIGN I, U+17C6 KHMER SIGN
NIKAHIT>, which is what I perceive at the end of a line in the middle
of the second left-hand page at
http://watkhemararatanaram.org/tipitaka/viney_beidok_05b.php .  Do they
expect one to use U+17B9 KHMER VOWEL SIGN Y?  (Thai scholars once had
to resort to such an expedient.)

Richard.


From richard.wordingham at ntlworld.com  Tue Jan 10 14:51:21 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 10 Jan 2017 20:51:21 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <A769184C-328B-49E9-A774-A91852E2DDE5@alastairs-place.net>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <A769184C-328B-49E9-A774-A91852E2DDE5@alastairs-place.net>
Message-ID: <20170110205121.57634d57@JRWUBU2>

On Tue, 10 Jan 2017 11:03:24 +0000
Alastair Houghton <alastair at alastairs-place.net> wrote:

> Does anyone besides Marcel have any input on that idea?  Is it worth
> writing a proposal to add SUPERSCRIPT and SUBSCRIPT?  To give some
> examples:
> 
>   S^{té}
> 
>   U+0053 LATIN CAPITAL LETTER S
>   U+0074 LATIN SMALL LETTER T
>   U+???? SUPERSCRIPT
>   U+0065 LATIN SMALL LETTER E
>   U+???? SUPERSCRIPT
>   U+0301 COMBINING ACUTE ACCENT
> 
>   i_{j}
> 
>   U+0069 LATIN SMALL LETTER I
>   U+006A LATIN SMALL LETTER J
>   U+???? SUBSCRIPT
> 
> Perhaps the code points U+209E and U+209F could be used for SUBSCRIPT
> and SUPERSCRIPT respectively?

I would suggest using a pair of variation selectors instead.  It's no
messier than ideographic compatibility characters, and I think it is
actually less messy.  However, I would further suggest creating the
variation sequences only when the corresponding superscript or subscript
form does not exist.

Richard.


From asmusf at ix.netcom.com  Tue Jan 10 15:12:47 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 10 Jan 2017 13:12:47 -0800
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170110204430.6e580f72@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <20170110204430.6e580f72@JRWUBU2>
Message-ID: <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com>

On 1/10/2017 12:44 PM, Richard Wordingham wrote:
> On Tue, 10 Jan 2017 00:06:05 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>> On 1/9/2017 2:24 PM, Richard Wordingham wrote:
> I'll take your last point first.
>
>>> One might hope that the subsection about 'logical order' in TUS 9.0
>>> Section 2.2 Unicode Design Principles would help, but:
>   
>>> 1) Section 3 'Conformance' says nothing about logical order; and
>>> 2) The subsection about 'logical order' seems to assume that there
>>> exists a common practice; it does not actually place any requirement
>>> on this common practice.
>   
>> I don't think either of these general sections are intended to
>> provide the correct or expected ordering of characters for complex
>> scripts. Any preferred ordering that doesn't result by happenstance
>> from normalization would presumably be described in the text of the
>> script section, such as Section 16.4 Khmer, in TUS 9.0.0.
> The key word here is 'preferred'.  Your reply, while not completely
> clear, confirms my view that Unicode does not *specify* an overall
> character ordering for Khmer, despite the section's having a BNF regexp
> for Khmer syllables - B{R|C}{S{R}}*{{Z}V}{O}{S}.

You are possibly misreading my use of the word "preferred".
>
>>> For example, a naive reading of TUS 9.0 Section 16.4 Subsection
>>> "Ordering of Syllable Components" would lead one to believe that the
>>> word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
>>> U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
>>> SIGN U, U+17C6 KHMER SIGN NIKAHIT>.
>> Richard,
>> the group of Khmer experts that developed the recent label generation
>> rules for root zone domain names considers that ordering the only one
>> supported,  a specification you find here:
>> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
> But as you acknowledge, the specification only covers a strict subset of
> legitimate Khmer script text, even of text composed of encoded Khmer
> characters.

The advantage of the text I brought to your attention is the way it is 
formalized and that it was created with local expertise. The 
disadvantage from your perspective is that the scope does not match with 
your intended use case.

> It excludes some text given in TUS Section 16.4.  Indeed,
> Section 7.4 of the proposal to ICANN even excludes the new spelling of
> the word ឱ្យ (ooy, give) - <U+17B1 KHMER INDEPENDENT VOWEL QOO TYPE ONE,
> U+17D2 KHMER SIGN COENG, U+1799 KHMER LETTER YO>, for U+17B1 is not a
> consonant!
>
> I have ignored the logical gaps in your reply; nothing in the *Unicode*
> standard prohibits or deprecates the sequence <U+1781, U+17C6, U+17D2,
> U+1789, U+17BB>, even though it does not satisfy the regexp I quoted
> above.
Unicode clearly doesn't forbid most sequences in complex scripts, even 
if they cannot be expected to render properly and otherwise would stump 
the native reader.

However, the descriptions are reasonably detailed to let you find out 
whether you are using characters as intended, or not.
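
To make the comparison concrete, here is a rough sketch that checks the
sequences discussed in this thread against the quoted pattern
B{R|C}{S{R}}*{{Z}V}{O}{S}, reading braces as "optional". The character
classes below are a simplified assumption of mine, not a transcription of
TUS Section 16.4, and the function name is hypothetical.

```python
import re

# Simplified, assumed character classes (not taken verbatim from TUS):
B = "[\u1780-\u17B3]"          # base: consonants and independent vowels
R = "\u17CC"                   # ROBAT
C = "[\u17C9\u17CA]"           # consonant shifters
S = "\u17D2" + B               # subscript: COENG followed by a base
Z = "[\u200C\u200D]"           # ZWNJ or ZWJ
V = "[\u17B6-\u17C5]"          # dependent vowel signs
O = "[\u17C6-\u17C8\u17CB\u17CD-\u17D1\u17DD]"  # other signs

# B{R|C}{S{R}}*{{Z}V}{O}{S} with {x} read as "x is optional"
SYLLABLE = re.compile(
    f"{B}(?:{R}|{C})?(?:{S}{R}?)*(?:{Z}?{V})?{O}?(?:{S})?"
)

def matches_pattern(text):
    """True if the whole string is one syllable under the pattern above."""
    return SYLLABLE.fullmatch(text) is not None

khnyom = "\u1781\u17D2\u1789\u17BB\u17C6"      # KHA COENG NYO U NIKAHIT
reordered = "\u1781\u17C6\u17D2\u1789\u17BB"   # KHA NIKAHIT COENG NYO U
ooy = "\u17B1\u17D2\u1799"                     # QOO TYPE ONE, COENG, YO

print(matches_pattern(khnyom), matches_pattern(reordered), matches_pattern(ooy))
```

Under these assumed classes the first and third sequences fit the pattern
and the reordered one does not, which says nothing about whether the
standard prohibits it.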
>
>>> So, you are not alone in thinking that the COENG goes between
>>> consonants.
> I do not support the heresy that COENG may only occur between
> consonants.
Remember, I gave you the scope for that. Your example is well taken, but 
from a different scope, where explicitly accounting for some other 
sequences is necessary. No disagreement.

A./
>
> I do wonder if the Khmer Generation Panel opened their Pali grammars.
> How would they propose to write the accusative singular of nouns in
> -i?  The accusative singular of non-neuter nouns ends in -iṃ, which I
> would expect to be written <U+17B7 KHMER VOWEL SIGN I, U+17C6 KHMER SIGN
> NIKAHIT>, which is what I perceive at the end of a line in the middle
> of the second left-hand page at
> http://watkhemararatanaram.org/tipitaka/viney_beidok_05b.php .  Do they
> expect one to use U+17B9 KHMER VOWEL SIGN Y?  (Thai scholars once had
> to resort to such an expedient.)
>
> Richard.
>
>


From richard.wordingham at ntlworld.com  Tue Jan 10 16:54:57 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 10 Jan 2017 22:54:57 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <20170110204430.6e580f72@JRWUBU2>
 <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com>
Message-ID: <20170110225457.56e581ca@JRWUBU2>

On Tue, 10 Jan 2017 13:12:47 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> Unicode clearly doesn't forbid most sequences in complex scripts,
> even if they cannot be expected to render properly and otherwise
> would stump the native reader.

Is this expectation based on sequence enforcement in the renderer?  The
main problem with getting text to render reasonably (not necessarily as
desired) is now anti-phishing.  The Unicode standard does define what
short sequences of characters mean.  The problem is that then, outside
the Apple world, it seems to be left to Microsoft to decide what longer
sequences they will allow.

> The advantage of the text I brought to your attention is the way it
> is formalized and that it was created with local expertise. The 
> disadvantage from your perspective is that the scope does not match
> with your intended use case.

Perhaps ICANN will be the industry-wide definer.  However, to stay with
Indic rendering, one may have cases where CVC and CCV orthographic
syllables have little to no visible difference.  The Khmer writing
system once made much greater use of CVC syllables.  For reproducing
older texts, one might be forced to encode phonetic CVC as though it
were CCV.

This is already the case, through error rather than design,
with the Thai script in Tai Tham.  This affects about 30% of the
Northern Thai lexicon*, and I believe even a higher proportion when
adjusted for word frequency. Now, to fight phishing, I have always
believed that some brutal folding would be required for Tai Tham, which
is why I suggested that the S.SA ligature be encoded (U+1A54 TAI THAM
LETTER GREAT SA).

*I've sampled the MFL dictionary.  I suspect a bias to untruncated forms
in loans from Pali, such as _kathina_ rather than _kathin_.  If my
suspicion is correct, the proportion would be even higher.

However, I believe there is some advantage in distinguishing CVC and
CCV at the code level, even where there is no visual difference.  To
display small visual differences, perhaps we will be forced to beg for
mark-up to make the distinction visible.

In Tai Tham, there are very few CCV-CVC visual homographs in native
words because of the phonological structure of Northern Thai, and one
can usually guess whether the word is CCV or CVC.  

Richard.

From asmusf at ix.netcom.com  Tue Jan 10 19:25:06 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 10 Jan 2017 17:25:06 -0800
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170110225457.56e581ca@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <20170110204430.6e580f72@JRWUBU2>
 <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com>
 <20170110225457.56e581ca@JRWUBU2>
Message-ID: <c2267711-6adb-cda5-8ab0-d33b98912f94@ix.netcom.com>

On 1/10/2017 2:54 PM, Richard Wordingham wrote:
> On Tue, 10 Jan 2017 13:12:47 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
>> Unicode clearly doesn't forbid most sequences in complex scripts,
>> even if they cannot be expected to render properly and otherwise
>> would stump the native reader.
> Is this expectation based on sequence enforcement in the renderer?  The
> main problem with getting text to render reasonably (not necessarily as
> desired) is now anti-phishing.

You mean anti-spoofing. There are many types of phishing attempts that 
do not rely on spoofing identifiers.

There are many different tacks that can be taken to make spoofing more 
difficult.

Among them, for critical identifiers:
1)  allow only a restricted repertoire
2)  disallow certain sequences
3) use a registry and
    3a) define sets of labels that overlap (variant sets)
    3b) restrict actual labels to be in disjoint sets
           (one label blocks all others in the same variant set)
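
As an illustration only, the three strategies can be sketched as a toy
registry. The repertoire, the forbidden-sequence rule and the variant set
below are invented for the example and bear no relation to any actual
Root Zone rules; all names are hypothetical.

```python
import re

ALLOWED = set("abcdefghijklmnopqrstuvwxyz\u0430")  # hypothetical repertoire (Latin a-z, Cyrillic a)
FORBIDDEN = re.compile(r"(.)\1\1")                 # hypothetical rule: no character three times in a row
VARIANT_SETS = [{"apple", "\u0430pple"}]           # hypothetical variant set (Latin a / Cyrillic a)

class LabelRegistry:
    def __init__(self, variant_sets):
        self._set_of = {}   # label -> the variant set it belongs to
        for labels in variant_sets:
            group = frozenset(labels)
            for label in group:
                if label in self._set_of:
                    raise ValueError("variant sets must be disjoint")
                self._set_of[label] = group
        self._taken = {}    # variant set -> label registered from it

    def register(self, label):
        if not set(label) <= ALLOWED:             # 1) restricted repertoire
            return "rejected: repertoire"
        if FORBIDDEN.search(label):               # 2) disallowed sequences
            return "rejected: sequence"
        group = self._set_of.get(label, frozenset([label]))
        if group in self._taken:                  # 3b) one label blocks the set
            return "blocked by " + self._taken[group]
        self._taken[group] = label
        return "registered"

registry = LabelRegistry(VARIANT_SETS)
for candidate in ["apple", "\u0430pple", "app1e", "aaab", "pear"]:
    print(repr(candidate), registry.register(candidate))
```

A label outside every variant set is treated as a set of its own, so the
same blocking logic covers both cases.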

The ICANN work on creating label generation rules attempts to implement
these strategies (currently for 28 scripts in the Root Zone of the DNS). 
The work on the first half dozen scripts is basically completed.

> The Unicode standard does define what
> short sequences of characters mean.  The problem is that then, outside
> the Apple world, it seems to be left to Microsoft to decide what longer
> sequences they will allow.

MS and Apple are not the only ones writing renderers.
>
>> The advantage of the text I brought to your attention is the way it
>> is formalized and that it was created with local expertise. The
>> disadvantage from your perspective is that the scope does not match
>> with your intended use case.
> Perhaps ICANN will be the industry-wide definer.  However, to stay with
> Indic rendering, one may have cases where CVC and CCV orthographic
> syllables have little to no visible difference.  The Khmer writing
> system once made much greater use of CVC syllables.  For reproducing
> older texts, one might be forced to encode phonetic CVC as though it
> were CCV.

The restrictions on sequences appropriate as an anti-spoofing measure are 
not appropriate for general encoded text! For one, the Root Zone 
explicitly disallows anything that's not in "widespread everyday" use. 
This covers most transcriptions of "historic" texts, as well as 
religious or technical (phonetic) notations and transcriptions.

But restriction of repertoire and sequences goes only so far. You will 
always have a residual set of labels that overlap to a degree that users 
do not reliably distinguish them. (Actually many disjoint sets of 
overlapping labels). The hard core of these are labels that appear 
(practically) identical. There's a further aura of more or less confusables.

Mathematically these two behave differently: a set of (practically) 
identical labels is symmetric and transitive, while a set of merely 
similar labels may be symmetric, but is not transitive. If A is 
equivalent to B and B to C then A is equivalent to C (transitivity). 
However, for merely similar labels there's a non-zero "similarity 
distance", if you will. If you try to chain similarity together via 
transitivity then you might exceed a similarity threshold and your end 
points (e.g. A and C above) may both be similar to B but not 
(sufficiently) to each other.
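
A small sketch of the difference, with Hamming distance standing in for a
real visual-similarity measure and a two-entry folding table standing in
for a real equivalence. Both are assumptions chosen only to keep the
example short.

```python
def distance(a, b):
    """Hamming distance between equal-length labels (a stand-in metric)."""
    if len(a) != len(b):
        raise ValueError("labels must have equal length")
    return sum(x != y for x, y in zip(a, b))

THRESHOLD = 1  # hypothetical similarity threshold

def similar(a, b):
    return distance(a, b) <= THRESHOLD

# Hypothetical folding: Cyrillic a and o are treated as identical to Latin a and o.
FOLD = str.maketrans({"\u0430": "a", "\u043e": "o"})

def equivalent(a, b):
    return a.translate(FOLD) == b.translate(FOLD)

# Similarity is symmetric but does not chain:
print(similar("cat", "cot"), similar("cot", "cop"), similar("cat", "cop"))

# Equivalence defined by a common folded form does chain:
x, y, z = "cot", "c\u043et", "cot".replace("o", "\u043e")
print(equivalent(x, y), equivalent(y, z), equivalent(x, z))
```

Because equivalence is defined here through a shared folded form, it is
transitive by construction; the thresholded distance is not.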

The project I'm involved in tackles only transitive forms of equivalence 
(whether visual or semantic).

Collisions based on these equivalences can be handled with label 
generation rulesets defined per RFC 7940, which allow registration 
policies that are automated.

The further "halo" of "merely" similar labels needs to be handled with 
additional technology that can handle concepts like similarity distance.

 From a Unicode perspective, there's a virtue in not over specifying 
sequences, because you don't want to be caught having to re-encode 
entire scripts should the conventions for the use of the elements making 
up the script change in an orthography reform!

That does not mean that Unicode (at all times) endorses all permutations 
of free-form sequences as equally valid.

A./
>
> This is already the case, through error rather than design,
> with the Thai script in Tai Tham.  This affects about 30% of the
> Northern Thai lexicon*, and I believe even a higher proportion when
> adjusted for word frequency. Now, to fight phishing, I have always
> believed that some brutal folding would be required for Tai Tham, which
> is why I suggested that the S.SA ligature be encoded (U+1A54 TAI THAM
> LETTER GREAT SA).
>
> *I've sampled the MFL dictionary.  I suspect a bias to untruncated forms
> in loans from Pali, such as _kathina_ rather than _kathin_.  If my
> suspicion is correct, the proportion would be even higher.
>
> However, I believe there is some advantage in distinguishing CVC and
> CCV at the code level, even where there is no visual difference.  To
> display small visual differences, perhaps we will be forced to beg for
> mark-up to make the distinction visible.
>
> In Tai Tham, there are very few CCV-CVC visual homographs in native
> words because of the phonological structure of Northern Thai, and one
> can usually guess whether the word is CCV or CVC.
>
> Richard.
>


From charupdate at orange.fr  Wed Jan 11 00:00:52 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 11 Jan 2017 07:00:52 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
Message-ID: <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>

On Mon, 9 Jan 2017 14:34:17 -0800, Asmus Freytag wrote:
[…]
> Just get over it […]

We have been facing strong user demand since the early standards. 
Actually I cannot. Sorry. 

Thank you however for all of your feedback.

On Tue, 10 Jan 2017 11:03:24 +0000, Alastair Houghton wrote:
[…]
> […] I think also that the thread is increasingly verbose and hard to follow. 

It’s very hard for me too. But I’ll try to be concise. 

Thank you for getting involved in the issue.

> […] for limited use in “plain text”-only contexts (Twitter, for instance).

The phenomenon isn’t actually limited to plain text environments. See:

http://stackoverflow.com/questions/13878772/how-to-display-classic-fractions-in-css-javascript
| You can also use the straight unicode approach to render ¹⁹⁄₄₅:
| 
| &#x00B9;&#x2079;&#x2044;&#x2084;&#x2085;
| 
| (See the wikipedia article.)
https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts

On Tue, 10 Jan 2017 20:51:21 +0000, Richard Wordingham wrote:
[?]
> I would suggest using a pair of variation selectors instead. It's no 
> messier than ideographic compatibility characters, and I think it is 
> actually less messy. However, I would further suggest creating the 
> variation sequences only when the corresponding superscript or subscript 
> form does not exist. 

This clearly advocates the current use of the superscript and subscript forms.

Thank you for considering the issue.

Thanks to all who responded in these threads.

Converting preformatted to TeX formatted:
My conversion macro was too simplistic, it was made up hastily, sorry. 
An improved version (perhaps overkill now) is attached below.
I’d have liked to port it to Vim, too. 
The macros for productivity suites are sadly still missing.

Kind regards,

Marcel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: P-to-F_for_Notepad++.xml
Type: text/xml
Size: 91768 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20170111/e61c6366/attachment.xml>

From richard.wordingham at ntlworld.com  Wed Jan 11 02:32:12 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 11 Jan 2017 08:32:12 +0000
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
Message-ID: <20170111083212.476f492e@JRWUBU2>

On Wed, 11 Jan 2017 07:00:52 +0100 (CET)
Marcel Schneider <charupdate at orange.fr> wrote:

> The phenomenon isn’t actually limited to plain text environments. See:
> 
> http://stackoverflow.com/questions/13878772/how-to-display-classic-fractions-in-css-javascript
> | You can also use the straight unicode approach to render ¹⁹⁄₄₅:
> | 
> | &#x00B9;&#x2079;&#x2044;&#x2084;&#x2085;
> | 
> | (See the wikipedia article.)
> https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts

If you follow the link from that page to
https://en.wikipedia.org/wiki/Subscript_and_superscript , you will
notice an immediate issue with the position of the subscripts.  This is
why the use of explicitly coded subscript and superscript digits for
vulgar fractions is not recommended.  Rather, one needs to hope that
the font one is using supports U+2044 FRACTION SLASH.  As not all fonts
support all superscript and subscript digits, text using them may
render badly, whereas U+2044 itself will usually be rendered at least
tolerably even if the glyph comes from a different font to the digits.

The truly straight Unicode approach in HTML is to use 19&#x2044;45.
Just entering those 5 characters into a text entry box in Firefox gave
me a properly formatted vulgar fraction.  That is how vulgar fractions
are supposed to work.  Unfortunately, one may need to avoid 'exciting
new fonts' in favour of those with a large, working repertoire. 
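
For comparison, a short sketch of the two encodings side by side, using
only the Python standard library. The variable names are mine; the two
strings are the ones quoted above.

```python
import html
import unicodedata

# The preformatted approach quoted from Stack Overflow:
preformatted = html.unescape("&#x00B9;&#x2079;&#x2044;&#x2084;&#x2085;")

# Ordinary digits around U+2044 FRACTION SLASH:
plain = html.unescape("19&#x2044;45")

for label, text in (("preformatted", preformatted), ("plain", plain)):
    print(label)
    for ch in text:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")

# Compatibility normalization folds the superscript and subscript digits
# back to ordinary digits, leaving the fraction slash untouched.
print(unicodedata.normalize("NFKC", preformatted) == plain)
```

The plain form leaves the fraction layout to the font and renderer, which
is why its appearance depends on support for U+2044.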

Richard.


From charupdate at orange.fr  Wed Jan 11 08:20:21 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 11 Jan 2017 15:20:21 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170111083212.476f492e@JRWUBU2>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
Message-ID: <841195008.11765.1484144421611.JavaMail.www@wwinf1p08>

On Wed, 11 Jan 2017 08:32:12 +0000, Richard Wordingham wrote:
> 
> On Wed, 11 Jan 2017 07:00:52 +0100 (CET) 
> Marcel Schneider  wrote: 
> 
> > The phenomenon isn’t actually limited to plain text environments. See: 
> > 
> > http://stackoverflow.com/questions/13878772/how-to-display-classic-fractions-in-css-javascript 
> > | You can also use the straight unicode approach to render ¹⁹⁄₄₅: 
> > | 
> > | &#x00B9;&#x2079;&#x2044;&#x2084;&#x2085; 
> > | 
> > | (See the wikipedia article.) 
> > https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts 
> 
> If you follow the link from that page to 
> https://en.wikipedia.org/wiki/Subscript_and_superscript , you will 
> notice an immediate issue with the position of the subscripts. This is 
> why the use of explicitly coded subscript and superscript digits for 
> vulgar fractions is not recommended. Rather, one needs to hope that 
> the font one is using supports U+2044 FRACTION SLASH. As not all fonts 
> support all superscript and subscript digits, text using them may 
> render badly, whereas U+2044 itself will usually be rendered at least 
> tolerably even if the glyph comes from a different font to the digits. 
> 
> The truly straight Unicode approach in HTML is to use 19&#x2044;45. 
> Just entering those 5 characters into a text entry box in Firefox gave 
> me a properly formatted vulgar fraction. That is how vulgar fractions 
> are supposed to work. Unfortunately, one may need to avoid 'exciting 
> new fonts' in favour of those with a large, working repertoire. 

Thank you for these hints! Too bad I had not checked this. I’m glad to see 
that browsers and some fonts already support the standard way of writing 
custom fractions in plain text, with correct glyph substitution. 
I’ve added this info on the fly to the following two articles:

https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Uses
https://en.wikipedia.org/wiki/Slash_(punctuation)#Fractions

Hence, one part of the issue is solved. 

As for the main part, the use of modifier letters as ordinal indicators and 
possibly in (other) abbreviations, the user demand is reflected in the article 
that you cited: For Galician, Italian, Portuguese and Spanish, the preformatted 
ordinal indicators are used, while for French, formatting is applied:
https://en.wikipedia.org/wiki/Subscript_and_superscript#Unicode
If this use of formatting were straightforward, it would be constant. In practice, 
by contrast, it turns out to be a fallback, for (supposed) lack of the 
appropriate preformatted letters. 

I also note that the font used from the body font stack is outdated, since it 
doesn’t render the fractions properly (whereas the monospaced font in the editing
dialog does, as in the Unicode contact form that I tried first). A frequent 
idea is then to use preformatted digits to imitate proper rendering, all the more 
so as, according to Wikipedia, this is so popular that current fonts have even 
repurposed the super/sub scripts. Arial Unicode MS would thus enter this category. 
As you point out, the straightforward action is to use a font with full support 
of U+2044, and thus to override the default font with e.g. Cambria.

It’s good to know about these issues, and this will help me a lot when writing up 
the documentation.

Best regards,

Marcel


From kenwhistler at att.net  Wed Jan 11 13:37:47 2017
From: kenwhistler at att.net (Ken Whistler)
Date: Wed, 11 Jan 2017 11:37:47 -0800
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <CAF5KyEyWvjzE2G6B_sForoVHnxKJO49trH5V7KMh0cuAz2xB8w@mail.gmail.com>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
 <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
 <CAGa7JC2tZ_i_c6HELi45g-athN0gJgX+VeB8DtAr4qfdeQgC5A@mail.gmail.com>
 <CAJKta0xaP4+2kDjFZB9EiYQh0XBKe7C2xyzup0a4SwQJQ1EtXA@mail.gmail.com>
 <CAF5KyEyWvjzE2G6B_sForoVHnxKJO49trH5V7KMh0cuAz2xB8w@mail.gmail.com>
Message-ID: <4ab7d0ef-435a-964b-b3d3-71b328320997@att.net>

This is a character under ballot for Amendment 1 to the 5th edition. It 
isn't part of the repertoire planned for publication as part of Unicode 
10.0 in June.

So if you want to have any impact on the subhead used in the charts for 
A7AF, the correct mechanism now is to get a national body comment added 
in their vote on Amendment 1.

Either that, or just put a tickler in your calendar for February, 
201*8*, when the beta review for Unicode *11* will be starting, so you 
can then make a suggestion as part of the Unicode beta review period.

Otherwise, these suggestions are just going to end up lost under the 
pile of the subsequent 13 months' worth of email on unrelated topics. ;-)

--Ken


On 12/27/2016 8:44 PM, Yifán Wáng wrote:
> Now I start to wonder if the description would be "Letter for
> phonetics and Japanese phonology" or "Letter for scholarly
> transcription" etc.
>
> 2016-12-27 18:54 GMT+09:00 Denis Jacquerye <moyogo at gmail.com>:
>> For what it’s worth, the small capital q was used as an IPA symbol for a
>> while. It was used for the Arabic ʿayn as a “consonne roulée gutturale” in
>> the 1898 IPA chart (previously noted 3 in the 1894 IPA charts and ? in some
>> 1895 IPA charts and later charts) then as a “consonne fricative bronchiale
>> sonore” in the 1905 and 1908 IPA charts, and in the notes after the IPA
>> chart in 1912. It was eventually replaced with the reversed glottal stop ʕ,
>> for example in the 1932 IPA chart or later charts.
>


From fabiang at radgametools.com  Wed Jan 11 15:56:26 2017
From: fabiang at radgametools.com (Fabian Giesen)
Date: Wed, 11 Jan 2017 13:56:26 -0800
Subject: UAX #9 (Bidirectional algorithm) reference implementations
In-Reply-To: <2b433250-fba1-7d7f-1785-694ef25f96bf@att.net>
References: <8f858db3-1f9e-e969-5dad-6aa26f0a577e@radgametools.com>
 <2b433250-fba1-7d7f-1785-694ef25f96bf@att.net>
Message-ID: <8617b5a8-579a-3437-de29-b5fd47e0ca3c@radgametools.com>

On 12/9/2016 7:04 AM, Ken Whistler wrote:
> About the bug you note in BidiReferenceC, I'll investigate.

Any news on this?

Thanks,

-Fabian

From richard.wordingham at ntlworld.com  Wed Jan 11 20:56:19 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 12 Jan 2017 02:56:19 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <c2267711-6adb-cda5-8ab0-d33b98912f94@ix.netcom.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <20170110204430.6e580f72@JRWUBU2>
 <7c945443-1d67-e4df-c475-b5ba3b5bc342@ix.netcom.com>
 <20170110225457.56e581ca@JRWUBU2>
 <c2267711-6adb-cda5-8ab0-d33b98912f94@ix.netcom.com>
Message-ID: <20170112025619.6b0bc28d@JRWUBU2>

On Tue, 10 Jan 2017 17:25:06 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> On 1/10/2017 2:54 PM, Richard Wordingham wrote:

> There are many different tacks that can be taken to make spoofing
> more difficult.
> 
> Among them, for critical identifiers:
> 1)  allow only a restricted repertoire
> 2)  disallow certain sequences
> 3) use a registry and
>     3a) define sets of labels that overlap (variant sets)
>     3b) restrict actual labels to be in disjoint sets
>            (one label blocks all others in the same variant set)
> 
> The ICANN work on creating label generation rules attempts to
> implement these strategies (currently for 28 scripts in the Root Zone
> of the DNS). The
> work on the first half dozen scripts is basically completed.
> 
> > The Unicode standard does define what
> > short sequences of characters mean.  The problem is that then,
> > outside the Apple world, it seems to be left to Microsoft to decide
> > what longer sequences they will allow.  
> 
> MS and Apple are not the only ones writing renderers.

HarfBuzz OpenType rendering tries to follow MS.  That includes dotted
circles.  However, it will challenge the MS lead when it is blatantly
wrong.  In particular, it has a policy of rendering canonically
equivalent text the same, though that is a challenge when emulating USE.

So far as I am aware, M17n is not in wide use.  It's tolerant, but
one's text won't go far if it relies on M17n.

Text can travel with a graphite font, but that is limiting.  Sooner or
later, one will want most text to work with different fonts.

I'm having trouble digging up hard facts about InDesign's rendering, so
I don't know how willing it is to be different to Microsoft's.

> > Perhaps ICANN will be the industry-wide definer.  However, to stay
> > with Indic rendering, one may have cases where CVC and CCV
> > orthographic syllables have little to no visible difference.  The
> > Khmer writing system once made much greater use of CVC syllables.
> > For reproducing older texts, one might be forced to encode phonetic
> > CVC as though it were CCV.  

> The restriction on sequences appropriate as an anti-spoofing measure
> are not appropriate on general encoded text!

So ICANN will at best serve to indicate sequences that should be
renderable.

> The project I'm involved in tackles only transitive forms of
> equivalence (whether visual or semantic).

> Collisions based on these equivalences can be handled with label 
> generation rulesets defined per RFC 7940, which allow registration 
> policies that are automated.

> The further "halo" of "merely" similar labels needs to be handled
> with additional technology that can handle concepts like similarity
> distance.

'Merely' similar CCV and CVC tend to differ when the vowel is
above the consonant and the subscript consonant is spacing, e.g. because
it rises to the hanging baseline. The difference, which is in vowel
placement, is comparable to the variation within one person's
handwriting.  However, the difference in mean position seems to be
statistically significant.  The inequivalence issue starts to arise with
spacing vowels, which is when one may find marks being applied to
syllables rather than to individual glyphs.

>  From a Unicode perspective, there's a virtue in not over specifying 
> sequences, because you don't want to be caught having to re-encode 
> entire scripts should the conventions for the use of the elements
> making up the script change in an orthography reform!

This seems to run counter to Mark's idea of regexes defining scripts'
words.
 
> That does not mean that Unicode (at all times) endorses all
> permutations of free-form sequences as equally valid.

Just as well, as such freedom runs counter to the principle of avoiding
inequivalent encodings of the same thing.

Richard.

From duerst at it.aoyama.ac.jp  Wed Jan 11 21:24:29 2017
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Thu, 12 Jan 2017 12:24:29 +0900
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170111083212.476f492e@JRWUBU2>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
Message-ID: <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>

On 2017/01/11 17:32, Richard Wordingham wrote:

> The truly straight Unicode approach in HTML is to use 19&#x2044;45.
> Just entering those 5 characters into a text entry box in Firefox gave
> me a properly formatted vulgar fraction.  That is how vulgar fractions
> are supposed to work.  Unfortunately, one may need to avoid 'exciting
> new fonts' in favour of those with a large, working repertoire.

Just for the record: The vulgar fraction display also happened in 
Thunderbird (on Windows). Firefox and Thunderbird use the same display 
engine. I have switched HTML display off, because I prefer to read all 
my mail in plain text, but it still worked.

Regards,   Martin.

From 747.neutron at gmail.com  Wed Jan 11 23:39:49 2017
From: 747.neutron at gmail.com (=?UTF-8?B?WWlmw6FuIFfDoW5n?=)
Date: Thu, 12 Jan 2017 14:39:49 +0900
Subject: On the upcoming LATIN LETTER SMALL CAPITAL Q
In-Reply-To: <4ab7d0ef-435a-964b-b3d3-71b328320997@att.net>
References: <CAF5KyEwN-AC3_gdhBBsG+fihNJOYr+hXFbMGSdBiJ0UsnUnH7w@mail.gmail.com>
 <CAJ6uix5xcc=SuKpJ+o5hUcyx5Hg-08Y+aBWUy4QXiMyU9=zAMA@mail.gmail.com>
 <CAF5KyEzrD6sJuDEZaPQei4LiwiuP3a=5VGdKMfBoKV5QFOyM+g@mail.gmail.com>
 <CAJ6uix6ScNN86RuWk0gfvRLoUZWMr1hSZRDc17eBUsv0N+=XDQ@mail.gmail.com>
 <CAJ6uix7aj-uSOSJgzAaD1wyOkv3hXYNLYpf=cPMCsZCK-YG8KQ@mail.gmail.com>
 <CAGa7JC2tZ_i_c6HELi45g-athN0gJgX+VeB8DtAr4qfdeQgC5A@mail.gmail.com>
 <CAJKta0xaP4+2kDjFZB9EiYQh0XBKe7C2xyzup0a4SwQJQ1EtXA@mail.gmail.com>
 <CAF5KyEyWvjzE2G6B_sForoVHnxKJO49trH5V7KMh0cuAz2xB8w@mail.gmail.com>
 <4ab7d0ef-435a-964b-b3d3-71b328320997@att.net>
Message-ID: <CAF5KyEy-RqfNNdVSaEJtH0_Q-2DTbLDQ35+m44=wo0_D1zEEXg@mail.gmail.com>

> This is a character under ballot for Amendment 1 to the 5th edition. It
> isn't part of the repertoire planned for publication as part of Unicode 10.0
> in June.

I see. Thank you for the information.
I'll remember it until Unicode 11's term.


2017-01-12 4:37 GMT+09:00 Ken Whistler <kenwhistler at att.net>:
> This is a character under ballot for Amendment 1 to the 5th edition. It
> isn't part of the repertoire planned for publication as part of Unicode 10.0
> in June.
>
> So if you want to have any impact on the subhead used in the charts for
> A7AF, the correct mechanism now is to get a national body comment added in
> their vote on Amendment 1.
>
> Either that, or just put in tickler in your calendar for February, 201*8*,
> when the beta review for Unicode *11* will be starting, so you can then make
> a suggestion as part of the Unicode beta review period.
>
> Otherwise, these suggestions are just going to end up lost under the pile of
> the subsequent 13 months worth of email on unrelated topics. ;-)
>
> --Ken
>
>
>
> On 12/27/2016 8:44 PM, Yifán Wáng wrote:
>>
>> Now I start to wonder if the description would be "Letter for
>> phonetics and Japanese phonology" or "Letter for scholarly
>> transcription" etc.
>>
>> 2016-12-27 18:54 GMT+09:00 Denis Jacquerye <moyogo at gmail.com>:
>>>
>>> For what it’s worth, the small capital q was used as an IPA symbol for a
>>> while. It was used for the Arabic ʿayn as a “consonne roulée gutturale”
>>> in the 1898 IPA chart (previously noted 3 in the 1894 IPA charts and ? in
>>> some 1895 IPA charts and later charts) then as a “consonne fricative
>>> bronchiale sonore” in the 1905 and 1908 IPA charts, and in the notes after
>>> the IPA chart in 1912. It was eventually replaced with the reversed glottal
>>> stop ʕ, for example in the 1932 IPA chart or later charts.
>>
>>
>


From khaledhosny at eglug.org  Thu Jan 12 00:35:24 2017
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Thu, 12 Jan 2017 08:35:24 +0200
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
References: <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
Message-ID: <20170112063524.GF14923@macbook>

On Thu, Jan 12, 2017 at 12:24:29PM +0900, Martin J. Dürst wrote:
> On 2017/01/11 17:32, Richard Wordingham wrote:
> 
> > The truly straight Unicode approach in HTML is to use 19&#x2044;45.
> > Just entering those 5 characters into a text entry box in Firefox gave
> > me a properly formatted vulgar fraction.  That is how vulgar fractions
> > are supposed to work.  Unfortunately, one may need to avoid 'exciting
> > new fonts' in favour of those with a large, working repertoire.
> 
> Just for the record: The vulgar fraction display also happened in
> Thunderbird (on Windows). Firefox and Thunderbird use the same display
> engine. I have switched HTML display off, because I prefer to read all my
> mail in plain text, but it still worked.

This is done by HarfBuzz which automatically activates OpenType
frac/dnom/numr features for <digits><fraction slash><digits> sequences,
so if the font has the features one gets vulgar fractions out of the box.
This works in Chrome as well since it uses HarfBuzz (older versions of
Chrome didn’t enable HarfBuzz by default for Latin so the fractions
might not show there).
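
As a rough illustration of the pattern described above, the sequences in
question can be located in plain text with a simple regular expression. This
is only a sketch and not HarfBuzz's actual code: the function name
find_fraction_sequences and the restriction to ASCII digits are assumptions
made for the example.

```python
import re

FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH

# Assumption: ASCII digits on both sides of the fraction slash. A real
# shaping engine may accept a wider set of digits.
_FRACTION = re.compile(r"([0-9]+)" + FRACTION_SLASH + r"([0-9]+)")


def find_fraction_sequences(text):
    """Return (numerator, denominator) pairs for each digits/U+2044/digits run."""
    return _FRACTION.findall(text)


# The ordinary solidus in "1/2" is not U+2044, so it is not reported.
print(find_fraction_sequences("19\u204445 of the total, not 1/2"))
```

Whether such a run is then displayed as a vulgar fraction still depends on the
font providing the relevant features, as noted above.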

Regards,
Khaled

From mark at macchiato.com  Thu Jan 12 07:12:09 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 12 Jan 2017 14:12:09 +0100
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170110194013.0476f15f@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
Message-ID: <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>

On Tue, Jan 10, 2017 at 8:40 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Tue, 10 Jan 2017 10:11:41 +0100
> Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> > What I really wish we had would be a machine readable set of regexes
> > for each complex script (and for each language-script combination
> > that is different than the default for that script).
>
> What would the status of these regexes be?  For example, the Khmer
> script already has a regex for words sensu stricto, but there doesn't
> seem to be any formal requirement to conform to it or, more
> immediately usefully to users, attempt to support it if one claims to
> support Khmer.
>

I think the goal would be to provide guidance on the preferred
ordering/choice of code points for representing a particular visual order
of glyphs. That is, help to guide the usage of characters in complex
scripts.

The target wouldn't even be all scripts, but rather complex ones, where it
may not be simple to determine the ordering of code points.

And as Asmus said, the goal would be sufficiently "detailed to let you find
out whether you are using characters as intended, or not"


> I like the idea, but it seems to have a lot of nits, which I shall now
> pick.
>

I'm sure there are plenty; those are just an opener.

>
> The regexes should also be human-comprehensible.
>
I agree that comprehension is a goal. I'd imagine using a BNF regex, like
the following. This is simple, since I'm just doing Latin, but you can see
what I mean.

word = base* ;
base = (latinLetter latinMn*) ;
latinLetter = [[:scx=Latn:]&[:L:]] ;
latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;

which turns into the single regex expression:

([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*

See:
http://unicode.org/cldr/utility/bnf.jsp?a=word=base*;%0Dbase=(latinLetter+latinMn*);%0DlatinLetter=[[:scx=Latn:]%26[:L:]];%0DlatinMn=[[:scx=Latn:][:scx=Common:]%26[:Mn:]]
;
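
The step from the four rules to the single expression is plain textual
substitution, which can be sketched as follows. This is an illustration only,
not the code behind the bnf.jsp utility: the names RULES and expand are
hypothetical, and the sketch assumes the rules are not recursive.

```python
import re

# The four rules from the message, keyed by rule name.
RULES = {
    "word": "base*",
    "base": "(latinLetter latinMn*)",
    "latinLetter": "[[:scx=Latn:]&[:L:]]",
    "latinMn": "[[:scx=Latn:][:scx=Common:]&[:Mn:]]",
}


def expand(start, rules):
    """Replace rule names by their bodies until none remain.

    Assumes no rule refers to itself, directly or indirectly; otherwise the
    loop would never terminate.
    """
    names = re.compile(r"\b(" + "|".join(map(re.escape, rules)) + r")\b")
    pattern = rules[start]
    while names.search(pattern):
        pattern = names.sub(lambda m: rules[m.group(1)], pattern)
    # Whitespace in the BNF only separates tokens, so drop it at the end.
    return pattern.replace(" ", "")


print(expand("word", RULES))
```

The printed result uses UnicodeSet notation, so it is meant for an engine that
understands [:scx=...:] properties (such as ICU), not for Python's own re
module.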

A more complex script might have:

word = prefix base* postfix ;
...

One could draw on the work done in Harfbuzz and the Universal Shaping
Engine to push this along for different scripts.


> I'm dubious of the concept of each language-script combination
> potentially having a regex,


I think a language-script combination is only useful if it must vary from
the default for the script.


> or indeed of the script having a *default*
> regex.


> Would this be used to do the equivalent of saying that English
> doesn't have the letter thorn, or, for example, prohibiting most complex
> onsets from Lao?
>

And for those scripts, the goal would be to represent the core functioning
of the script. So it could be broader than what is needed for any
particular language using that script.


> > Such a regex R could be used for determining the well-formed ordering
> > of code points within words. The regex need not be for syllables, or
> > grapheme clusters, or any other formal construct. The *only*
> > requirement it would need to fulfill is that you could determine
> > well-formed words with:
>
> > word := (R)+?


> > That is, if R were (C V C? | V C?) then any of CVC CVCVC VC V CV
> > would pass the text, but CCV would fail. Ideally R would be as simple
> > as possible (but no simpler).
>
> Several Indian languages only allow independent vowels word initially.
> You wouldn't be able to capture that with (R)+.
>

That was a typo, should have been just R (which could have more complex
internal structure with repetition, as above).
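
The quoted C/V example can be checked mechanically. The sketch below treats
"C" and "V" as literal symbols and compares the originally posted
word := (R)+ with the corrected word := R; the function names are
hypothetical.

```python
import re

# R = (C V C? | V C?), written over the symbolic alphabet {C, V}.
R = r"(?:CVC?|VC?)"


def is_word_repeated(s):
    """word := (R)+, as in the originally posted (mistyped) form."""
    return re.fullmatch(R + "+", s) is not None


def is_word_single(s):
    """word := R, as in the correction; any repetition must live inside R."""
    return re.fullmatch(R, s) is not None


for s in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
    print(s, is_word_repeated(s), is_word_single(s))
```

With the corrected form, a longer string such as CVCVC is accepted only if R
itself contains the repetition, which is also where a restriction such as
"independent vowels only word-initially" would have to be expressed.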


>
> Would the regexes be on strings or on traces (strings modulo canonical
> equivalence)?  The language recognised by the regex for the Universal
> Shaping Engine (USE) is notoriously not closed under canonical
> equivalence.
>

Unclear as yet to me what would be the most useful.
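
To make the strings-versus-traces question concrete, here is a small sketch
with a made-up rule (it is not taken from the USE): a pattern that fixes one
order of two combining marks accepts a string but rejects a canonically
equivalent one. Matching on the NFD form is one possible way to make the test
apply to traces; the names below are hypothetical.

```python
import re
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, canonical combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220

typed = "a" + ACUTE + DOT_BELOW
normalized = unicodedata.normalize("NFD", typed)  # marks are reordered by class

# Hypothetical rule that insists on one particular order of the two marks.
rule = re.compile("a" + ACUTE + DOT_BELOW)

# The same rule stated in canonical (NFD) order.
rule_nfd = re.compile("a" + DOT_BELOW + ACUTE)


def matches_trace(pattern, text):
    """Match against the NFD form so canonically equivalent strings agree."""
    return pattern.fullmatch(unicodedata.normalize("NFD", text)) is not None


print(rule.fullmatch(typed) is not None)        # accepted as typed
print(rule.fullmatch(normalized) is not None)   # rejected after normalization
print(matches_trace(rule_nfd, typed), matches_trace(rule_nfd, normalized))
```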


> Most non-spacing marks should not occur double - though I think the
> most significant trouble with them is with fonts that won't then show
> them double.  Barring them could make for a tricky regex.  But, if we
> applied that to the Latin script, should we allow f̂̂ (the Fourier
> transform of the Fourier transform of f) as a word?  Tibetan allows
> some non-spacing marks to occur triple.
>

There is always a choice as to how strict to make them. The goal shouldn't
be so tight as to exclude legitimate words, and trying to be too
fine-grained can make the expressions overly complicated. Moreover there
isn't any question as to how "f̂̂ (the Fourier
transform of the Fourier transform of f)" would be spelled, so no need to
exclude it. But preventing spoofing wouldn't be the goal.
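
For anyone who does want to look at doubled marks, a check outside the regex
may be simpler than barring them inside it. The following sketch only reports
immediate repetitions of the same nonspacing mark; the function name is
hypothetical, and whether a given repetition is legitimate (as in the Fourier
or Tibetan cases) is left to the caller.

```python
import unicodedata


def doubled_marks(text):
    """Return (index, char) for each nonspacing mark (gc=Mn) that directly
    repeats the character before it."""
    hits = []
    for i in range(1, len(text)):
        ch = text[i]
        if ch == text[i - 1] and unicodedata.category(ch) == "Mn":
            hits.append((i, ch))
    return hits


# f followed by two COMBINING CIRCUMFLEX ACCENT characters (U+0302).
print(doubled_marks("f\u0302\u0302"))
```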


> Richard.
>
>

From charupdate at orange.fr  Thu Jan 12 08:22:18 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 12 Jan 2017 15:22:18 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170112063524.GF14923@macbook>
References: <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
Message-ID: <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>

On 12 Jan 2017 08:35:24 +0200, Khaled Hosny wrote:
> 
> On Thu, Jan 12, 2017 at 12:24:29PM +0900, Martin J. Dürst wrote: 
> > On 2017/01/11 17:32, Richard Wordingham wrote: 
> > 
> > > The truly straight Unicode approach in HTML is to use 19&#x2044;45. 
> > > Just entering those 5 characters into a text entry box in Firefox gave 
> > > me a properly formatted vulgar fraction. That is how vulgar fractions 
> > > are supposed to work. Unfortunately, one may need to avoid 'exciting 
> > > new fonts' in favour of those with a large, working repertoire. 

Even Times New Roman turned out to be obsolete from this viewpoint, while 
Cambria (and Consolas) do work. I should make a comprehensive overview of all 
fonts, perhaps in a dedicated article “Fraction Slash” on Wikipedia (which seems 
to have existed).

> > 
> > Just for the record: The vulgar fraction display also happened in 
> > Thunderbird (on Windows).

It doesn’t work for me.

> > Firefox and Thunderbird use the same display 
> > engine. I have switched HTML display off, because I prefer to read all my 
> > mail in plain text,

This is one more reason to make plain text more powerful and comprehensive.
But when an e-mail is written in HTML, turning it into plain text usually doesn’t 
convert the HTML formatting to plain text markup. I say “usually” because this is 
just what HyperMail, which builds the Unicode Mailing List Archives, does: at 
least the start of a superscript is converted to a ^.

How do you deal with the loss of content information, such as stress, superscript, 
and so on?

> > but it still worked. 

It seems to me that in this use case, it will be even more likely to work, given 
that the plain text font of Firefox and Chrome (assuming that this is used in the 
text boxes of all websites) is up-to-date, while most font families used for HTML 
aren’t. Though I must find a way to update my system fonts.

> 
> This is done by HarfBuzz which automatically activates OpenType 
> frac/dnom/numr features for <digits><fraction slash><digits> sequences, 
> so if the font has the features one gets vulgar fractions out of the box. 

According to Wikipedia (
https://en.wikipedia.org/wiki/HarfBuzz#Major_users
), HarfBuzz is included in LibreOffice too, but being on Windows, despite 
having just installed the brand-new version 5.2.4.2, I still don’t get it, since 
it comes with 5.3: 
https://wiki.documentfoundation.org/ReleaseNotes/5.3#Text_Layout

Thanks however!

While waiting for this, I shall probably keep inputting fractions as 
superscript-subscript sequences, given that many fonts do have the appropriate 
glyphs for fractions mapped to the Unicode super/sub scripts, while application 
formatting for fractions (which I thought TUS was referring to) is available in 
desktop publishing software only, and the default super/sub formatting doesn’t 
match the requirements for vulgar fractions.

> This works in Chrome as well since it uses HarfBuzz (older versions of 
> Chrome didn’t enable HarfBuzz by default for Latin so the fractions 
> might not show there). 

This raises a compatibility issue. Having tested my page on Chrome where I get the 
fractions right in some fonts like Cambria, I’m about to switch the default typeface 
from Tahoma to Cambria (or some other one, if I find another proportional font 
out there that does work as intended). But what will happen when somebody loads 
the page in another browser (Edge, Safari, Opera, IE)? 

I guess that the collateral damage (of being tagged as a careless and sloppy 
typographer) is minimized when I use a proven and stable feature like composing 
fractions following the 'U+00B9 U+2079 U+2044 U+2084 U+2085' pattern, compared 
with using the (really straightforward) '19 U+2044 45' pattern in its stead.
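
The two patterns can be generated side by side as follows. This is a minimal
sketch, assuming non-negative integers as input; the function names are
hypothetical, and only the code points themselves are taken from the Unicode
charts.

```python
# Superscript digits 0-9: U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
# Subscript digits 0-9: U+2080..U+2089.
SUBSCRIPT_DIGITS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
FRACTION_SLASH = "\u2044"


def compose_fraction(numerator, denominator):
    """Superscript digits + U+2044 + subscript digits (the workaround)."""
    sup = "".join(SUPERSCRIPT_DIGITS[int(d)] for d in str(numerator))
    sub = "".join(SUBSCRIPT_DIGITS[int(d)] for d in str(denominator))
    return sup + FRACTION_SLASH + sub


def plain_fraction(numerator, denominator):
    """Ordinary digits around U+2044, leaving the layout to the renderer."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"


for s in (compose_fraction(19, 45), plain_fraction(19, 45)):
    print(" ".join(f"U+{ord(c):04X}" for c in s))
```

The first line printed is the 'U+00B9 U+2079 U+2044 U+2084 U+2085' sequence
mentioned above; the second shows the same fraction with ordinary digits.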

Therefore, I’m interested in learning for what reasons the widespread and thorough
implementation of a feature like the Unicode behavior of U+2044 FRACTION SLASH 
takes more than fifteen years, if it will ever be thoroughly implemented!

Regards,
Marcel


From khaledhosny at eglug.org  Thu Jan 12 09:01:41 2017
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Thu, 12 Jan 2017 17:01:41 +0200
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
Message-ID: <20170112150141.GG14923@macbook>

On Thu, Jan 12, 2017 at 03:22:18PM +0100, Marcel Schneider wrote:
> > This is done by HarfBuzz which automatically activates OpenType 
> > frac/dnom/numr features for <digits><fraction slash><digits> sequences, 
> > so if the font has the features one gets vulgar fractions out of the box. 
> 
> According to Wikipedia (
> https://en.wikipedia.org/wiki/HarfBuzz#Major_users
> ), HarfBuzz is included in LibreOffice too, but being on Windows, despite 
> having just installed the brand-new version 5.2.4.2, I still don’t get it, since 
> it comes with 5.3: 
> https://wiki.documentfoundation.org/ReleaseNotes/5.3#Text_Layout

LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is
not released yet.

Regards,
Khaled

From charupdate at orange.fr  Thu Jan 12 11:01:35 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 12 Jan 2017 18:01:35 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170112150141.GG14923@macbook>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
Message-ID: <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25>

On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote:
> 
> > According to Wikipedia (
> > https://en.wikipedia.org/wiki/HarfBuzz#Major_users
> > ), HarfBuzz is included in LibreOffice too, but being on Windows, despite 
> > having just installed the brand-new version 5.2.4.2, I still don’t get it, since 
> > it comes with 5.3: 
> > https://wiki.documentfoundation.org/ReleaseNotes/5.3#Text_Layout
> 
> LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is
> not released yet.
> 

Thank you anyway!

If I were on Linux, I’d have had it all along (my previous version, 4.2.4.2, is 
later than 4.1, in which HarfBuzz was first included in LibreOffice). On Windows 7, 
I have DirectWrite, and this is probably why Arabic glyphs are substituted before 
my eyes, but I can’t get the fractions displayed the standard way in Internet 
Explorer 11, either in a text box or in a web page, even when using Gabriola, 
DirectWrite’s demo font.

This is why, again, I cannot use the intended functioning of U+2044 FRACTION SLASH: 
when I make up a web page relying on this intended display feature, any 
visitors who load it in any version of Internet Explorer on Windows 7 may 
consider that I’m doing bad typography.

Hence again: Can any (good) reasons be identified for the following two shortcomings:

1) The implementation of U+2044, while punctually thorough, still isn’t widespread;

2) The use of non-Galician-Italian-Portuguese-Spanish ordinal indicators is prohibited 
while they are de facto available in Unicode. [1]

Regards,
Marcel

[1] According to Wikipedia:
https://en.wikipedia.org/wiki/Subscript_and_superscript#Alignment_examples
https://en.wikipedia.org/wiki/Subscript_and_superscript#Desktop_publishing
they must be even better than generic superscripting in word processors, that 
is considered too high and too light from a typographical point of view.


From richard.wordingham at ntlworld.com  Thu Jan 12 12:42:42 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 12 Jan 2017 18:42:42 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
Message-ID: <20170112184242.1507f3a8@JRWUBU2>

On Thu, 12 Jan 2017 14:12:09 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> I agree that comprehension is a goal. I'd imagine using a BNF regex,
> like the following. This is simple, since I'm just doing Latin, but
> you can see what I mean.

> word = base* ;
> base = (latinLetter latinMn*) ;
> latinLetter = [[:scx=Latn:]&[:L:]] ;
> latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
> 
> which turns into the single regex expression:
> 
> ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*

Ouch!  That's alarmingly wrong.  You've excluded the likes of
English 'Ca<ZWJ>esar' with ZWJ, Welsh 'Llan<CGJ>gollen' with CGJ (the word
doesn't contain the letter 'ng') and the ISO-sanctioned transliteration
of Thai SO SUEA as 's̄'.  Fixing it isn't easy.  At least, I assume
Arabic harakat don't attach to Latin letters in your conception of
Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't
work well.

The problem may be conflicting requirements on the Script_Extensions
property.

Richard.


From mark at macchiato.com  Thu Jan 12 14:03:29 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 12 Jan 2017 21:03:29 +0100
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170112184242.1507f3a8@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
Message-ID: <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>

That was just an example off the top of my head of the format for using
with regex; I don't pretend that it is vetted. Latin is not a complex
script, so it was only an illustration. However, it was just brain freeze
on my part to not also include Inherited or ZWJ. A more serious effort
would look at some of the issues from http://unicode.org/reports/tr29/, for
example. On the other hand, CGJ is not a problem: it is Mn
<http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say) U+064B
ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.
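
The General_Category half of these statements can be checked with Python's
standard library, as in the sketch below. The Script_Extensions half cannot,
since the unicodedata module does not expose that property; the scx values
quoted above would have to come from the UCD files or a third-party library.
The helper name general_category is hypothetical.

```python
import unicodedata


def general_category(code_point):
    """Return the General_Category of a code point, e.g. 'Mn' or 'Cf'."""
    return unicodedata.category(chr(code_point))


# CGJ is a nonspacing mark, so the [:Mn:] half of latinMn accepts it;
# ZWJ is a format character and is not covered by [:L:] or [:Mn:].
for cp in (0x034F, 0x200D, 0x064B):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc={general_category(cp)}")
```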

Mark

On Thu, Jan 12, 2017 at 7:42 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Thu, 12 Jan 2017 14:12:09 +0100
> Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> > I agree that comprehension is a goal. I'd imagine using a BNF regex,
> > like the following. This is simple, since I'm just doing Latin, but
> > you can see what I mean.
>
> > word = base* ;
> > base = (latinLetter latinMn*) ;
> > latinLetter = [[:scx=Latn:]&[:L:]] ;
> > latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
> >
> > which turns into the single regex expression:
> >
> > ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*
>
> Ouch!  That's alarmingly wrong.  You've excluded the likes of
> English 'Ca<ZWJ>esar' with ZWJ, Welsh 'Llan<CGJ>gollen' with CGJ (the word
> doesn't contain the letter 'ng') and the ISO-sanctioned transliteration
> of Thai SO SUEA as 's̄'.  Fixing it isn't easy.  At least, I assume
> Arabic harakat don't attach to Latin letters in your conception of
> Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't
> work well.
>
> The problem may be conflicting requirements on the Script_Extensions
> property.
>
> Richard.
>
>

From charupdate at orange.fr  Thu Jan 12 15:04:12 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 12 Jan 2017 22:04:12 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
 <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25>
Message-ID: <1651851807.24444.1484255052680.JavaMail.www@wwinf1p25>

What typically happens with the correct use of fraction slash on a collaborative 
website like Wikipedia is that the superscripts and subscripts are restored, as 
I’ve just found while trying to share the section:

https://en.wikipedia.org/w/index.php?title=Slash_(punctuation)&diff=prev&oldid=759542943
| 
| (→Fractions: Removed browser-specific information, restored hack that works on most browsers)
| 
| […]
| 
| […] (e.g., display of {{not a typo|11⁄12}} as <span style="font-family: Cambria;">11⁄12</span>),<ref>{{citation 
|title=The Unicode Standard, […]

Restored by somebody to:
|
| […] (e.g., display of {{not a typo|11⁄12}} as {{not a typo|¹¹⁄₁₂}}),{{citation |title=The Unicode Standard, […]
| 

Thus, OK for the “hack.” Whether that hack is undisciplined or not now becomes 
a better question. 

In my opinion, the lack of discipline is rather found in the vendors of persistently 
non-conformant software. Though I wouldn’t bother them, if only Unicode could 
accept that the users who need to work with the software need to work around it.
“Couldn’t Unicode follow Microsoft?” And follow their users, please.

Consequently, one ought to remember what a keyboard layout really is: a facility 
to help people input the characters they need and use. Therefore, complete ones 
should support the input of fractions composed with super/sub scripts and U+2044, 
and as of Unicode, the Consortium should allow people to write fractions this way 
around if they cannot afford to write them in the standard way. Mentioning this 
in the relevant section of the Standard would avoid tagging these keyboard layout 
developers as hackers. (I’m not a hacker, nor am I a programmer.)

Extrapolating from this to ordinal indicators, one could consider that all the 
reasons opposed so far are based only on the lack of updated fonts and on the 
will of the UTC. This is why I cannot consider them as good reasons without some 
additional arguments.

- Fonts: The *true* FRACTION SLASH U+2044 turns out to be even less common than 
the superscript small letters, and we can hope that when facing the real use, 
font-vendors will agree to update the typefaces. 

- Formatting: This has ended up as inappropriate whenever no fine-tuning (CSS) 
can be performed, so that the superscript small letters are finally less bad, 
and even more appropriate in many circumstances.

- Unicode design principles: They are biased. Cf. the naming policy of the 
superscript small letters, declared as 'MODIFIER LETTER SMALL .', while all 
other instances show more straightforward identifiers and headings:

@ Latin superscript modifier letters
x (superscript latin small letter i - 2071) // (These conform to early standards)
x (superscript latin small letter n - 207F)
02B0 MODIFIER LETTER SMALL H // (Should be: LATIN SUPERSCRIPT SMALL LETTER H)
* aspiration
#  0068
[…]
@ Latin subscript modifier letters
1D62 LATIN SUBSCRIPT SMALL LETTER I
#  0069
[…]
@ Subscripts
[…]
2090 LATIN SUBSCRIPT SMALL LETTER A
#  0061
2091 LATIN SUBSCRIPT SMALL LETTER E
#  0065
[…]

Regards,
Marcel


From richard.wordingham at ntlworld.com  Thu Jan 12 15:26:02 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 12 Jan 2017 21:26:02 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
Message-ID: <20170112212602.00511354@JRWUBU2>

On Thu, 12 Jan 2017 21:03:29 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> That was just an example off the top of my head of the format for
> using with regex; I don't pretend that it is vetted. Latin is not a
> complex script, so it was only an illustration. However, it was just
> brain freeze on my part to not also include Inherited or ZWJ. A more
> serious effort would look at some of the issues from
> http://unicode.org/reports/tr29/, for example. On the other hand, CGJ
> is not a problem: it is Mn
> <http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say)
> U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.

Ah, I had not appreciated that sc=Inherited does not imply
scx=Inherited. Using Script_Extensions to document the international
combining characters that are used, for example, with Thai bases could
have all sorts of undesirable knock-on effects.

Richard.


From richard.wordingham at ntlworld.com  Fri Jan 13 03:02:32 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 13 Jan 2017 09:02:32 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
Message-ID: <20170113090232.536a0d12@JRWUBU2>

On Thu, 12 Jan 2017 21:03:29 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> Latin is not a complex script,...

Unlike the common script, which notably has U+2044 FRACTION SLASH.

That statement is actually dubious from a typographical point of view.

> ...so it was only an illustration.

But it's good for looking for the non-obvious issues.

> A more serious effort would look at some of the issues from
> http://unicode.org/reports/tr29/, for example.

I don't think we want to have to repeat them all for each script.
Putting common-script punctuation and numbers in the regex will add
obscurity, and possibly be a maintainability issue.

Richard.


From asmusf at ix.netcom.com  Fri Jan 13 03:34:48 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 13 Jan 2017 01:34:48 -0800
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170113090232.536a0d12@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
 <20170113090232.536a0d12@JRWUBU2>
Message-ID: <c35403d7-dd78-2dff-8ab5-84e3a6ab14f3@ix.netcom.com>

I believe that any attempt to define a "regex" that describes *all legal 
text* in a given script is a priori doomed to failure.

Part of the problem is that writing systems work not unlike human 
grammars in a curious mixture of pretty firm rules coupled to lists of 
exceptions. (Many texts by competent authors will contain 
"ungrammatical" sentences that somehow work despite or because of not 
following the standard rules). The Khmer issue that started the 
discussion showed that there can be a single word that needs to be 
handled exceptionally.

If you try to capture all the exceptions in the general rules, the set 
of rules gets complicated, but is also likely to be way too permissive 
to be useful.

The Khmer LGR for the Root Zone, for example, deliberately disallows the 
exception (in the word for "give") so that it can be stated (a) more 
compactly and (b) does not allow the exceptional sequencing of certain 
characters to become applicable outside the single exception.

An LGR is concerned with *single* instances of each word. Even the most 
common word in a language can only be registered once in each zone. 
Therefore, such a drastic treatment is a perfectly good solution. For a 
rendering engine, you'd want to be much more permissive, perhaps even 
attempt to display patently "wrong" sequences. For a validation tool 
(spell checker) you would aim for some other sweet spot. Finally, to 
determine "first word" or "first syllable" for formatting purposes (such 
as "drop caps") there may yet be a different selection.

As a result, I believe it would be most useful if a regex or BNF could 
be created for the "typical" / "idealized" description of a "word" in 
the various scripts.

Then, depending on the facts in question, the BNF could be augmented 
with more or less formalized descriptions of variations, exceptions, etc.

The idea would be to provide "building blocks" that can be used to 
assemble rules tailored to various scenarios by the reader of the 
standard. (Because of that, they should be part of the description 
section, not a data file...)

Even if the BNFs did nothing more than capture succinctly the 
information presented in text and tables, they would be useful.

For scripts where things like ZWJ and CGJ are optional, it doesn't make 
sense to run them into the standard BNF - that just messes things up. It 
is much more useful to provide generic context information of how to add 
them to existing text.

For example, the CGJ is really intended to go between letters. So, 
describe that context.

Overall, describing the local contexts for a given character or class of 
characters has proven to be more useful in the LGR project than 
attempting to write global rules.

A./



From mark at macchiato.com  Fri Jan 13 03:38:30 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Fri, 13 Jan 2017 10:38:30 +0100
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170112212602.00511354@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
 <20170112212602.00511354@JRWUBU2>
Message-ID: <CAJ2xs_GNN-DOKtXP8VohHtsXdWDECKxMwcaLszfpwYpCVDKEvg@mail.gmail.com>

If you know of combining marks whose scx values should include Thai, please
let us know.

Also, by "Latin is not a complex script" I mean it in the narrow sense I
stated, where the goal is the ordering of characters. That is, nobody would
normally wonder whether 0.5 when expressed by a sequence with U+2044
FRACTION SLASH should be written as the sequence <2, U+2044 FRACTION SLASH,
1>!

There will always be some edge cases, but the target is Tibetan or Myanmar,
not Latin or Cyrillic.

Mark

On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Thu, 12 Jan 2017 21:03:29 +0100
> Mark Davis ☕ <mark at macchiato.com> wrote:
>
> > That was just an example off the top of my head of the format for
> > using with regex; I don't pretend that it is vetted. Latin is not a
> > complex script, so it was only an illustration. However, it was just
> > brain freeze on my part to not also include Inherited or ZWJ. A more
> > serious effort would look at some of the issues from
> > http://unicode.org/reports/tr29/, for example. On the other hand, CGJ
> > is not a problem: it is Mn
> > <http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say)
> > U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.
>
> Ah, I had not appreciated that sc=Inherited does not imply
> scx=Inherited. Using Script_Extensions to document the international
> combining characters that are used, for example, with Thai bases could
> have all sorts of undesirable knock-on effects.
>
> Richard.
>
>

From richard.wordingham at ntlworld.com  Fri Jan 13 11:47:24 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 13 Jan 2017 17:47:24 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <c35403d7-dd78-2dff-8ab5-84e3a6ab14f3@ix.netcom.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
 <20170113090232.536a0d12@JRWUBU2>
 <c35403d7-dd78-2dff-8ab5-84e3a6ab14f3@ix.netcom.com>
Message-ID: <20170113174724.4839b668@JRWUBU2>

On Fri, 13 Jan 2017 01:34:48 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> I believe that any attempt to define a "regex" that describes *all
> legal text* in a given script is a-priori doomed to failure.
> 
> Part of the problem is that writing systems work not unlike human 
> grammars in a curious mixture of pretty firm rules coupled to lists
> of exceptions. (Many texts by competent authors will contain 
> "ungrammatical" sentences that somehow work despite or because of not 
> following the standard rules). The Khmer issue that started the 
> discussion showed that there can be a single word that needs to be 
> handled exceptionally.

It's a single word in the *current* orthography for the Khmer language
in Cambodia. According to Michel Antelme, on pp20-1 of "Inventaire
provisoire des caractères et divers signes des écritures khmères
pré-modernes et modernes employés pour la notation du khmer, du
siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli"
(http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner
of writing was much commoner until it was largely eliminated by a
spelling reform in the first half of the 20th century.  The Thai
Wikipedia page on the use of the script for Thai
(https://th.wikipedia.org/wiki/???????????) gives examples for final
consonants with COENG VO (???? = ????), COENG NO (???? = ????) and
COENG NGO (????? = ????).

> If you try to capture all the exceptions in the general rules, the
> set of rules gets complicated, but is also likely to be way too
> permissive to be useful.

If it is checking for proper use of code points, overgeneration is far
preferable to undergeneration.

> The Khmer LGR for the Root Zone, for example, deliberately disallows
> the exception (in the word for "give") so that it can be stated (a)
> more compactly and (b) does not allow the exceptional sequencing of
> certain characters to become applicable outside the single exception.
> 
> An LGR is concerned with *single* instances of each word. Even the
> most common word in a language can only be registered once in each
> zone.

A label does not have to be a single word.  For example, there are
several, if not many, domain names matching give*.com, where the first
element is clearly the word 'give'.

> Even if the BNFs did nothing more than capture succinctly the 
> information presented in text and tables, they would be useful.

> For scripts where things like ZWJ and CGJ are optional, it doesn't
> make sense to run them into the standard BNF - that just messes
> things up. It is much more useful to provide generic context
> information of how to add them to existing text.

> For example, the CGJ is really intended to go between letters. So, 
> describe that context.

It can be quite useful next to combining marks.  For example, it may be
used to distinguish a diaeresis from an umlaut mark in Fraktur.

Richard.


From richard.wordingham at ntlworld.com  Fri Jan 13 12:19:21 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 13 Jan 2017 18:19:21 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <CAJ2xs_GNN-DOKtXP8VohHtsXdWDECKxMwcaLszfpwYpCVDKEvg@mail.gmail.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
 <20170112212602.00511354@JRWUBU2>
 <CAJ2xs_GNN-DOKtXP8VohHtsXdWDECKxMwcaLszfpwYpCVDKEvg@mail.gmail.com>
Message-ID: <20170113181921.16967374@JRWUBU2>

On Fri, 13 Jan 2017 10:38:30 +0100
Mark Davis ☕ <mark at macchiato.com> wrote:

> On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:  

> > Using Script_Extensions to document the international
> > combining characters that are used, for example, with Thai bases
> > could have all sorts of undesirable knock-on effects.

> If you know of combining marks whose scx values should include Thai,
> please let us know.

If you refer to the end of TUS 9.0 Section 16.1 you will find mention
of U+0331 COMBINING MACRON BELOW and U+0303 COMBINING TILDE, which are
thus candidates for scx ? Latn.  One might also consider U+0359
COMBINING ASTERISK BELOW; I have seen the combination ช͙ <U+0E0A THAI
CHARACTER CHO CHANG, U+0359> used in a phonetic symbol for English,
representing [?].

As their scx values are 'Inherited', should their values not be treated
as though they already included Thai?  I suppose, though, that they
do not in fact match "\p{scx=Thai}".  There does seem to be a view that
scx=inherited is shorthand for some list of European scripts.  

Richard.


From asmusf at ix.netcom.com  Fri Jan 13 12:27:35 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 13 Jan 2017 10:27:35 -0800
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170113174724.4839b668@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
 <20170113090232.536a0d12@JRWUBU2>
 <c35403d7-dd78-2dff-8ab5-84e3a6ab14f3@ix.netcom.com>
 <20170113174724.4839b668@JRWUBU2>
Message-ID: <d3868921-d675-aec3-c48c-77082270c990@ix.netcom.com>

On 1/13/2017 9:47 AM, Richard Wordingham wrote:
> On Fri, 13 Jan 2017 01:34:48 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
>> I believe that any attempt to define a "regex" that describes *all
>> legal text* in a given script is a-priori doomed to failure.
>>
>> Part of the problem is that writing systems work not unlike human
>> grammars in a curious mixture of pretty firm rules coupled to lists
>> of exceptions. (Many texts by competent authors will contain
>> "ungrammatical" sentences that somehow work despite or because of not
>> following the standard rules). The Khmer issue that started the
>> discussion showed that there can be a single word that needs to be
>> handled exceptionally.
> It's a single word in the *current* orthography for the Khmer language
> in Cambodia. According to Michel Antelme, on pp20-1 of "Inventaire
> provisoire des caractères et divers signes des écritures khmères
> pré-modernes et modernes employés pour la notation du khmer, du
> siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli"
> (http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner
> of writing was much commoner until it was largely eliminated by a
> spelling reform in the first half of the 20th century.

This points to another interesting issue. A number of languages have 
seen orthographic reforms that affect the use of complex scripts.

Now then, a decision: do you support both the old and the new style in 
the same rule-set? If vestiges remain in general use, you may not have a 
choice, but, what if the rules for old and new (or for different 
languages in the same script) actually conflict?

>   The Thai
> Wikipedia page on the use of the script for Thai
> (https://th.wikipedia.org/wiki/???????????) gives examples for final
> consonants with COENG VO (???? = ????), COENG NO (???? = ????) and
> COENG NGO (????? = ????).

In the case that I cited, that combination of language/script was taken 
as out of scope for other reasons; now, for general text, are there 
situations where you'd want separate sets of rules for each language?
>
>> If you try to capture all the exceptions in the general rules, the
>> set of rules gets complicated, but is also likely to be way too
>> permissive to be useful.
> If it is checking for proper use of code points, overgeneration is far
> preferable to undergeneration.

Agreed. For modeling general text you don't want to actually exclude 
anything that can occur. However, what can you exclude?

If you think of spell-checking as a scenario, overgeneration is not 
acceptable. Instead, you have a standard dictionary that deals with 
"general vocabulary" and there's a well defined mechanism to allow the 
user to add "exceptions".

My point is that you cannot design a ruleset without having a very 
well-defined use-case. If you divide the rule sets into "building 
blocks" then it may be easier to address different use cases than if you 
simply provide a "maximally permissive" set of rules.

I'm skeptical that a one-size-fits-all set of rules can be devised and 
be useful.

For rules that strongly err on the side of overgeneration, it might make 
more sense to simply define the few contexts that are deemed 
impermissible and set the rest to "anything goes".
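
A minimal sketch of that "deny a few contexts, allow the rest" idea, 
in Python. The two forbidden contexts below (a doubled U+17D2 KHMER SIGN 
COENG, and a COENG at the very end of the text) are hypothetical 
placeholders chosen only for illustration; they are not taken from any 
LGR or from the standard, and the names FORBIDDEN_CONTEXTS and 
is_permitted are likewise assumptions.

```python
import re

COENG = "\u17D2"  # KHMER SIGN COENG

# Hypothetical deny-list: each entry describes one local context that the
# rule set rejects. Everything not matched here is "anything goes".
FORBIDDEN_CONTEXTS = [
    re.compile(COENG + COENG),    # assumption: two COENGs in a row
    re.compile(COENG + r"\Z"),    # assumption: COENG with nothing after it
]


def is_permitted(text):
    """Return True unless some forbidden local context occurs in text."""
    return not any(p.search(text) for p in FORBIDDEN_CONTEXTS)


if __name__ == "__main__":
    ka = "\u1780"  # KHMER LETTER KA
    print(is_permitted(ka + COENG + ka))          # no forbidden context
    print(is_permitted(ka + COENG + COENG + ka))  # doubled COENG
```

Such a rule set overgenerates by design: it accepts any sequence, 
sensible or not, that avoids the listed contexts.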
>
>> The Khmer LGR for the Root Zone, for example, deliberately disallows
>> the exception (in the word for "give") so that it can be stated (a)
>> more compactly and (b) does not allow the exceptional sequencing of
>> certain characters to become applicable outside the single exception.
>>
>> An LGR is concerned with *single* instances of each word. Even the
>> most common word in a language can only be registered once in each
>> zone.
> A label does not have to be a single word.  For example, there are
> several, if not many, domain names matching give*.com, where the first
> element is clearly the word 'give'.

Correct, but each compound can still occur only once. I cite this 
example only because the local body that drafted the rules decided that 
there was a reasonable tradeoff (complexity vs. generality) for the 
purpose of top level domain names (i.e. ".give*" not "give*.com").

For that application, complexity has a relatively high negative weight 
associated with it, and complete coverage, while desirable, is not given 
the same high positive weight that it would have in describing ordinary 
text.

>> Even if the BNFs did nothing more than capture succinctly the
>> information presented in text and tables, they would be useful.
>> For scripts where things like ZWJ and CGJ are optional, it doesn't
>> make sense to run them into the standard BNF - that just messes
>> things up. It is much more useful to provide generic context
>> information of how to add them to existing text.
>> For example, the CGJ is really intended to go between letters. So,
>> describe that context.

(Forgot to make clear that this was a bit of a hypothetical)

> It can be quite useful next to combining marks.  For example, it may be
> used to distinguish a diaeresis from an umlaut mark in Fraktur.

Even if it is intended to go anywhere, even between digits, symbols and 
punctuation, it's much easier to describe that behavior separately 
rather than trying to insert it in every location in every regex. What 
I'm thinking is a description that gives a "skeleton word" and then you 
state, that this skeleton can be decorated (or whatever your preferred 
term) by inserting a CGJ anywhere.

The same goes for ZWJ/ZWNJ for any script where they don't have a 
recognized specific effect in particular sequences.
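
A rough sketch of the "skeleton word plus decoration" idea, in Python. 
The skeleton pattern here is a toy assumption (lowercase ASCII letters, 
each optionally followed by U+0308 COMBINING DIAERESIS), not a vetted 
description of any script, and the names SKELETON and matches_decorated 
are invented for the illustration. The point is only that the optional 
characters are removed before matching instead of being written into 
every position of the regex.

```python
import re

CGJ = "\u034F"   # COMBINING GRAPHEME JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

# Hypothetical toy skeleton, NOT a real rule for any script: one or more
# lowercase ASCII letters, each optionally followed by U+0308.
SKELETON = re.compile(r"(?:[a-z]\u0308?)+")


def matches_decorated(text, decorations=(CGJ,)):
    """Strip the 'decoration' characters, then match the bare skeleton."""
    skeleton = "".join(ch for ch in text if ch not in decorations)
    return SKELETON.fullmatch(skeleton) is not None


if __name__ == "__main__":
    # <a, CGJ, U+0308, b>: CGJ between a base letter and a combining mark.
    print(matches_decorated("a" + CGJ + "\u0308" + "b"))
    # ZWJ is only accepted when the caller lists it as a decoration.
    print(matches_decorated("ab" + ZWJ))
    print(matches_decorated("ab" + ZWJ, decorations=(CGJ, ZWJ, ZWNJ)))
```

Which characters count as decorations is passed in per call, so the same 
skeleton can serve scripts where ZWJ/ZWNJ are freely insertable and 
scripts where they are not.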


From charupdate at orange.fr  Fri Jan 13 18:19:30 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 14 Jan 2017 01:19:30 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170111083212.476f492e@JRWUBU2>
References: <20170104173306.665a7a7059d7ee80bb4d670165c8327d.7a92080546.wbe@email03.godaddy.com>
 <2039959835.345.1483595818937.JavaMail.www@wwinf1k39>
 <CAJKta0yUtjLxjjEn5BMWDmLNJmuonEwqp+e7gYXTwp54J-CejQ@mail.gmail.com>
 <104200337.5858.1483616029614.JavaMail.www@wwinf1k39>
 <CY4PR03MB279071362D8201D75D89F2F1D5600@CY4PR03MB2790.namprd03.prod.outlook.com>
 <538089927.246.1483681334517.JavaMail.www@wwinf1p15>
 <fc5fda79-b144-4050-700f-280d9a9c1c98@ix.netcom.com>
 <828252581.19365.1483729345941.JavaMail.www@wwinf1p15>
 <1112132817.50.1483847284449.JavaMail.www@wwinf1p17>
 <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
Message-ID: <2133262187.20678.1484353170497.JavaMail.www@wwinf1p26>

On Wed, 11 Jan 2017 08:32:12 +0000, Richard Wordingham wrote:
[…]
> The truly straight Unicode approach in HTML is to use 19&#x2044;45. 
> Just entering those 5 characters into a text entry box in Firefox gave 
> me a properly formatted vulgar fraction. That is how vulgar fractions 
> are supposed to work. Unfortunately, one may need to avoid 'exciting 
> new fonts' in favour of those with a large, working repertoire.

A new “Fraction Slash and Fonts” thread in the BÉPO community has brought up 
that this works mainly with new and ambitious fonts:
• Carlito
• Fira Sans
• Linux Biolinum
• Linux Libertine
• Roboto
• Source Sans Pro
• Source Serif Pro
• Ubuntu

By contrast, the typefaces not supporting U+2044 correctly include:
- FreeSans
- FreeSerif
- Open Sans
- DejaVu
- Droid
- Liberation
- TeX Gyre

BTW, the Times New Roman font that the Mailing List Archives specify belongs to 
this latter category, so that fractions written with U+2044 and normal-size 
digits display in fallback mode.
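
For reference, the five-character sequence quoted above can be built and 
converted to the HTML form as follows. This is only a sketch in Python; 
the function names vulgar_fraction and to_numeric_refs are made up for 
the illustration, and whether the result displays as a stacked fraction 
still depends entirely on the font and the shaping engine, as listed above.

```python
import html
import unicodedata

FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH


def vulgar_fraction(numerator, denominator):
    """Plain-text fraction: ordinary digits around U+2044."""
    return f"{numerator}{FRACTION_SLASH}{denominator}"


def to_numeric_refs(text):
    """Write non-ASCII characters as hexadecimal character references.

    Assumption: the input holds only digits and U+2044, so '&' and '<'
    never need escaping here.
    """
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


if __name__ == "__main__":
    plain = vulgar_fraction(19, 45)
    print(len(plain))                          # five characters
    print(unicodedata.name(FRACTION_SLASH))
    markup = to_numeric_refs(plain)
    print(markup)
    print(html.unescape(markup) == plain)      # round trip
```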

Software support is mainly found in open projects, as we have seen:
• HarfBuzz, and its users:
  • LibreOffice
  • Firefox
  • Chrome

Meanwhile, Microsoft products not supporting U+2044 correctly include:
- DirectWrite
- Internet Explorer, including its latest version, 11

• Does anybody know why Microsoft is reluctant to support U+2044?

• And why, on the other hand, is the widespread and popular way of writing fractions 
   as <superscript>U+2044<subscript> sequences discouraged and even ridiculed?

Regards,
Marcel


From charupdate at orange.fr  Fri Jan 13 19:18:01 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 14 Jan 2017 02:18:01 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170112150141.GG14923@macbook>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
Message-ID: <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>

On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote:
> 
> LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is 
> not released yet. 

Is the integration of HarfBuzz limited to free software? 
And what might be the reason for the delayed integration of HarfBuzz in the 
Windows version of LibreOffice?

While the previously cited “Fraction slash U+2044 and fonts” thread on the 
BÉPO community (Ergodis association) mailing list reviewed mainly free fonts, 
I find on my netbook on Windows 7 the following fonts that work correctly:
• Calibri
• Calibri Light
• Cambria
• Cambria Math
• Candara
• Consolas
• Constantia
• Corbel
• Gabriola
• Palatino Linotype
besides:
• Source Code Pro
• Source Sans Pro

• Is there any evidence of ongoing efforts to update Times New Roman?

I believe that an outdated typeface, as Times New Roman currently is, is 
inappropriate for the Unicode Mail Archives.

Regards,
Marcel


From richard.wordingham at ntlworld.com  Fri Jan 13 19:18:09 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 14 Jan 2017 01:18:09 +0000
Subject: Specification of Encoding of Plain Text
In-Reply-To: <d3868921-d675-aec3-c48c-77082270c990@ix.netcom.com>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
 <20170113090232.536a0d12@JRWUBU2>
 <c35403d7-dd78-2dff-8ab5-84e3a6ab14f3@ix.netcom.com>
 <20170113174724.4839b668@JRWUBU2>
 <d3868921-d675-aec3-c48c-77082270c990@ix.netcom.com>
Message-ID: <20170114011809.78042ba4@JRWUBU2>

On Fri, 13 Jan 2017 10:27:35 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> This points to another interesting issue. A number of languages have 
> seen orthographic reforms that affect the use of complex scripts.

> Now then, a decision: do you support both the old and the new style
> in the same rule-set? If vestiges remain in general use, you may not
> have a choice, but, what if the rules for old and new (or for
> different languages in the same script) actually conflict?

What we have seen in Khmer is a change that almost prohibits CVC
orthographic clusters.  (I don't count nikahits, visargas or fragments
of vowels as C.)  However, that is a rule of the language; it does not
need to be a rule of the script.

I am not sure that the old and new rules should conflict.  We are
presumably talking about a change made before the script was soundly
encoded; it seems unreasonable that renderers should suddenly refuse to
handle text that was previously valid.

Now, I can think of a potential problem with Northern Thai
<U+1A34 TAI THAM LOW TA, U+1A58 MAI KANG LAI, U+1A60 SAKOT, U+1A43 LA,
U+1A63 SIGN AA, U+1A60 SAKOT, U+1A3F LOW YA> 'all'.  It is a single,
chained orthographic syllable.  This appears to be contrary to Tai Khün
grammar, and it is not clear to me how a modern Tai Khün font should render
it. (It's also contrary to USE, but so is most of the language.)  The
problem is that U+1A58 is a final, spacing mark in Tai Khün, while
further east it is a repha-like mark - it corresponds to kinzi in
Burmese.

The solution I anticipate is that it must be rendered as a
non-spacing mark even in Tai Khün when it cannot be interpreted as a
spacing mark.  Has anyone handled this issue?  My intended solution will
allow a common sequence of code points for both the old style (U+1A58
as kinzi), the intermediate Northern Thai styles, and the new style
(U+1A58 as a final consonant).

> In the case that I cited, that combination of language/script was
> taken as out of scope for other reasons; now, for general text, are
> there situations where you'd want separate sets of rules for each
> language?

For determining which language a text might belong to, different rules
would be appropriate.  However, for deciding whether to render text,
that seems ridiculous.  Converting renderable multilingual text to plain
text would make it unrenderable, which is surely undesirable.

Having said that, there do appear to be potential problems in the Lanna
script arising from interactions of spelling and layout style.  In some
styles, the consonant (and vowel) stack turns right at a certain depth,
and therefore can reasonably contain more items than a strictly
vertical stack.  As both styles appear in material published in Chiang
Mai, I'd be loath to declare different validity rules.  I'd rather
treat any problems as the surfacing of a renderer limitation.

Richard.


From mark at macchiato.com  Sat Jan 14 04:56:10 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sat, 14 Jan 2017 11:56:10 +0100
Subject: Specification of Encoding of Plain Text
In-Reply-To: <20170113181921.16967374@JRWUBU2>
References: <20170109222414.72f83204@JRWUBU2>
 <b591d2d6-ee25-cdfc-572c-46cf7fceef2d@ix.netcom.com>
 <CAJ2xs_G2NUJRjf43BQGqQM2=JCFffAEUzTPm5AnFujrOEfjwEQ@mail.gmail.com>
 <20170110194013.0476f15f@JRWUBU2>
 <CAJ2xs_EzGSUqW_3GATQn1D6rrx0QT3FcY9FDvTVst51N2K-GyQ@mail.gmail.com>
 <20170112184242.1507f3a8@JRWUBU2>
 <CAJ2xs_GVghiRyQE1Uhm_g8FnbCLoCyXcJA-=Owzhki9h5NW=Vg@mail.gmail.com>
 <20170112212602.00511354@JRWUBU2>
 <CAJ2xs_GNN-DOKtXP8VohHtsXdWDECKxMwcaLszfpwYpCVDKEvg@mail.gmail.com>
 <20170113181921.16967374@JRWUBU2>
Message-ID: <CAJ2xs_EbSi3PgQk3uY+41Di6Znm_w-jcOOZCmZ+xu+8_wUMOgg@mail.gmail.com>

Mark

On Fri, Jan 13, 2017 at 7:19 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Fri, 13 Jan 2017 10:38:30 +0100
> Mark Davis ☕ <mark at macchiato.com> wrote:
>
> > On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham <
> > richard.wordingham at ntlworld.com> wrote:
>
> > > Using Script_Extensions to document the international
> > > combining characters that are used, for example, with Thai bases
> > > could have all sorts of undesirable knock-on effects.
>
> > If you know of combining marks whose scx values should include Thai,
> > please let us know.
>
> If you refer to the end of TUS 9.0 Section 16.1 you will find mention
> of U+0331 COMBINING MACRON BELOW and U+0303 COMBINING TILDE, which are
> thus candidates for scx ? Latn.  One might also consider U+0359
> COMBINING ASTERISK BELOW; I have seen the combination ช͙ <U+0E0A THAI
> CHARACTER CHO CHANG, U+0359> used in a phonetic symbol for English,
> representing [?].
>
> As their scx values are 'Inherited', should their values not be treated
> as though they already included Thai?  I suppose, though, that they
> do not in fact match "\p{scx=Thai}".  There does seem to be a view that
> scx=inherited is shorthand for some list of European scripts.
>

The distinction between sc=inherited and sc=common is an unfortunate one,
a remnant from when we first added the script data. The distinction for a
character C is purely derivable from whether gc(C) ∈ [[:mn:][:me:]] or not,
so it is of little value, and with the advantage of hindsight, mostly just
gets in the way.

scx=inherited is *not* a shorthand for some list of European scripts.

Rather, C ∈ [[:scx=inherited:][:scx=inherited:]] means that either

   1. we don't have enough information about usage to be able to list the
   scripts that C is used with, or
   2. C can be used with so many scripts that it is not particularly
   productive to list them all.
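
The gc side of this can be checked with nothing more than Python's 
standard library, as a small sketch. Note the limits: the unicodedata 
module exposes General_Category but not Script or Script_Extensions, so 
this only tests whether a character falls in [[:mn:][:me:]]; it says 
nothing about its scx value. The helper name is_mn_or_me is an 
assumption made for the example.

```python
import unicodedata


def is_mn_or_me(ch):
    """True if General_Category is Mn (nonspacing) or Me (enclosing)."""
    return unicodedata.category(ch) in {"Mn", "Me"}


if __name__ == "__main__":
    # Code points mentioned in this thread, plus one Thai base letter.
    for cp in (0x0331, 0x0303, 0x0359, 0x034F, 0x0E0A):
        ch = chr(cp)
        print(f"U+{cp:04X} {unicodedata.name(ch)}: "
              f"gc={unicodedata.category(ch)} mn_or_me={is_mn_or_me(ch)}")
```

A real check of scx membership would need the Script_Extensions data 
itself, for example from ScriptExtensions.txt in the UCD.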


> Richard.
>
>

From charupdate at orange.fr  Sun Jan 15 10:46:13 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 15 Jan 2017 17:46:13 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
 <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
Message-ID: <295090107.9846.1484498773837.JavaMail.www@wwinf1p26>

On Sat, 14 Jan 2017 02:18:01 +0100 (CET), I wrote:
> 
> I believe that an outdated typeface, as is actually Times New Roman, is 
> inappropriate for the Unicode Mail Archives.

I’ve been kindly informed off-list that the Archives don’t specify any font, 
and are viewed in the default font customizable in the browser preferences.

I apologize for this statement of mine.

Regards,
Marcel


From asmusf at ix.netcom.com  Sun Jan 15 12:15:33 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 15 Jan 2017 10:15:33 -0800
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <295090107.9846.1484498773837.JavaMail.www@wwinf1p26>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
 <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
 <295090107.9846.1484498773837.JavaMail.www@wwinf1p26>
Message-ID: <6535a19d-f880-6e0a-0b18-167d1ea6890f@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170115/5f1079b1/attachment.html>

From christoph.paeper at crissov.de  Mon Jan 16 19:02:02 2017
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Tue, 17 Jan 2017 02:02:02 +0100
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <1651851807.24444.1484255052680.JavaMail.www@wwinf1p25>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
 <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25>
 <1651851807.24444.1484255052680.JavaMail.www@wwinf1p25>
Message-ID: <D18E6C0E-CE1D-42F6-A3EF-E95767634FF5@crissov.de>

Marcel Schneider <charupdate at orange.fr>:
> 
> What typically happens with the correct use of fraction slash on a collaborative 
> website like Wikipedia, is that the superscripts and subscripts are restored, 

JFTR, <http://en.wikipedia.org/wiki/Template:Frac> has been using the fraction slash for many years, but (still) pairs it with HTML/CSS super- and subscripts.

From charupdate at orange.fr  Tue Jan 17 02:25:46 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 17 Jan 2017 09:25:46 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <6535a19d-f880-6e0a-0b18-167d1ea6890f@ix.netcom.com>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
 <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
 <295090107.9846.1484498773837.JavaMail.www@wwinf1p26>
 <6535a19d-f880-6e0a-0b18-167d1ea6890f@ix.netcom.com>
Message-ID: <2048160763.2450.1484641546911.JavaMail.www@wwinf1p10>

I'm aware that this thread is getting lengthy and (presumably) tiresome. 
Therefore, I wouldn't have sent this to the List today. 
I really wanted to take a break and come back later.

However, given the consequences of the outcome of this issue for millions 
of end-users, and the imminence of the French keyboard standardization these months, 
I appreciate being given the opportunity to keep discussing on-list. 

***Disclaimer*** I'm not a part of the French keyboard standard WG, and 
I'm speaking on my own behalf, out of civic responsibility.

On Sun, 15 Jan 2017 10:15:33 -0800, Asmus Freytag wrote:
> 
[Quoted mail]
> 
> Contrary to your assertion about fonts elsewhere, the poor rendering of 
> subscripts/superscripts that I reported to you is based on the fact that 
> the characters are missing, but that the glyphs are not laid out as 
> running text.

To date, as far as I know, the only domain where superscripts and subscripts are 
mandatory in general text is abbreviations of numerals, titles, entities, 
measurement units, chemical compounds and so on, using Western Arabic digits and 
Latin superscript lowercase. I'm quite sure that no other script has this 
typographical convention, which is part of an old discipline called 
"orthotypography." While I was wrong to mix it up with orthography, the outstanding 
importance of these rules for unambiguous representation of text calls for special 
treatment in practice and in the Unicode Standard.

In these ranges, one character is still missing because the UTC has refused to 
encode *LATIN SUPERSCRIPT SMALL LETTER Q, aka *MODIFIER LETTER SMALL Q. 
This has little impact on general practice.
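
The coverage claim above can be checked against the Unicode Character Database. The following is only a sketch, assuming that "superscript form of a letter" means a code point whose compatibility decomposition is `<super>` followed by that single base letter; the helper name `superscript_forms` is made up for illustration, and the result depends on the Unicode version bundled with the Python build that runs it.

```python
import sys
import unicodedata
from string import ascii_lowercase


def superscript_forms():
    """Map each of a-z to the code points whose compatibility
    decomposition is '<super>' plus exactly that letter."""
    forms = {letter: [] for letter in ascii_lowercase}
    for cp in range(sys.maxunicode + 1):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep only single-letter superscript decompositions, e.g. '<super> 0068'.
        if len(parts) == 2 and parts[0] == "<super>":
            base = chr(int(parts[1], 16))
            if base in forms:
                forms[base].append(cp)
    return forms


forms = superscript_forms()
for letter in ascii_lowercase:
    found = " ".join(f"U+{cp:04X}" for cp in forms[letter]) or "(none)"
    print(letter, found)
```

Any letter printed with "(none)" has no superscript counterpart in the data that Python build carries; whether q is among them depends on how recent that data is.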

The main challenge outside Unicode is the availability of the related glyphs in 
current fonts, as well as their consistency. To date, almost all webmails propose 
only fonts where they are designed in an intentionally inconsistent way, supposedly 
to make them unusable for accurate display: The '?' is always far too high, and 
the '?' is too bold and with random vertical alignment. In my opinion, the legacy 
status of these two is used as a fake explanation; compare with the inconsistent 
design of '?' and '?' in some fonts, along with that of '?', while there is no 
excuse of "legacy," unlike for '?', '?' and '?', where "legacy" is equally abused 
to mess up the typefaces. This applies as well to most other fonts. The only 
correct font-family I've found so far is Calibri. Consistently, this is the body 
font in the default template of Microsoft Word.

> 
> When viewing with monospaced fonts, the separation between glyphs 
> corresponds to the spacing of the full-size characters. When using 
> formatting (styling) the superscripted text is in a smaller font size, 
> reducing the spacing between characters, so that strings of them look 
> like ordinary text again and not s p a c e d o u t.

I'm facing this issue when writing drafts in my text editor, where, however, I'm able 
to set the font to any value, including Calibri. Displaying this in Calibri makes it 
possible to appreciate the consistent and running-text-like display of the superscripts:

// This is ???????????????????????????????????????????????????????????????????????????????????? 
// This is the range: ???????????????? ^q_unavailable?????????????????????????????

This is how a complete and Unicode conformant typeface is supposed to work.
In practice, this turns out to be implemented far, far more than U+2044.

> 
> I'm not going to spend much more time on this discussion.

When I launched this discussion on December 28, 2016, I naively believed that 
this time, the matter would be quickly settled, and I could go on being more 
productive on developing the keyboard layouts and documenting them. 
Now that this thread still hasn't come to an even halfway useful result, 
I need to make one more attempt. 

The goal is to get Unicode to accept the fact that people use superscript letters 
in French, and super/sub scripts in vulgar fractions, and have them on their 
keyboards, and that these people are not to be considered hackers, but as making 
a reasonable, thoughtful and responsible use of the Standard. 

That is not a matter of "value inversion," but of correcting a particular 
design principle that was misguided and biased under a (hypothetically) strong 
influence of *extrinsic* factors from the beginning (see point 3 below).

It's good to know about the counter-arguments that may be raised, so I'm 
grateful to all who were so kind as to respond. What bothers me is that there is 
still so much persistent opposition; and what makes me fear the worst is that 
the arguments raised against the general use of preformatted characters are 
so biased and fallacious, unlike any normal-time reasoning: 

1) Missing font support as an argument against the use of a character has never, 
    never been the way Unicode worked, as far as I've been given the opportunity 
    to understand something of Unicode till now.

2) This missing font support is mostly a consequence of the Unicode strategy on 
    these characters: By discouraging their use and even misnaming them intentionally 
    in an inconsistent manner (from an overall point of view), Unicode drove 
    a significant part of the font designers away from adding them completely and 
    with a consistent design, and from implementing combining marks support for 
    these characters.

3) This strategy has been biased from the beginning, as it goes against the user 
    preferences of Latin script using countries, while AFAIK all countries 
    using other scripts are unconcerned because they actually don't *use* 
    superscripting in such an *extensive* way. Please correct me if I'm wrong.
    Consequently, there would be *nobody* asking for more (except the already 
    discussed completion of some ranges of Latin script). This strategy of shooing 
    users (and their developers) away from using preformatted letters and digits 
    seems to aim at nothing more serious than supporting software vendors' marketing 
    strategies, even though the software does not need marketing based on poor 
    character support (and poor keyboard layouts).


> Using code points 
> "against the grain" that is, in contradiction to the way their use was 
> intended when they were encoded means that you are going to run into many 
> issues based on font vendors and implementers expectations on how users 
> would follow the conventions suggested in the Unicode Standard.

I've heard that Edge still doesn't make OpenType fonts work for 
U+2044 either. One might wonder, however, whether users should conform to the Standard 
literally when even Microsoft doesn't. I'm no longer here to post feature requests 
for the attention of Microsoft. My actual suggestions are perhaps a bit 
more complex than that. I just wish that the Unicode policy wrt superscripts 
would become more user-centered, more user-friendly.

The core issue is the use of these letters in current text in some languages 
that need them to apply a typographic convention that is close to orthography. 
Superscripting is a far, far stronger requirement than all other formatting 
conventions, as it can affect the spelling of the grammatical entity.

We're facing strong demands on the user side, relayed by standards bodies from the 
early times on, when ordinal indicators were first encoded as a part of Latin-1. 
Today most users still type a degree sign to emulate a superscript o, and the 
French NB (that I'm not a part of, nor am I a member of the keyboard standard WG) 
wishes an ordinal indicator on the keyboard to represent the most common ordinal 
indicator in French: "?".
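
To make the degree-sign substitution mentioned above concrete, here is a small sketch that compares three candidate characters for a raised "o". The choice of these three candidates and the helper name `describe` are assumptions made for illustration only; the names and decompositions come from Python's `unicodedata` module.

```python
import unicodedata

# Candidate characters a typist might use for a raised "o" (illustrative choice).
candidates = {
    "degree sign": "\u00B0",
    "masculine ordinal indicator": "\u00BA",
    "modifier letter small o": "\u1D52",
}


def describe(ch):
    """Return code point, character name, compatibility decomposition, NFKC form."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.decomposition(ch) or "(none)",
        unicodedata.normalize("NFKC", ch),
    )


for label, ch in candidates.items():
    print(label, describe(ch))
```

The degree sign has no decomposition, so no normalization form relates it to the letter o, whereas the other two both carry a `<super>` mapping to U+006F. A search or spell-check that folds compatibility characters would therefore still not recognize a degree sign as an "o".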

> 
> Your discussion of the support of the fraction slash (with regular digits) 
> across fonts is potentially more useful -- bringing attention to this issue 
> could bring font vendors to perhaps update earlier fonts to support the 
> correct conventions for 2044 (which incidentally post dated the design of 
> many popular fonts).

This is relatively important, but it is far outweighed by the ordinal indicator 
issue, and along with it, the need to stabilize superscript abbreviations. 

> 
> In other words, there's no need to "fix" the character encoding, but much 
> need to make sure that what's in the character encoding (and its associated 
> conventions) is actually supported as intended.

Additionally, I now suggest adding an informative alias to each one of the 
(intentionally) misnamed characters. This "MODIFIER LETTER" disguise of the true 
*LATIN SUPERSCRIPT LETTERs seems to me a twisted trick to make unwary people 
believe that this is something for insiders that is completely useless to other people.

The truth happens to show up wherever the editorial committee (as well as 
anybody else) can afford to feel free to write their own, unbiased language:
[I'm highlighting with uppercase]

@ Latin superscript modifier letters
@+ See also SUPERSCRIPT LATIN LETTERS in the Spacing Modifier Letters block starting at 02B0.
1D2C MODIFIER LETTER CAPITAL A
...

I think that the "MODIFIER LETTER" labeling of these characters is not 
straightforward enough for a standard that claims that the character names are 
mere identifiers. This is an example of how the identifiers were (ab)used as 
descriptors, to carry prescriptions and corporate preferences on how to use or 
not to use the repertoire.

When I'm back writing up some keyboard documentation, I really would like to 
be able to deliver a better image of Unicode (and of Microsoft) than that one.
Please help me improve my communication, and make Unicode a user-centered standard.

Below are the proposed additions, which I'd like to submit for your kind review 
prior to posting them with the Contact Form.

Regards,
Marcel

NamesList snippets with additional informative aliases providing straightforward 
character identifiers, and some comment lines:
(Original file:
http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt
)

@@ 02B0 Spacing Modifier Letters 02FF
@+ Superscript and subscript letters were not intended to replace markup, but they are for use where super/sub scripting is important in plain text, or formatting is inappropriate.
@ Latin superscript modifier letters
@+ "modifier letter small" stands for "latin superscript small letter", and "modifier letter small capital" for "latin letter small capital".
x (superscript latin small letter i - 2071)
x (superscript latin small letter n - 207F)
02B0 MODIFIER LETTER SMALL H
= latin superscript small letter h
* aspiration
# <super> 0068
02B1 MODIFIER LETTER SMALL H WITH HOOK
= latin superscript small letter h with hook
* breathy voiced, murmured
x (latin small letter h with hook - 0266)
x (combining diaeresis below - 0324)
# <super> 0266
02B2 MODIFIER LETTER SMALL J
= latin superscript small letter j
* palatalization
x (combining palatalized hook below - 0321)
# <super> 006A
02B3 MODIFIER LETTER SMALL R
= latin superscript small letter r
# <super> 0072
02B4 MODIFIER LETTER SMALL TURNED R
= latin superscript small letter turned r
x (latin small letter turned r - 0279)
# <super> 0279
02B5 MODIFIER LETTER SMALL TURNED R WITH HOOK
= latin superscript small letter turned r with hook
x (latin small letter turned r with hook - 027B)
# <super> 027B
02B6 MODIFIER LETTER SMALL CAPITAL INVERTED R
= latin letter small capital inverted r
* preceding four used for r-coloring or r-offglides
x (latin letter small capital inverted r - 0281)
# <super> 0281
02B7 MODIFIER LETTER SMALL W
= latin superscript small letter w
* labialization
x (combining inverted double arch below - 032B)
# <super> 0077
02B8 MODIFIER LETTER SMALL Y
= latin superscript small letter y
* palatalization
* common Americanist usage for 02B2
# <super> 0079
[...]
@ Additions based on 1989 IPA
02DE MODIFIER LETTER RHOTIC HOOK
* rhotacization in vowel
* often ligated: 025A = 0259 + 02DE; 025D = 025C + 02DE
02DF MODIFIER LETTER CROSS ACCENT
* Swedish grave accent
02E0 MODIFIER LETTER SMALL GAMMA
= latin superscript small letter gamma
* these modifier letters are occasionally used in transcription of affricates
# <super> 0263
02E1 MODIFIER LETTER SMALL L
= latin superscript small letter l
# <super> 006C
02E2 MODIFIER LETTER SMALL S
= latin superscript small letter s
# <super> 0073
02E3 MODIFIER LETTER SMALL X
= latin superscript small letter x
# <super> 0078
02E4 MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
= latin superscript letter reversed glottal stop
# <super> 0295
[...]
@ Latin superscript modifier letters
@+ See also superscript Latin letters in the Spacing Modifier Letters block starting at 02B0.
1D2C MODIFIER LETTER CAPITAL A
= latin superscript capital letter a
# <super> 0041
1D2D MODIFIER LETTER CAPITAL AE
= latin superscript capital letter ae
# <super> 00C6
1D2E MODIFIER LETTER CAPITAL B
= latin superscript capital letter b
# <super> 0042
1D2F MODIFIER LETTER CAPITAL BARRED B
= latin superscript capital letter barred b
1D30 MODIFIER LETTER CAPITAL D
= latin superscript capital letter d
# <super> 0044
1D31 MODIFIER LETTER CAPITAL E
= latin superscript capital letter e
# <super> 0045
1D32 MODIFIER LETTER CAPITAL REVERSED E
= latin superscript capital letter reversed e
# <super> 018E
1D33 MODIFIER LETTER CAPITAL G
= latin superscript capital letter g
# <super> 0047
1D34 MODIFIER LETTER CAPITAL H
= latin superscript capital letter h
# <super> 0048
1D35 MODIFIER LETTER CAPITAL I
= latin superscript capital letter i
# <super> 0049
1D36 MODIFIER LETTER CAPITAL J
= latin superscript capital letter j
# <super> 004A
1D37 MODIFIER LETTER CAPITAL K
= latin superscript capital letter k
# <super> 004B
1D38 MODIFIER LETTER CAPITAL L
= latin superscript capital letter l
# <super> 004C
1D39 MODIFIER LETTER CAPITAL M
= latin superscript capital letter m
# <super> 004D
1D3A MODIFIER LETTER CAPITAL N
= latin superscript capital letter n
# <super> 004E
1D3B MODIFIER LETTER CAPITAL REVERSED N
= latin superscript capital letter reversed n
1D3C MODIFIER LETTER CAPITAL O
= latin superscript capital letter o
# <super> 004F
1D3D MODIFIER LETTER CAPITAL OU
= latin superscript capital letter ou
# <super> 0222
1D3E MODIFIER LETTER CAPITAL P
= latin superscript capital letter p
# <super> 0050
1D3F MODIFIER LETTER CAPITAL R
= latin superscript capital letter r
# <super> 0052
1D40 MODIFIER LETTER CAPITAL T
= latin superscript capital letter t
# <super> 0054
1D41 MODIFIER LETTER CAPITAL U
= latin superscript capital letter u
# <super> 0055
1D42 MODIFIER LETTER CAPITAL W
= latin superscript capital letter w
# <super> 0057
1D43 MODIFIER LETTER SMALL A
= latin superscript small letter a
# <super> 0061
1D44 MODIFIER LETTER SMALL TURNED A
= latin superscript small letter turned a
# <super> 0250
1D45 MODIFIER LETTER SMALL ALPHA
= latin superscript small letter alpha
# <super> 0251
1D46 MODIFIER LETTER SMALL TURNED AE
= latin superscript small letter turned ae
# <super> 1D02
1D47 MODIFIER LETTER SMALL B
= latin superscript small letter b
# <super> 0062
1D48 MODIFIER LETTER SMALL D
= latin superscript small letter d
# <super> 0064
1D49 MODIFIER LETTER SMALL E
= latin superscript small letter e
# <super> 0065
1D4A MODIFIER LETTER SMALL SCHWA
= latin superscript small letter schwa
# <super> 0259
1D4B MODIFIER LETTER SMALL OPEN E
= latin superscript small letter open e
# <super> 025B
1D4C MODIFIER LETTER SMALL TURNED OPEN E
= latin superscript small letter turned open e
* more appropriate equivalence would be to 1D08
# <super> 025C
1D4D MODIFIER LETTER SMALL G
= latin superscript small letter g
# <super> 0067
1D4E MODIFIER LETTER SMALL TURNED I
= latin superscript small letter turned i
1D4F MODIFIER LETTER SMALL K
= latin superscript small letter k
# <super> 006B
1D50 MODIFIER LETTER SMALL M
= latin superscript small letter m
# <super> 006D
1D51 MODIFIER LETTER SMALL ENG
= latin superscript small letter eng
# <super> 014B
1D52 MODIFIER LETTER SMALL O
= latin superscript small letter o
# <super> 006F
1D53 MODIFIER LETTER SMALL OPEN O
= latin superscript small letter open o
# <super> 0254
1D54 MODIFIER LETTER SMALL TOP HALF O
= latin superscript small letter top half o
# <super> 1D16
1D55 MODIFIER LETTER SMALL BOTTOM HALF O
= latin superscript small letter bottom half o
# <super> 1D17
1D56 MODIFIER LETTER SMALL P
= latin superscript small letter p
# <super> 0070
1D57 MODIFIER LETTER SMALL T
= latin superscript small letter t
# <super> 0074
1D58 MODIFIER LETTER SMALL U
= latin superscript small letter u
# <super> 0075
1D59 MODIFIER LETTER SMALL SIDEWAYS U
= latin superscript small letter sideways u
# <super> 1D1D
1D5A MODIFIER LETTER SMALL TURNED M
= latin superscript small letter turned m
# <super> 026F
1D5B MODIFIER LETTER SMALL V
= latin superscript small letter v
# <super> 0076
1D5C MODIFIER LETTER SMALL AIN // (a misnomer also as it should be MODIFIER LETTER AIN; cf. 1D25 LATIN LETTER AIN, A724 LATIN CAPITAL LETTER EGYPTOLOGICAL AIN, A725 LATIN SMALL LETTER EGYPTOLOGICAL AIN)
= latin superscript letter ain
# <super> 1D25
@ Greek superscript modifier letters
1D5D MODIFIER LETTER SMALL BETA
= greek superscript small letter beta
# <super> 03B2
1D5E MODIFIER LETTER SMALL GREEK GAMMA
= greek superscript small letter gamma
# <super> 03B3
1D5F MODIFIER LETTER SMALL DELTA // (a misnomer also as it should be MODIFIER LETTER SMALL GREEK DELTA, cf. 1E9F LATIN SMALL LETTER DELTA)
= greek superscript small letter delta
# <super> 03B4
1D60 MODIFIER LETTER SMALL GREEK PHI
= greek superscript small letter phi
# <super> 03C6
1D61 MODIFIER LETTER SMALL CHI
= greek superscript small letter chi
# <super> 03C7
@ Latin subscript modifier letters
1D62 LATIN SUBSCRIPT SMALL LETTER I
# <sub> 0069
1D63 LATIN SUBSCRIPT SMALL LETTER R
# <sub> 0072
1D64 LATIN SUBSCRIPT SMALL LETTER U
# <sub> 0075
1D65 LATIN SUBSCRIPT SMALL LETTER V
# <sub> 0076
@ Greek subscript modifier letters
1D66 GREEK SUBSCRIPT SMALL LETTER BETA
# <sub> 03B2
1D67 GREEK SUBSCRIPT SMALL LETTER GAMMA
# <sub> 03B3
1D68 GREEK SUBSCRIPT SMALL LETTER RHO
# <sub> 03C1
1D69 GREEK SUBSCRIPT SMALL LETTER PHI
# <sub> 03C6
1D6A GREEK SUBSCRIPT SMALL LETTER CHI
# <sub> 03C7
[...]
@ Modifier letters
@+ Other modifier letters can be found in the Spacing Modifier Letters, Phonetic Extensions, as well as Superscripts and Subscripts blocks.
1D9B MODIFIER LETTER SMALL TURNED ALPHA
= latin superscript small letter turned alpha
# <super> 0252
1D9C MODIFIER LETTER SMALL C
= latin superscript small letter c
# <super> 0063
1D9D MODIFIER LETTER SMALL C WITH CURL
= latin superscript small letter c with curl
# <super> 0255
1D9E MODIFIER LETTER SMALL ETH
= latin superscript small letter eth
# <super> 00F0
1D9F MODIFIER LETTER SMALL REVERSED OPEN E
= latin superscript small letter reversed open e
# <super> 025C
1DA0 MODIFIER LETTER SMALL F
= latin superscript small letter f
# <super> 0066
1DA1 MODIFIER LETTER SMALL DOTLESS J WITH STROKE
= latin superscript small letter dotless j with stroke
# <super> 025F
1DA2 MODIFIER LETTER SMALL SCRIPT G
= latin superscript small letter script g
# <super> 0261
1DA3 MODIFIER LETTER SMALL TURNED H
= latin superscript small letter turned h
# <super> 0265
1DA4 MODIFIER LETTER SMALL I WITH STROKE
= latin superscript small letter i with stroke
# <super> 0268
1DA5 MODIFIER LETTER SMALL IOTA
= latin superscript small letter iota
# <super> 0269
1DA6 MODIFIER LETTER SMALL CAPITAL I
= latin letter small capital i
* not for use in UPA
x (modifier letter capital i - 1D35)
# <super> 026A
1DA7 MODIFIER LETTER SMALL CAPITAL I WITH STROKE
= latin letter small capital i with stroke
# <super> 1D7B
1DA8 MODIFIER LETTER SMALL J WITH CROSSED-TAIL
= latin superscript small letter j with crossed-tail
# <super> 029D
1DA9 MODIFIER LETTER SMALL L WITH RETROFLEX HOOK
= latin superscript small letter l with retroflex hook
# <super> 026D
1DAA MODIFIER LETTER SMALL L WITH PALATAL HOOK
= latin superscript small letter l with palatal hook
# <super> 1D85
1DAB MODIFIER LETTER SMALL CAPITAL L
= latin letter small capital l
* not for use in UPA
x (modifier letter capital l - 1D38)
# <super> 029F
1DAC MODIFIER LETTER SMALL M WITH HOOK
= latin superscript small letter m with hook
# <super> 0271
1DAD MODIFIER LETTER SMALL TURNED M WITH LONG LEG
= latin superscript small letter turned m with long leg
# <super> 0270
1DAE MODIFIER LETTER SMALL N WITH LEFT HOOK
= latin superscript small letter n with left hook
# <super> 0272
1DAF MODIFIER LETTER SMALL N WITH RETROFLEX HOOK
= latin superscript small letter n with retroflex hook
# <super> 0273
1DB0 MODIFIER LETTER SMALL CAPITAL N
= latin letter small capital n
* not for use in UPA
x (modifier letter capital n - 1D3A)
# <super> 0274
1DB1 MODIFIER LETTER SMALL BARRED O
= latin superscript small letter barred o
# <super> 0275
1DB2 MODIFIER LETTER SMALL PHI
= latin superscript small letter phi
# <super> 0278
1DB3 MODIFIER LETTER SMALL S WITH HOOK
= latin superscript small letter s with hook
# <super> 0282
1DB4 MODIFIER LETTER SMALL ESH
= latin superscript small letter esh
# <super> 0283
1DB5 MODIFIER LETTER SMALL T WITH PALATAL HOOK
= latin superscript small letter t with palatal hook
# <super> 01AB
1DB6 MODIFIER LETTER SMALL U BAR
= latin superscript small letter u bar
# <super> 0289
1DB7 MODIFIER LETTER SMALL UPSILON
= latin superscript small letter upsilon
# <super> 028A
1DB8 MODIFIER LETTER SMALL CAPITAL U
= latin letter small capital u
* not for use in UPA
x (modifier letter capital u - 1D41)
# <super> 1D1C
1DB9 MODIFIER LETTER SMALL V WITH HOOK
= latin superscript small letter v with hook
# <super> 028B
1DBA MODIFIER LETTER SMALL TURNED V
= latin superscript small letter turned v
# <super> 028C
1DBB MODIFIER LETTER SMALL Z
= latin superscript small letter z
# <super> 007A
1DBC MODIFIER LETTER SMALL Z WITH RETROFLEX HOOK
= latin superscript small letter z with retroflex hook
# <super> 0290
1DBD MODIFIER LETTER SMALL Z WITH CURL
= latin superscript small letter z with curl
# <super> 0291
1DBE MODIFIER LETTER SMALL EZH
= latin superscript small letter ezh
# <super> 0292
1DBF MODIFIER LETTER SMALL THETA
= latin superscript small letter theta
# <super> 03B8
[...]
@ Additions for Extended IPA
A7F8 MODIFIER LETTER CAPITAL H WITH STROKE
= latin superscript capital letter h with stroke
* faucalized
# <super> 0126
A7F9 MODIFIER LETTER SMALL LIGATURE OE
= latin superscript small ligature oe
* labialized: open-rounded
# <super> 0153
[...]
@ Modifier letters for German dialectology
AB5B MODIFIER BREVE WITH INVERTED BREVE
x (breve - 02D8)
x (close up - 2050)
x (metrical breve - 23D1)
AB5C MODIFIER LETTER SMALL HENG
= latin superscript small letter heng
# <super> A727
AB5D MODIFIER LETTER SMALL L WITH INVERTED LAZY S
= latin superscript small letter l with inverted lazy s
# <super> AB37
AB5E MODIFIER LETTER SMALL L WITH MIDDLE TILDE
= latin superscript small letter l with middle tilde
# <super> 026B
AB5F MODIFIER LETTER SMALL U WITH LEFT HOOK
= latin superscript small letter u with left hook
# <super> AB52
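
The renaming rule stated in the @+ comment of the Spacing Modifier Letters block above, together with the pattern used for the capital letters, can be sketched mechanically. This is only an illustration: the function name `proposed_alias` and the prefix table are assumptions, and the sketch deliberately does not reproduce the hand-adjusted entries in the list (the Greek letters, AIN, REVERSED GLOTTAL STOP), which would need their own rules.

```python
import unicodedata

# Prefix rewrites, longest prefix first so that SMALL CAPITAL is matched
# before SMALL. The right-hand sides follow the aliases in the list above.
PREFIX_RULES = [
    ("MODIFIER LETTER SMALL CAPITAL ", "latin letter small capital "),
    ("MODIFIER LETTER SMALL ", "latin superscript small letter "),
    ("MODIFIER LETTER CAPITAL ", "latin superscript capital letter "),
]


def proposed_alias(cp):
    """Return the mechanically derived alias for a code point, or None
    if its character name does not start with one of the known prefixes."""
    name = unicodedata.name(chr(cp), "")
    for old, new in PREFIX_RULES:
        if name.startswith(old):
            return new + name[len(old):].lower()
    return None


for cp in (0x02B0, 0x1D2C, 0x1DA6, 0x1DB5):
    print(f"{cp:04X}", unicodedata.name(chr(cp)))
    print("=", proposed_alias(cp))
    print("#", unicodedata.decomposition(chr(cp)))
```

Generating the aliases this way and comparing them with the hand-written list is one way to catch slips such as a doubled or dropped word in an alias line.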


From richard.wordingham at ntlworld.com  Tue Jan 17 17:04:05 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 17 Jan 2017 23:04:05 +0000
Subject: Misspelling or Miscoding?
Message-ID: <20170117230405.44f7601e@JRWUBU2>

When someone enters text with the code points in the wrong order but
the text renders to give the appearance that should have been intended,
has the typist misspelt, miscoded or what?  I am talking about
sequences that are *not* even compatibility equivalent to what should
have been entered.
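
For reference, the distinction drawn above between canonically equivalent, compatibility equivalent, and unrelated sequences can be tested mechanically. The sketch below uses a hypothetical helper `equivalence`; the Thai sample at the end is an illustration chosen here and is not necessarily the case the question has in mind. Whether a given renderer draws two unrelated sequences alike is outside what this check can tell.

```python
import unicodedata


def equivalence(a, b):
    """Classify two strings as 'identical', 'canonical', 'compatibility' or 'none'."""
    if a == b:
        return "identical"
    if unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b):
        return "canonical"
    if unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b):
        return "compatibility"
    return "none"


# Precomposed e-acute versus e followed by a combining acute accent.
print(equivalence("\u00E9", "e\u0301"))
# The fi ligature versus the two letters f and i.
print(equivalence("\uFB01", "fi"))
# Thai KO KAI with SARA I and MAI EK typed in two different orders.
print(equivalence("\u0E01\u0E34\u0E48", "\u0E01\u0E48\u0E34"))
```
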

Richard.

From charupdate at orange.fr  Wed Jan 18 05:31:44 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 18 Jan 2017 12:31:44 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <D18E6C0E-CE1D-42F6-A3EF-E95767634FF5@crissov.de>
References: <1650945202.4739.1483875782165.JavaMail.www@wwinf1p17>
 <652345901.17960.1483909510323.JavaMail.www@wwinf1p19>
 <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
 <1978028751.17843.1484240496199.JavaMail.www@wwinf1p25>
 <1651851807.24444.1484255052680.JavaMail.www@wwinf1p25>
 <D18E6C0E-CE1D-42F6-A3EF-E95767634FF5@crissov.de>
Message-ID: <988413375.8331.1484739105259.JavaMail.www@wwinf1p07>

On Tue, 17 Jan 2017 02:02:02 +0100, Christoph Päper wrote:
> 
> Marcel Schneider <charupdate_at_orange.fr>: 
> > 
> > What typically happens with the correct use of fraction slash on a collaborative 
> > website like Wikipedia, is that the superscripts and subscripts are restored, 
> 
> JFTR, <http://en.wikipedia.org/wiki/Template:Frac> has been using the fraction 
> slash for many years, but (still) pairs it with HTML/CSS super- and subscripts. 

Thank you for drawing our attention to this. That has the potential to help in 
Unicode education, and to spread the word about the full nature of U+2044 as it 
is intended in the Standard and implemented in HarfBuzz.

Though what this template actually does is apply generic HTML super/sub scripting:

<sup>{{{1}}}</sup>&frasl;<sub>{{{2}}}</sub>

Given that this displays worse than when fractions are hard-coded with 
Unicode super/sub scripts, I've added some CSS, but only in the French template, 
which is not locked:
https://fr.wikipedia.org/wiki/Mod%C3%A8le:Fraction
(BTW when I came across it, it didn't use U+2044, so I imported the template 
from en-wiki, and then from de-wiki:
https://de.wikipedia.org/wiki/Vorlage:Bruch
), and added some notes to the template documentation.
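
For comparison, the three ways of writing a fraction discussed in this thread can be put side by side. This is a sketch only: the function names are made up for illustration, and `frac_markup` merely mirrors the `<sup>`/`&frasl;`/`<sub>` pattern quoted above rather than the actual template source.

```python
# Translation tables from ASCII digits to the preformatted Unicode digits.
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079",
)
SUBSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)
FRACTION_SLASH = "\u2044"


def frac_markup(num, den):
    """HTML super/subscripts around the fraction slash entity."""
    return f"<sup>{num}</sup>&frasl;<sub>{den}</sub>"


def frac_plain(num, den):
    """Ordinary digits around U+2044, leaving the shaping to the font."""
    return f"{num}{FRACTION_SLASH}{den}"


def frac_preformatted(num, den):
    """Unicode superscript and subscript digits around U+2044."""
    return (
        str(num).translate(SUPERSCRIPT_DIGITS)
        + FRACTION_SLASH
        + str(den).translate(SUBSCRIPT_DIGITS)
    )


for num, den in ((1, 2), (15, 16)):
    print(frac_markup(num, den), frac_plain(num, den), frac_preformatted(num, den))
```

Only the second form relies on the font and shaping engine to build the fraction; the first relies on markup, and the third on the preformatted characters being present and consistently designed in the font.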

I'll have to track the issue on the talk page of en-wiki:
https://en.wikipedia.org/wiki/Template_talk:Frac#Template-protected_edit_request_on_17_January_2017

You are welcome to add to it. Please feel free. Thanks.

Regards,
Marcel


From doug at ewellic.org  Wed Jan 18 10:44:34 2017
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 18 Jan 2017 09:44:34 -0700
Subject: Misspelling or =?UTF-8?Q?Miscoding=3F?=
Message-ID: <20170118094434.665a7a7059d7ee80bb4d670165c8327d.3f543ef6a7.wbe@email03.godaddy.com>

Richard Wordingham wrote:

> When someone enters text with the code points in the wrong order but 
> the text renders to give the appearance that should have been intended, 
> has the typist misspelt, miscoded or what? I am talking about 
> sequences that are *not* even compatibility equivalent to what should 
> have been entered. 

I'd say the person misspelled the word, or made a typographical error.
The fact that the rendering software displayed it as if it were spelled
correctly is immaterial.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From richard.wordingham at ntlworld.com  Wed Jan 18 12:49:55 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 18 Jan 2017 18:49:55 +0000
Subject: Misspelling or Miscoding?
In-Reply-To: <20170118094434.665a7a7059d7ee80bb4d670165c8327d.3f543ef6a7.wbe@email03.godaddy.com>
References: <20170118094434.665a7a7059d7ee80bb4d670165c8327d.3f543ef6a7.wbe@email03.godaddy.com>
Message-ID: <20170118184955.74cffc8f@JRWUBU2>

On Wed, 18 Jan 2017 09:44:34 -0700
"Doug Ewell" <doug at ewellic.org> wrote:

> I'd say the person misspelled the word, or made a typographical error.
> The fact that the rendering software displayed it as if it were
> spelled correctly is immaterial.

If someone made such a mistake with typical English, I could accuse
them of not reading what they typed.  That line of defence is not
available.  One of the purposes of the dotted circles introduced in
complex script processing is to warn the writer that such an error has
been made.  However, there are cases where homographic anagrams
indicate different pronunciations.

I think it is not a 'typographical error' if it renders as it should!

Richard.

From doug at ewellic.org  Wed Jan 18 14:35:55 2017
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 18 Jan 2017 13:35:55 -0700
Subject: Misspelling or =?UTF-8?Q?Miscoding=3F?=
Message-ID: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>

Richard Wordingham wrote:

> I think it is not a 'typographical error' if it renders as it should!

What if it renders correctly on some systems but not on others?

I do see your point, though. Writing systems that permit different
spellings of the same glyph (cluster), only one of which is 'correct'
even after normalization, can be tricky like this. I think this would
still be a matter of 'misspelling' rather than 'miscoding' because a
typist should not have to be concerned with character codes per se.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From richard.wordingham at ntlworld.com  Wed Jan 18 19:12:50 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 19 Jan 2017 01:12:50 +0000
Subject: Misspelling or Miscoding?
In-Reply-To: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
Message-ID: <20170119011250.01bd6a96@JRWUBU2>

On Wed, 18 Jan 2017 13:35:55 -0700
"Doug Ewell" <doug at ewellic.org> wrote:

> Richard Wordingham wrote:
> 
> > I think it is not a 'typographical error' if it renders as it
> > should!  
> 
> What if it renders correctly on some systems but not on others?

> I do see your point, though. Writing systems that permit different
> spellings of the same glyph (cluster), only one of which is 'correct'
> even after normalization, can be tricky like this. I think this would
> still be a matter of 'misspelling' rather than 'miscoding' because a
> typist should not have to be concerned with character codes per se.

As you've put it, it sounds like the way things were with a simple Thai
typewriter.  A vowel below, a vowel above and a tone mark could be
typed in any order, as though they had three different non-zero
combining classes.  Thais were trained to type into computers by input
routines only accepting the marks in the correct order - this was
before the days of canonical combining classes.
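
For comparison, the combining classes that Unicode later assigned to Thai
marks can be inspected with Python's standard unicodedata module.  This is
only a sketch: the base letter KO KAI and the three marks are picked
arbitrarily for illustration, and the variable names are made up.

```python
import unicodedata

KO_KAI = "\u0e01"   # base consonant, chosen arbitrarily for the example
SARA_I = "\u0e34"   # a vowel above
SARA_U = "\u0e38"   # a vowel below
MAI_EK = "\u0e48"   # a tone mark

# Canonical combining class of each mark, keyed by character name.
classes = {unicodedata.name(c): unicodedata.combining(c)
           for c in (SARA_I, SARA_U, MAI_EK)}

# Tone mark typed before the vowel below: both classes are non-zero,
# so normalisation puts them into one canonical order.
reordered = unicodedata.normalize("NFC", KO_KAI + MAI_EK + SARA_U)

# Tone mark typed before the vowel above: the vowel has class 0, so
# normalisation leaves the typed order alone.
untouched = unicodedata.normalize("NFC", KO_KAI + MAI_EK + SARA_I)

for name, value in classes.items():
    print(f"{value:3d}  {name}")
```

Only the vowel below and the tone mark have non-zero classes, so only
their relative order is fixed by normalisation.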

In the case of greatest concern to me, there can be two different
orders, but only one is appropriate for a given word.  In most cases,
only one word of that appearance exists, and one can usually guess which
one does exist. (That is why the system works despite the occasional
ambiguity.)  It's not unlike how Thai would work had phonetic order
been successfully insisted upon, except that there is no evidence that
sorting should be by appearance, whereas in Thai as it was encoded
before Unicode (and is now, after normalisation), encoding and sorting
are based purely on appearance.  (Well, officially - in practice, Thais
appear to sort by doing syllable-by-syllable comparisons.)

In this case of concern, the range of renderings is occasionally
different, which is another reason that two different encodings for the
same appearance must be tolerated.

Richard.

From asmusf at ix.netcom.com  Thu Jan 19 01:24:21 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 18 Jan 2017 23:24:21 -0800
Subject: Misspelling or Miscoding?
In-Reply-To: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
Message-ID: <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>

On 1/18/2017 12:35 PM, Doug Ewell wrote:
> Richard Wordingham wrote:
>
>> I think it is not a 'typographical error' if it renders as it should!
> What if it renders correctly on some systems but not on others?
>
> I do see your point, though. Writing systems that permit different
> spellings of the same glyph (cluster), only one of which is 'correct'
> even after normalization, can be tricky like this. I think this would
> still be a matter of 'misspelling' rather than 'miscoding' because a
> typist should not have to be concerned with character codes per se.
>   
The sequence of character codes isn't necessarily determined by the 
typist's choice of keystrokes.

For example, autocorrection and similar support can result in a 
substitution of character codes. For scripts with this issue, it would 
be useful if such mechanisms were more widespread; effectively 
normalizing to a preferred input order.

Arguing over whether this is called mistyping or miscoding or 
misspelling is perhaps less helpful than trying to get the word out that 
some scripts could strongly benefit from that additional software layer.

A./


From gwalla at gmail.com  Thu Jan 19 01:52:12 2017
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 18 Jan 2017 23:52:12 -0800
Subject: Misspelling or Miscoding?
In-Reply-To: <20170117230405.44f7601e@JRWUBU2>
References: <20170117230405.44f7601e@JRWUBU2>
Message-ID: <CA+p4_H0+SUP9nFhhEQHLS+=u7qf_6n3DgiVSMv-1Ra33y2mZww@mail.gmail.com>

On Tue, Jan 17, 2017 at 3:04 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> When someone enters text with the code points in the wrong order but
> the text renders to give the appearance that should have been intended,
> has the typist misspelt, miscoded or what?  I am talking about
> sequences that are *not* even compatibility equivalent to what should
> have been entered.
>
> Richard.
>

You mean like when a font puts combining diacritics over the following base
character, and people type it in that order so it looks right on their
screen?

From mark at macchiato.com  Thu Jan 19 01:52:05 2017
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 19 Jan 2017 08:52:05 +0100
Subject: Misspelling or Miscoding?
In-Reply-To: <20170119011250.01bd6a96@JRWUBU2>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <20170119011250.01bd6a96@JRWUBU2>
Message-ID: <CAJ2xs_HkUjcwUMZnyeXwe6NfShdXTG8LSLcRrXfL2dRSSW7G7g@mail.gmail.com>

We don't have any set terminology for what you're talking about.

We've often just used 'misspelling' in a broad sense, which can include
visually confusable or identical glyphs. For example, spelling 'of' with an
omicron would be one, as well as a word in a complex script with swapped
marks. And cases of the former occur surprisingly often in web pages:
probably to do with people switching keyboards in mid-stride. They are in
(say) a Greek keyboard, hit omicron and then the Greek character in the 'f'
position, notice it is wrong, and backspace - but just over the character
that 'looks' wrong - then type 'f'.

The problem with using the term "miscoding" is that it is overloaded. It
can be used as having something to do with the character encoding level:
for example, interpreting a string of UTF-8 bytes as Latin-1. The sequence
<omicron, f> is a perfectly valid Unicode string, not - in that sense -
miscoded.
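
The two senses can be contrasted in a short Python sketch.  It is an
illustration only: the helper scripts_in is hypothetical, and using the
first word of a character's Unicode name as its script is a crude
shortcut, not a real script lookup.

```python
import unicodedata

def scripts_in(word):
    """Rough script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}

lookalike = "\u03bff"   # GREEK SMALL LETTER OMICRON followed by Latin 'f'
genuine = "of"

# 'Misspelling' in the broad sense: a valid string, but with mixed scripts.
mixed = scripts_in(lookalike)
plain = scripts_in(genuine)

# 'Miscoding' in the encoding-level sense: UTF-8 bytes read as Latin-1.
mojibake = "\u03bf".encode("utf-8").decode("latin-1")

print(sorted(mixed), sorted(plain), mojibake)
```

The first case survives any round trip intact; the second is damage that
can be undone by re-encoding as Latin-1 and decoding as UTF-8.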

Mark


From richard.wordingham at ntlworld.com  Thu Jan 19 02:45:08 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 19 Jan 2017 08:45:08 +0000
Subject: Misspelling or Miscoding?
In-Reply-To: <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
Message-ID: <20170119084508.706f4774@JRWUBU2>

On Wed, 18 Jan 2017 23:24:21 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> The sequence of character codes isn't necessarily determined by the 
> typist's choice of keystrokes.

Wow!  ESP for input?

> For example, autocorrection and similar support can result in a 
> substitution of character codes. For scripts with this issue, it
> would be useful if such mechanisms were more widespread; effectively 
> normalizing to a preferred input order.

That's not the problem I have in mind.  Dotted circles can help, but
for Northern Thai in the Lanna script, USE has accidentally (I hope)
banned 17% of the vocabulary and demanded that a further 37% be
misspelt.  It will be much the same for Tai Khuen.  Once USE is
fixed, the problem is that the encodings of */hi:m/ and /mi:/ may be
different but render identically; it so happens that words like the
former are rare. Are you aware of predictive input causing havoc with
intellectual content?  

> Arguing over whether this is called mistyping or miscoding or 
> misspelling is perhaps less helpful than trying to get the word out
> that some scripts could strongly benefit from that additional
> software layer.

Enabling that may require some tools to update to Unicode 5.1.
(Hunspell, I'm looking at you.)

One thing that would be helpful is some way of showing the difference
between distinctly encoded homographs if a spell-checker can help.  (I
fear it may not be quite the right tool - different suggestion logic is
needed.) Coloured fonts may help once support for them has spread, but
we're probably still looking at bespoke tools to switch such hints on
and off.  In the past I've used transliteration fonts to check what I've
actually typed.
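
Short of a transliteration font, a plain code point dump serves the same
checking purpose.  A minimal sketch, assuming Python's standard
unicodedata module; the function name describe is made up for the
example.

```python
import unicodedata

def describe(text):
    """List each character as 'U+XXXX NAME' in the order it was typed."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
            for ch in text]

# Example: a lookalike 'of' that starts with a Greek omicron.
for line in describe("\u03bff"):
    print(line)
```

Two strings that render identically but are encoded differently will
produce different dumps.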

One problem with getting the message out is choosing the right words.
That's why I came here for advice on the terminology for such issues.

Richard.

From richard.wordingham at ntlworld.com  Thu Jan 19 03:05:29 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 19 Jan 2017 09:05:29 +0000
Subject: Misspelling or Miscoding?
In-Reply-To: <CA+p4_H0+SUP9nFhhEQHLS+=u7qf_6n3DgiVSMv-1Ra33y2mZww@mail.gmail.com>
References: <20170117230405.44f7601e@JRWUBU2>
 <CA+p4_H0+SUP9nFhhEQHLS+=u7qf_6n3DgiVSMv-1Ra33y2mZww@mail.gmail.com>
Message-ID: <20170119090529.045d2f86@JRWUBU2>

On Wed, 18 Jan 2017 23:52:12 -0800
Garth Wallace <gwalla at gmail.com> wrote:

> On Tue, Jan 17, 2017 at 3:04 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:  
> 
> > When someone enters text with the code points in the wrong order but
> > the text renders to give the appearance that should have been
> > intended, has the typist misspelt, miscoded or what?  I am talking
> > about sequences that are *not* even compatibility equivalent to
> > what should have been entered.
> >
> > Richard.
> >  
> 
> You mean like when a font puts combining diacritics over the
> following base character, and people type it in that order so it
> looks right on their screen?

No.  I particularly have in mind Tai Tham script pairs like ???? /h??m/
<HIGH HA, SIGN UUE, SAKOT, MA> (MFL p831) and ???? /m??/ <HIGH HA,
SAKOT, MA, SIGN UUE> (MFL p793), which look identical with a competent
renderer but are sorted differently.  The page references are to a
Northern Thai to Thai dictionary.
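
That the two sequences are anagrams of each other, yet not equivalent
under any normalisation form, can be checked with a short Python sketch.
The full character names are an assumption expanded from the
abbreviations above (for example "SIGN UUE" taken as TAI THAM VOWEL SIGN
UUE), and the helper seq is hypothetical.

```python
import unicodedata

def seq(*names):
    """Build a string from Tai Tham character names minus the prefix."""
    return "".join(unicodedata.lookup("TAI THAM " + n) for n in names)

first = seq("LETTER HIGH HA", "VOWEL SIGN UUE", "SIGN SAKOT", "LETTER MA")
second = seq("LETTER HIGH HA", "SIGN SAKOT", "LETTER MA", "VOWEL SIGN UUE")

# Same characters, different order.
same_characters = sorted(first) == sorted(second)

# Neither canonical nor compatibility normalisation makes them equal.
canonically_equal = (unicodedata.normalize("NFC", first)
                     == unicodedata.normalize("NFC", second))
compatibility_equal = (unicodedata.normalize("NFKC", first)
                       == unicodedata.normalize("NFKC", second))

print(same_characters, canonically_equal, compatibility_equal)
```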

Richard.


From asmusf at ix.netcom.com  Thu Jan 19 16:25:14 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 19 Jan 2017 14:25:14 -0800
Subject: Misspelling or Miscoding?
In-Reply-To: <20170119084508.706f4774@JRWUBU2>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
Message-ID: <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>

OK, I was first thinking you had something more in mind like ordering of 
(e.g. Lao?) tone marks that normally do not render exactly the same, but 
close, and where some font/rendering engine could go and make them 
identical in an effort to be helpful. In those cases one can presume a 
preferred ordering, and, in principle, that can be imposed upon a text, 
whether via autocorrect or spell check.
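
Imposing a preferred order can be sketched in a few lines of Python.
Everything here is an assumption made for illustration: the choice of
Lao, the rule "vowel sign before tone mark", and the function name; it
is not taken from any existing input method or spell checker.

```python
import re

# Lao vowel signs above or below the base (U+0EB4..U+0EB9) and the four
# tone marks (U+0EC8..U+0ECB).
VOWEL = "[\u0eb4-\u0eb9]"
TONE = "[\u0ec8-\u0ecb]"

# Hypothetical preference: vowel sign first, tone mark second.
_TONE_BEFORE_VOWEL = re.compile(f"({TONE})({VOWEL})")

def to_preferred_order(text):
    """Swap any tone mark that was typed before its vowel sign."""
    return _TONE_BEFORE_VOWEL.sub(r"\2\1", text)

typed = "\u0e81\u0ec9\u0eb5"   # KO, tone MAI THO, vowel sign II
fixed = to_preferred_order(typed)
print([f"U+{ord(c):04X}" for c in fixed])
```

Text already in the preferred order passes through unchanged, so the
step can be applied repeatedly without harm.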

Now I'm thinking your focus was more on cases like the two Khmer
subjoined consonant sequences:
U+17D2 U+178A     ្ដ         KHMER CONSONANT SIGN COENG DA
U+17D2 U+178F     ្ត         KHMER CONSONANT SIGN COENG TA
that apparently have identical appearance, even though one is a 'd' and 
the other a 't'. (That's the only example that I'm personally familiar 
with).

Unless some fonts ever make a distinction, this seems to be a case where 
"miscoding" might be an appropriate term. As far as the user is 
concerned, the issue only arises because of the encoding scheme used. (A 
hypothetical different scheme that had one of these precomposed with a
name containing something like DA OR TA would not have surfaced an
invisible distinction).

Are your examples likewise legitimate duplications, or merely cases
where one could type something else and have it look the same
(accidentally)?

The Khmer example would seem fairly resistant to automated correction if
it is a free choice. If, instead, the immediately preceding consonant
comes from two disjoint sets, for example if TA COENG TA was possible,
but not TA COENG DA, then there's scope for spell check.

In designing label generation rules for domain names, one clearly 
doesn't want two labels that cannot be distinguished other than on the 
encoding level. For Khmer, the decision was to allow both, but not 
simultaneously (by allowing only one member of each minimal pair to be 
registered, which one is decided by the order of application).
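
The "first applicant wins" behaviour can be sketched in Python as
follows.  This is a toy model under stated assumptions: the names
skeleton and Registry are hypothetical, folding COENG DA into COENG TA
is just one way to give both spellings a common key, and none of it
reflects the actual format of the label generation rules.

```python
COENG = "\u17d2"
DA, TA = "\u178a", "\u178f"

def skeleton(label):
    """Fold COENG DA into COENG TA so both spellings share one key."""
    return label.replace(COENG + DA, COENG + TA)

class Registry:
    """Toy registry: only one member of each minimal pair may be held."""

    def __init__(self):
        self._taken = {}   # skeleton -> label actually registered

    def register(self, label):
        key = skeleton(label)
        if key in self._taken:
            return False   # blocked by an earlier lookalike (or itself)
        self._taken[key] = label
        return True

registry = Registry()
with_da = "\u1780" + COENG + DA   # KA + COENG DA
with_ta = "\u1780" + COENG + TA   # KA + COENG TA
results = [registry.register(with_da), registry.register(with_ta)]
print(results)
```

Either spelling can be registered, but whichever arrives second is
refused.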

A./



From khaledhosny at eglug.org  Thu Jan 19 17:36:07 2017
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Fri, 20 Jan 2017 01:36:07 +0200
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
References: <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
 <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
Message-ID: <20170119233607.GG30817@macbook>

On Sat, Jan 14, 2017 at 02:18:01AM +0100, Marcel Schneider wrote:
> On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote:
> > 
> > LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is 
> > not released yet. 
> 
> Is the integration of HarfBuzz limited to free software? 

HarfBuzz has a fairly liberal license, so in theory it can be used
anywhere.

> And what might be the reason of the delayed integration of HarfBuzz in the 
> Windows version of LibreOffice?

Nothing specific. LibreOffice, OpenOffice.org before it, and most likely
StarOffice before them just used whatever API the platform provides to do
text layout, which is not an uncommon practice; it even seemed to be the
best practice back then. The reasons it finally switched to HarfBuzz
are outlined in:

https://bugs.documentfoundation.org/show_bug.cgi?id=89870

Regards,
Khaled

From richard.wordingham at ntlworld.com  Thu Jan 19 19:04:06 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 20 Jan 2017 01:04:06 +0000
Subject: Misspelling or Miscoding?
In-Reply-To: <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
Message-ID: <20170120010406.03929f9e@JRWUBU2>

On Thu, 19 Jan 2017 14:25:14 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> Now I'm thinking your focus was more on cases the like two Khmer 
> subjoined consonant sequences:
> U+17D2 U+178A     ្ដ         KHMER CONSONANT SIGN COENG DA
> U+17D2 U+178F     ្ត         KHMER CONSONANT SIGN COENG TA
> that apparently have identical appearance, even though one is a 'd'
> and the other a 't'. (That's the only example that I'm personally
> familiar with).

> Unless some fonts ever make a distinction, this seems to be a case
> where "miscoding" might be an appropriate term. As far as the user is 
> concerned, the issue only arises because of the encoding scheme used.
> (A hypothetical different scheme that had one of these precomposed
> with a name containing something like DA OR TA would have not
> surfaced an invisible distinction).

Such a font might be KHOM2004 mentioned by Michel Antelme in his paper
aefek.free.fr/iso_album/antelme_bis.pdf.  On p25 he makes the point
that a distinct COENG DA was still on its last legs in Cambodia in the
1920's; it's still distinct in the Khom variety of the script.  This
situation makes a good case for the Tibetan model.  We might end up
making the Khmer script a mixed system like Tai Tham by adding a
character KHMER CONSONANT SIGN ARCHAIC COENG DA.

There seem to be some Arabic script analogues, where only one or two
forms differ between a pair of letters.

This is not the situation I was interested in, but it's clearly related.

> Are your examples likewise legitimate duplications or merely the case 
> that one could type something else and have it look the same
> (accidentally).

They're mostly legitimate duplications, though some may stretch
phonological credulity.  For example, in Tai Tham, <NA, SAKOT, HIGH TA,
SIGN I> is part of a common Pali verb inflection and <NA, SIGN I, SAKOT,
HIGH TA> is a valid Northern Thai word (apparently not a Pali loan,
despite its spelling), but <MA, SAKOT, HIGH TA, SIGN I> would probably
be a miscoding of <MA, SIGN I, SAKOT, HIGH TA> (an attested final
syllable) if the language were Northern Thai.  I suppose
it's just conceivable that the former might be the name of a fruit, but
I'm not aware of the syllabic nasal being written that way.

A spell checker would pick up most such errors, though getting the
underlying problem explained to the user might be difficult.

> The Khmer example would seem fairly resistant to automated correction
> if it is a free choice. If, instead, the immediately preceding
> consonant comes from two disjoined sets, for example if TA COENG TA
> was possible, but not TA COENG DA, then there's scope for spell check.

It's supposed to be based on the phonetics, so a spell check could be
used, but not a grammar rule.  However, I can imagine someone writing
in accordance with a rule restricting them to certain bases.

Richard.


From asmusf at ix.netcom.com  Thu Jan 19 20:41:07 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 19 Jan 2017 18:41:07 -0800
Subject: Misspelling or Miscoding?
In-Reply-To: <20170120010406.03929f9e@JRWUBU2>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
Message-ID: <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>

On 1/19/2017 5:04 PM, Richard Wordingham wrote:
> On Thu, 19 Jan 2017 14:25:14 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
>> Now I'm thinking your focus was more on cases the like two Khmer
>> subjoined consonant sequences:
>> U+17D2 U+178A     ្ដ         KHMER CONSONANT SIGN COENG DA
>> U+17D2 U+178F     ្ត         KHMER CONSONANT SIGN COENG TA
>> that apparently have identical appearance, even though one is a 'd'
>> and the other a 't'. (That's the only example that I'm personally
>> familiar with).
>> Unless some fonts ever make a distinction, this seems to be a case
>> where "miscoding" might be an appropriate term. As far as the user is
>> concerned, the issue only arises because of the encoding scheme used.
>> (A hypothetical different scheme that had one of these precomposed
>> with a name containing something like DA OR TA would have not
>> surfaced an invisible distinction).
> Such a font might be KHOM2004 mentioned by Michel Antelme in his paper
> aefek.free.fr/iso_album/antelme_bis.pdf.  On p25 he makes the point
> that a distinct COENG DA was still on its last legs in Cambodia in the
> 1920's; it's still distinct in the Khom variety of the script.  This
> situation makes a good case for the Tibetan model.  We might end up
> making the Khmer script a mixed system like Tai Tham by adding a
> character KHMER CONSONANT SIGN ARCHAIC COENG DA.
>
> There seem to be some Arabic script analogues, where only one or two
> forms differ between a pair of letters.
Yes, and these are treated similarly to the Khmer case in label 
generation rulesets for domain names.
>
> This is not the situation I was interested in, but it's clearly related.
Funny thing is, not actually knowing Khmer, I hadn't thought of the
COENG DA as a "form of DA", but had considered the sequence its own entity.

In Latin you have two characters that look like a reversed e but have
different upper cases, so they have distinct encodings. (You could
argue that picking the wrong member of a disunified set is a miscoding,
but I think "misspelling" works fine. In another context we limit the
term "misspelling" to phono-something or typo/grapho-something
*possible* spellings, and try not to restrict them for that purpose. The
"impossible" ones are ones that we expect some font or renderer not to
support on the basis that they are not needed, and those we do restrict;
we wouldn't use the name "miscoding" for those, just "invalid" does
nicely for us in that context.)

The case where something (=member of or associated with an alphabet) is 
simply and fully identical in appearance in all contexts (and I regard 
script as a context) is fortunately quite rare in Unicode. Your examples 
may be the closest thing.
>
>> Are your examples likewise legitimate duplications or merely the case
>> that one could type something else and have it look the same
>> (accidentally).
> They're mostly legitimate duplications, though some may stretch
> phonological credulity.  For example, in Tai Tham, <NA, SAKOT, HIGH TA,
> SIGN I> is part of a common Pali verb inflection and <NA, SIGN I, SAKOT,
> HIGH TA> is a valid Northern Thai word (apparently not a Pali loan,
> despite its spelling), but <MA, SAKOT, HIGH TA, SIGN I> would probably
> be a miscoding of <MA, SIGN I, SAKOT, HIGH TA> (an attested final
> syllable) if the language were Northern Thai.  I suppose
> it's just conceivable that the former might be the name of a fruit, but
> I'm not aware of the syllabic nasal being written that way.
>
> A spell checker would pick up most such errors, though getting the
> underlying problem explained to the user might be difficult.
>
>> The Khmer example would seem fairly resistant to automated correction
>> if it is a free choice. If, instead, the immediately preceding
>> consonant comes from two disjoined sets, for example if TA COENG TA
>> was possible, but not TA COENG DA, then there's scope for spell check.
> It's supposed to be based on the phonetics, so a spell check could be
> used, but not a grammar rule.  However, I can imagine someone writing
> in accordance with a rule restricting them to certain bases.
Your last sentence reads as if you might equally well have meant "can't"
instead of "can" (?)

Having agreement in consonants or vowels across syllables or words isn't 
necessarily unheard of; spell checkers tend to go on the basis of 
existing lexical items, not necessarily purely productive rules. At 
least the ones I use for European languages have this annoying habit of 
not having a productive rule for compounds - even for languages that do 
allow arbitrary compound formation.

Anyway, digressing from your point.

A./

From charupdate at orange.fr  Fri Jan 20 00:59:39 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 20 Jan 2017 07:59:39 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use
In-Reply-To: <20170119233607.GG30817@macbook>
References: <1236918755.9879.1483965771868.JavaMail.www@wwinf1p19>
 <572373065.27540.1483997980333.JavaMail.www@wwinf1p25>
 <b7491d71-0c35-a72d-4588-d01fd20d0b55@ix.netcom.com>
 <1469162895.239.1484114453124.JavaMail.www@wwinf1h11>
 <20170111083212.476f492e@JRWUBU2>
 <93666313-a432-c3b0-44dc-6edd43c19eda@it.aoyama.ac.jp>
 <20170112063524.GF14923@macbook>
 <1634058514.13007.1484230938837.JavaMail.www@wwinf1p25>
 <20170112150141.GG14923@macbook>
 <850127614.20750.1484356681023.JavaMail.www@wwinf1p26>
 <20170119233607.GG30817@macbook>
Message-ID: <905739506.715.1484895580643.JavaMail.www@wwinf1p07>

On Fri, 20 Jan 2017 01:36:07 +0200, Khaled Hosny wrote:
> 
> On Sat, Jan 14, 2017 at 02:18:01AM +0100, Marcel Schneider wrote: 
> > On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote: 
> > > 
> > > LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is 
> > > not released yet. 
> > 
> > Is the integration of HarfBuzz limited to free software? 
> 
> HarfBuzz has a fairly liberal license, so in theory it can be used in 
> anywhere. 
> 
> > And what might be the reason of the delayed integration of HarfBuzz in the 
> > Windows version of LibreOffice? 
> 
> Nothing specific, LibreOffice and OpenOffice.org before it and most like 
> StarOffice before them just used what API the platform provides to do 
> text layout, which is not an uncommon practice, it even seemed to be the 
> best practice back in time. The reasons it finally switched to HarfBuzz 
> are outlined in: 
> 
> https://bugs.documentfoundation.org/show_bug.cgi?id=89870 

Thank you for the great job you are doing for cross-platform text layout and 
Unicode implementation!

Now we're just missing a good reason to bring to the users why Edge still doesn't 
support the Unicode fraction slash specific text rendering. Doubtlessly if it did, 
users would expect Word to do the same. Then if Word did, continuing this way would 
make it a clone of Publisher. Is that what we shall tell people when they wonder 
why the fraction slash - that may be in a prominent position on the keyboard, such as 
on Shift + AltGr/Option/0x10 + 7 - doesn't work for them when they're on Word? 
Thus we'll end up recommending to use LibreOffice throughout. IMO that's fair.
(Though we'll have to get the NNBSP displayed. Quite easy, but deliberately discarded.)

On the other hand, Microsoft's way of writing good-looking vulgar fractions seems to be 
with super/sub scripts. That could expand to use superscripts for ordinals and 
abbreviations, too. Is that what we're supposed to do? The answer has been "no!" 
But in a user-centered approach, we've to provide both and let the user choose 
what's the most appropriate for the actual task. I think that not doing so is 
to overstate the separation between text encoding and typography, that has been 
questioned anyway. [1]

Regards,
Marcel

[1] http://www.cairn.info/article.php?ID_REVUE=DN&ID_NUMPUBLIE=DN_063&ID_ARTICLE=DN_063_0089&FRM=B


From richard.wordingham at ntlworld.com  Fri Jan 20 02:37:01 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 20 Jan 2017 08:37:01 +0000
Subject: Misspelling or Miscoding?
In-Reply-To: <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
Message-ID: <20170120083701.56d86075@JRWUBU2>

On Thu, 19 Jan 2017 18:41:07 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> On 1/19/2017 5:04 PM, Richard Wordingham wrote:
> > On Thu, 19 Jan 2017 14:25:14 -0800
> > Asmus Freytag <asmusf at ix.netcom.com> wrote:

> >> The Khmer example would seem fairly resistant to automated
> >> correction if it is a free choice. If, instead, the immediately
> >> preceding consonant comes from two disjoined sets, for example if
> >> TA COENG TA was possible, but not TA COENG DA, then there's scope
> >> for spell check.  
> > It's supposed to be based on the phonetics, so a spell check could
> > be used, but not a grammar rule.  However, I can imagine someone
> > writing in accordance with a rule restricting them to certain
> > bases.  
> Your last sentence reads as if you might equally well meant "can't" 
> instead of "can" (?)

I meant 'can'.  According to Huffman's 'Cambodian System of Writing',
initial TA is to be read as /d/ in compounds formed by infixes.  (The
spelling may have changed since then.)  Suffixed to ណ NNO (which is in
the retroflex series), the subscript is to be read as /d/, while
subscripted to ន NO, it is usually /t/ but occasionally /d/.  I would be
tempted to apply the Pali & Sanskrit rule of place agreement and
use COENG DA below ណ NNO and COENG TA below ន NO.  I would expect
similar agreement with ដ DA and ត TA.

Interestingly, such a discordance in the use of the nasals also occurs
in Northern Thai; DA (= Indic DDA) may be written subscript to NA
whereas the Indic place agreement rule would dictate NNA.  This
increases the visual ambiguity of subscripts on the ligature NAA - both
/-n da?/ <NA, SAKOT, DA, SIGN AA> and /na?t/ <NA, SIGN AA, SAKOT, DA>
occur, but there are no anagrammatic homographs in the
dictionary.  The example ?????? of /-n da?/ shows every sign of having
been borrowed via Khmer.

Richard.


From marc at keyman.com  Sat Jan 21 04:18:18 2017
From: marc at keyman.com (Marc Durdin)
Date: Sat, 21 Jan 2017 17:18:18 +0700
Subject: Misspelling or Miscoding?
In-Reply-To: <20170120083701.56d86075@JRWUBU2>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
 <20170120083701.56d86075@JRWUBU2>
Message-ID: <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>

On 20 January 2017 at 15:37, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Thu, 19 Jan 2017 18:41:07 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
> > On 1/19/2017 5:04 PM, Richard Wordingham wrote:
> > > On Thu, 19 Jan 2017 14:25:14 -0800
> > > Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
> > >> The Khmer example would seem fairly resistant to automated
> > >> correction if it is a free choice. If, instead, the immediately
> > >> preceding consonant comes from two disjoined sets, for example if
> > >> TA COENG TA was possible, but not TA COENG DA, then there's scope
> > >> for spell check.
> > > It's supposed to be based on the phonetics, so a spell check could
> > > be used, but not a grammar rule.  However, I can imagine someone
> > > writing in accordance with a rule restricting them to certain
> > > bases.
> > Your last sentence reads as if you might equally well meant "can't"
> > instead of "can" (?)
>
> I meant 'can'.  According to Huffman's 'Cambodian System of Writing',
> initial TA is to be read as /d/ in compounds formed by infixes.  (The
> spelling may have changed since then.)  Suffixed to ណ NNO (which is in
> the retroflex series), the subscript is to be read as /d/, while
> subscripted to ន NO, it is usually /t/ but occasionally /d/.  I would be
> tempted to apply the Pali & Sanskrit rule of place agreement and
> use COENG DA below ណ NNO and COENG TA below ន NO.  I would expect
> similar agreement with ដ DA and ត TA.
>

Khmer spelling is inconsistent enough that attempts to leverage this kind
of rule are in my opinion of limited utility. This kind of knowledge is
better embedded in dictionaries where it is accessible to readers, than in
an encoding where it just introduces ambiguity and confusion to the average
user.

Presentation is identical in modern Khmer. From what I've observed, most
Khmer users type the subscript which is most obvious to them, that is COENG
+ TA as the major form is visually similar.

The online dictionaries I've consulted are somewhat inconsistent in their
use of COENG DA/TA (and do not normalise searches). The rule regarding
suffixing to ណ NNO seems consistent as far as I can tell, but suffixed to
other letters, the pronunciation is less consistent. In my current Khmer
language learning, my tutors have suggested that the pronunciation is
inconsistent and in some cases can be pronounced either way. Some examples
of words using COENG DA/TA:

?????? /b?ndaal/ giving rise to
?????? /b?nt?c/ a little
???? /pd?y/ husband
????? /kattaa/ agent, factor
?????? /cendaa/ thought, thinking
???????? /vi?c?ntaa/ or /vi?c?ndaa/ daydreaming
???? /staa/ arrogantly
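
Since the two subscripts are presented identically in modern Khmer, a
dictionary search could treat them as equivalent. The following is only a
minimal sketch of that idea, not anything the dictionaries mentioned above
actually do: the function name search_key is hypothetical, and the choice of
COENG TA as the folded form is an arbitrary assumption.

```python
COENG = "\u17D2"  # KHMER SIGN COENG
DA = "\u178A"     # KHMER LETTER DA
TA = "\u178F"     # KHMER LETTER TA


def search_key(word: str) -> str:
    """Fold COENG DA to COENG TA so both spellings give the same lookup key.

    Hypothetical helper: only the subscript sequence is folded; the base
    letters DA and TA stay distinct.
    """
    return word.replace(COENG + DA, COENG + TA)


# Two spellings that differ only in the subscript (KA + COENG DA/TA).
with_da = "\u1780" + COENG + DA
with_ta = "\u1780" + COENG + TA
print(search_key(with_da) == search_key(with_ta))
```

The key would be used for matching only; the stored spelling is left
untouched, so no information is lost in the dictionary itself.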

Marc
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170121/b02a3082/attachment.html>

From opokupongkyekyeku at yahoo.com  Sat Jan 21 12:56:39 2017
From: opokupongkyekyeku at yahoo.com (Kyekyeku Opoku-Pong)
Date: Sat, 21 Jan 2017 18:56:39 +0000 (UTC)
Subject: Encoding West African Adinkra sysmbols
In-Reply-To: <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
 <20170120083701.56d86075@JRWUBU2>
 <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>
Message-ID: <1028605960.1637145.1485024999338@mail.yahoo.com>

Hello,

I hope this is the right forum to seek help.

I am looking for the possibility of encoding Adinkra symbols used extensively in Ghana and West Africa. There is information on Adinkra symbols and their meanings at: http://www.adinkra.org/htmls/adinkra_index.htm

How do I go about the process of getting the symbols approved for Unicode?

Thank you,
Kyekyeku
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170121/053ef4c1/attachment.html>

From everson at evertype.com  Sun Jan 22 11:21:33 2017
From: everson at evertype.com (Michael Everson)
Date: Sun, 22 Jan 2017 17:21:33 +0000
Subject: Encoding West African Adinkra sysmbols
In-Reply-To: <1028605960.1637145.1485024999338@mail.yahoo.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
 <20170120083701.56d86075@JRWUBU2>
 <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>
 <1028605960.1637145.1485024999338@mail.yahoo.com>
Message-ID: <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com>

Are they used in plain text? How?

> On 21 Jan 2017, at 18:56, Kyekyeku Opoku-Pong <opokupongkyekyeku at yahoo.com> wrote:
> 
> Hello,
> I hope this is the right forum to seek help.
> I am looking for the possibility of encoding Adinkra symbols used extensively in Ghana and West Africa. There is information on Adinkra symbols and their meanings at: http://www.adinkra.org/htmls/adinkra_index.htm
> 
> How do I go about the process of getting the symbols approved for Unicode.
> 
> Thank you,
> Kyekyeku


From verdy_p at wanadoo.fr  Sun Jan 22 11:42:40 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 22 Jan 2017 18:42:40 +0100
Subject: Encoding West African Adinkra sysmbols
In-Reply-To: <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
 <20170120083701.56d86075@JRWUBU2>
 <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>
 <1028605960.1637145.1485024999338@mail.yahoo.com>
 <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com>
Message-ID: <CAGa7JC0pmu60+5+qecYNPuFW85p+WH2RjZ5bA2As+GMeA3KPsQ@mail.gmail.com>

I think the book listed on the web site is a perfect example:
https://www.amazon.com/exec/obidos/ASIN/B000059TIM/adinkrasymbol-20

If the symbols have local cultural meaning, they should be encoded to allow
interchange in encoded texts as well, rather than appearing only in artistic
creations in architecture, or in handwritten books and displays.

That information site is interesting as it is currently collecting the
usages (including photos) and meanings.

2017-01-22 18:21 GMT+01:00 Michael Everson <everson at evertype.com>:

> Are they used in plain text? How?
>
> > On 21 Jan 2017, at 18:56, Kyekyeku Opoku-Pong <
> opokupongkyekyeku at yahoo.com> wrote:
> >
> > Hello,
> > I hope this is the right forum to seek help.
> > I am looking for the possibility of encoding Adinkra symbols used
> extensively in Ghana and West Africa. There is information on Adinkra
> symbols and their meanings at: http://www.adinkra.org/htmls/
> adinkra_index.htm
> >
> > How do I go about the process of getting the symbols approved for
> Unicode.
> >
> > Thank you,
> > Kyekyeku
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170122/af018e90/attachment.html>

From haberg-1 at telia.com  Sun Jan 22 13:47:11 2017
From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=)
Date: Sun, 22 Jan 2017 20:47:11 +0100
Subject: Encoding West African Adinkra sysmbols
In-Reply-To: <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
 <20170120083701.56d86075@JRWUBU2>
 <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>
 <1028605960.1637145.1485024999338@mail.yahoo.com>
 <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com>
Message-ID: <84A82C14-238E-494D-B2A3-0D9CC49A5927@telia.com>


> On 22 Jan 2017, at 18:21, Michael Everson <everson at evertype.com> wrote:
> 
> Are they used in plain text? How?

On textiles and walls in a similar fashion as emoji, it seems [1]. Known since the beginning of the 19th century.

1. https://en.wikipedia.org/wiki/Adinkra_symbols


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170122/8cec28af/attachment.html>

From verdy_p at wanadoo.fr  Sun Jan 22 15:31:50 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 22 Jan 2017 22:31:50 +0100
Subject: Encoding West African Adinkra sysmbols
In-Reply-To: <84A82C14-238E-494D-B2A3-0D9CC49A5927@telia.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
 <20170120083701.56d86075@JRWUBU2>
 <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>
 <1028605960.1637145.1485024999338@mail.yahoo.com>
 <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com>
 <84A82C14-238E-494D-B2A3-0D9CC49A5927@telia.com>
Message-ID: <CAGa7JC0Fto6ZMd-7iEokX=n8kLNCPXsCetT5q0rpny5ixRBWhg@mail.gmail.com>

I read that there are similar sets of symbols in Polynesian/Melanesian
cultures. There are possibly others in native Amerindian cultures, often
related to religious features, nature.

These symbols look in fact very similar to the initial creation of the
modern alphabets we all know, just a step behind ideograms such as those used
in Mayan, Han, Egyptian and proto-Indo-European scripts, or runes in Europe,
or today's very active creation of emojis and lots of icons and logograms
created everywhere, by the industry and by various standards bodies: they
encode more than just a letter or identifiable word, but instead a
concept/idea which could be "spelled" orally by various sentences in modern
languages. Their properties would be complex to design due to their complex
meaning/associations and usage rules.


2017-01-22 20:47 GMT+01:00 Hans Åberg <haberg-1 at telia.com>:

>
> On 22 Jan 2017, at 18:21, Michael Everson <everson at evertype.com> wrote:
>
> Are they used in plain text? How?
>
>
> On textiles and walls in a similar fashion as emoji, it seems [1]. Known
> since the beginning of the 19th century.
>
> 1. https://en.wikipedia.org/wiki/Adinkra_symbols
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170122/3f118dee/attachment.html>

From duerst at it.aoyama.ac.jp  Sun Jan 22 20:37:32 2017
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Mon, 23 Jan 2017 11:37:32 +0900
Subject: Encoding West African Adinkra sysmbols
In-Reply-To: <1028605960.1637145.1485024999338@mail.yahoo.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
 <20170120083701.56d86075@JRWUBU2>
 <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>
 <1028605960.1637145.1485024999338@mail.yahoo.com>
Message-ID: <f8de20c7-42ed-e833-cec0-b35c9d666221@it.aoyama.ac.jp>

On 2017/01/22 03:56, Kyekyeku Opoku-Pong wrote:
> Hello,I hope this is the right forum to seek help.I am looking for the possibility of encoding Adinkra symbols used extensively in Ghana and West Africa. There is information on Adinkra symbols and their meanings at: http://www.adinkra.org/htmls/adinkra_index.htm
> How do I go about the process of getting the symbols approved for Unicode.
> Thank you,Kyekyeku

The two main things would be:

1) Write a proposal
2) Document usage (this is part of the proposal, but it is important to 
show that these symbols are actually used in running texts, not e.g. 
just as decorations on walls,...)

Regards,   Martin.

From opokupongkyekyeku at yahoo.com  Sun Jan 22 20:44:06 2017
From: opokupongkyekyeku at yahoo.com (Kyekyeku Opoku-Pong)
Date: Mon, 23 Jan 2017 02:44:06 +0000 (UTC)
Subject: Encoding West African Adinkra sysmbols
In-Reply-To: <CAGa7JC0Fto6ZMd-7iEokX=n8kLNCPXsCetT5q0rpny5ixRBWhg@mail.gmail.com>
References: <20170118133555.665a7a7059d7ee80bb4d670165c8327d.23bcccc767.wbe@email03.godaddy.com>
 <1775052c-27eb-43d9-da26-c8715ab83720@ix.netcom.com>
 <20170119084508.706f4774@JRWUBU2>
 <daa4e939-db69-a8fc-0eb4-40506cb151b6@ix.netcom.com>
 <20170120010406.03929f9e@JRWUBU2>
 <bae68a98-2d32-3192-5661-3d3450f7637f@ix.netcom.com>
 <20170120083701.56d86075@JRWUBU2>
 <CAGK-wAJ4K0Oy4q25RbtQaYoBhyt6-Sq=X5-wrfc6e9KdnTmW5Q@mail.gmail.com>
 <1028605960.1637145.1485024999338@mail.yahoo.com>
 <9265FB90-FF41-4CF4-B46D-CA0E5CC1A0E8@evertype.com>
 <84A82C14-238E-494D-B2A3-0D9CC49A5927@telia.com>
 <CAGa7JC0Fto6ZMd-7iEokX=n8kLNCPXsCetT5q0rpny5ixRBWhg@mail.gmail.com>
Message-ID: <278493006.2506185.1485139446746@mail.yahoo.com>

Thank you all for the responses to my email.
Adinkra symbols are unique symbols of cultural expressions and proverbs that have been used in wax printing, royal symbols, carvings and jewelry in West Africa. They originate in the Ivory Coast and Ghana but the symbols are popular in West Africa and with Africans in the diaspora.
Some of the popular Adinkra symbols are "Sankofa" and "Gye Nyame". This link http://www.adinkra.org/htmls/adinkra_index.htm gives a good list of the symbols and their meanings.
Encoding the symbols would make it easy to use them as emojis and icons and in printing of all forms.

Thank you,
Kyekyeku
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170123/23365837/attachment.html>

From charupdate at orange.fr  Mon Jan 23 03:30:17 2017
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 23 Jan 2017 10:30:17 +0100 (CET)
Subject: Superscript and Subscript Characters in General Use / Re:
 French Superscript Abbreviations Fit Plain Text Requirements
In-Reply-To: <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
References: <FBE88121-00F0-4538-864C-4B0974F97FE0@northwestern.edu>
 <717069d1-2cdc-fe61-6c2e-b8e3f12f1dfc@cs.tut.fi>
 <20161227000522.7bb95f3e@JRWUBU2>
 <230214984.10599.1482854625210.JavaMail.www@wwinf1p10>
 <c90ec0e5-a67f-cf7e-9438-e8187eee427f@ix.netcom.com>
 <1877734343.9096.1482938748638.JavaMail.www@wwinf1p14>
 <72c27940-ca8e-f616-48d5-5ee58835b354@ix.netcom.com>
Message-ID: <1621294794.4399.1485163818350.JavaMail.www@wwinf1p26>

Happily, this thread now comes to a far better and very useful result.
A set of Unicode superscripts and subscripts proves to be already promoted by Microsoft 
in a fully validated way. From this we can go on to promote the use of a set of 
Latin superscript letters. Relatedly, Microsoft's position of not supporting the 
OpenType rendering properties of U+2044 FRACTION SLASH (at least in a Latin script 
context in Edge) turns out to be a fairly user-friendly, practice-oriented option. 
That also helps to avoid holding people's feet to the fire about U+2044.

On Wed, 28 Dec 2016 13:47:00 -0800, Asmus Freytag wrote:
[…]
> 
> Mathematical notation is a good example of such a mixed case: while 
> ordinary variables can be expressed in plain text with the help of 
> mathematical alphabets, the proper display of formulas requires markup. 
> Even Murray Sargent's plain text math is markup, albeit a very clever one 
> that re-uses conventions used for the inline presentation of mathematical 
> expression. (Where that is insufficient, it introduces additional 
> conventions, clearly extraneous to the content, and hence markup).
> 
http://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0119.html

Murray Sargent's Nearly Plain-Text Encoding of Mathematics (UnicodeMath) is in my 
opinion a key gateway to the understanding of Unicode, and thus becomes a key point 
in my communication about Unicode-supporting keyboard layouts. See version 3.1:
http://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.1.pdf

Thanks to Asmus Freytag for drawing our attention to it!

What makes this notation so important to this thread's issue is that it uses 
Unicode superscripts and subscripts as a valid and parseable alternative to the 
[La]TeX-style notation that uses markup ('^' and '_'), "since Unicode has a full set 
of decimal subscripts and superscripts. As a practical matter, numeric subscripts 
are typically entered using an underscore and the number followed by a space or 
an operator" (p. 7).

These Unicode superscript and subscript characters are parseable and are converted 
to formatted digits when the expression is built up. Hence they are unambiguous, not 
random characters as sometimes alleged. They "should be rendered the same way that 
scripts of the corresponding script nesting level would be rendered." (p. 18)

Although fractions are ordinarily written with ASCII digits and slash, U+2044 can 
be used to get skewed fractions (p. 5) built up in Microsoft Word (where fractions 
can also be formatted using the math features). Combining both schemes, the user 
may feel free to write fractions using super/sub scripts around U+2044, as suggested 
in the already cited wiki proposing to add a huge autocorrect list for quick input:
https://answers.microsoft.com/en-us/msoffice/wiki/msoffice_word-mso_other/styled-fractions-in-windows/4a07d5fa-2484-4e39-b1f3-70bb3eb0c332
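
To make the combined scheme concrete, here is a minimal sketch of how such a 
fraction string could be assembled from Unicode superscript digits, U+2044, and 
subscript digits. The function name styled_fraction is hypothetical and is not 
taken from the wiki or from UnicodeMath; the sketch assumes non-negative integers.

```python
# Superscript digits 0-9: U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079.
SUPERSCRIPT_DIGITS = "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079"
# Subscript digits 0-9: U+2080..U+2089.
SUBSCRIPT_DIGITS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
FRACTION_SLASH = "\u2044"


def styled_fraction(numerator: int, denominator: int) -> str:
    """Build a plain-text fraction from super/subscript digits around U+2044.

    Hypothetical helper; assumes both arguments are non-negative integers.
    """
    top = "".join(SUPERSCRIPT_DIGITS[int(d)] for d in str(numerator))
    bottom = "".join(SUBSCRIPT_DIGITS[int(d)] for d in str(denominator))
    return top + FRACTION_SLASH + bottom


print(styled_fraction(3, 16))
```

The result does not depend on any OpenType feature; how evenly the digits line up 
still depends on the glyphs of the font in use.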

This is practice-oriented and user-friendly because relying only on the OpenType font 
feature specified for U+2044 would dramatically restrict the number of usable fonts. 
In Latin script these traditionally number several thousand, as opposed to complex 
scripts, for which HarfBuzz is primarily intended, where the number of available 
typefaces is much smaller, so that full conversion to OpenType is feasible. So I think 
that the correct rendering of U+2044 in HarfBuzz mainly targets these complex scripts. 
In other scripts like Latin, the feature would then be a nice side benefit that 
potentially raises user expectations about professional (typographical) ligature 
rendering.

At the other end, for drafts and even "for simple documentation purposes", 
"plain-text linearly formatted mathematical expressions can be used 'as is'" (p. 29). 
That can be extended to vulgar fractions in running text, and to abbreviations.

This helps us understand that any font with inconsistent glyphs for Unicode subscript 
and superscript digits is not Unicode conformant. 
The same applies to superscript i and n (as mentioned in:
http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0093.html
). These inconsistent fonts don't conform to the Unicode Standard, which specifies 
that there is no functional difference between those characters that have the word 
SUPERSCRIPT in their name, and those that don't:

TUS 9.0, §7.8, p. 327:
| The superscript forms of the i and n letters can be found in the
| Superscripts and Subscripts block (U+2070..U+209F). The fact that the latter 
| two letters contain the word "superscript" in their names instead of "modifier 
| letter" is an historical artifact of original sources for the characters, and 
| is not intended to convey a functional distinction in the use of these 
| characters in the Unicode Standard.
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762

Moreover, the Code Charts contain annotation lines for these two characters, connecting 
them to the set of Unicode superscript Latin letters named "MODIFIER LETTER":

2071 SUPERSCRIPT LATIN SMALL LETTER I
* functions as a modifier letter
# <super> 0069
[…]
207F SUPERSCRIPT LATIN SMALL LETTER N
* functions as a modifier letter
# <super> 006E
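
The compatibility mappings quoted from the Code Charts can be checked against the 
Unicode Character Database that ships with Python. This is only an illustrative 
sketch: the helper name describe is hypothetical, and the data returned is that 
of the Unicode version bundled with the Python build in use.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str]:
    """Return the character name and its decomposition mapping from the UCD."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)


for code_point in (0x2071, 0x207F):
    name, mapping = describe(chr(code_point))
    # NFKC folds the superscript form to the ordinary letter it maps to.
    folded = unicodedata.normalize("NFKC", chr(code_point))
    print(f"U+{code_point:04X} {name}: {mapping} -> {folded}")
```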

Accordingly, the user can count on a whole small alphabet (except q, which was 
rejected on the basis of invented allegations made on behalf of the UTC) displaying in 
a consistent way in all complete, conformant fonts, with a running-text-like layout 
as far as the fonts have proportional advance widths. To run a test, see example in:
http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0093.html (again).

Trying to conclude so far (please feel free to correct), I now believe and will 
spread the word that, following Microsoft (a user-friendly corporation eager to help 
everybody make the most of Unicode), the users of any word processor and text editor 
are welcome to use the Unicode repertoire as they need and like, while on the other 
hand, the recommendations in TUS may be considered a mere official discourse for 
encoding process management purposes, with little to no real impact on 
actual practice. Hence, National Bodies and user communities as well as developers 
may issue usage recommendations of their own, to meet user expectations and propose 
working methods additionally, or alternatively, to those provided by the Standard.

Regards,
Marcel


From eric.muller at efele.net  Tue Jan 24 00:43:56 2017
From: eric.muller at efele.net (Eric Muller)
Date: Mon, 23 Jan 2017 22:43:56 -0800
Subject: how would you state requirements involving sorting?
Message-ID: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170123/46a4d064/attachment.html>

From Shawn.Steele at microsoft.com  Tue Jan 24 02:52:38 2017
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 24 Jan 2017 08:52:38 +0000
Subject: how would you state requirements involving sorting?
In-Reply-To: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>
References: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>
Message-ID: <MWHPR03MB2813CD2730D1DD2F8FF77A0F82750@MWHPR03MB2813.namprd03.prod.outlook.com>

That requirement will probably really annoy speakers of some languages.

-Shawn

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Eric Muller
Sent: Monday, January 23, 2017 10:44 PM
To: unicode at unicode.org
Subject: how would you state requirements involving sorting?

Suppose you help somebody write requirements for a piece of software and you see an item:
Sorting. Diacritic marks need to be stripped when sorting titles

You know that sorting is a lot more complicated than removing diacritics, and that giving the directive above to a naive developer is going to lead to trouble. You know you want to end up with an implementation involving the UCA with a tailoring based on the locale. How would you suggest rewording the requirement?

Thanks,
Eric.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/e3b7316a/attachment.html>

From mark at macchiato.com  Tue Jan 24 06:38:11 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 24 Jan 2017 13:38:11 +0100
Subject: how would you state requirements involving sorting?
In-Reply-To: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>
References: <3231824a-2f9b-a6a3-6e1b-23cea04e9aec@efele.net>
Message-ID: <CAJ2xs_HEEi1qdi3adM=mrRDjcW0J07BUVfp=SDfC8TZpAKmeSw@mail.gmail.com>

Perhaps suggest something along the following lines.

Sorting. Unicode-conformant collation (http://unicode.org/reports/tr10/)
must be used when sorting titles. The collation must follow the user's
locale, for example by using ICU APIs (http://site.icu-project.org/).
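
To illustrate why the original wording is risky, here is a small sketch of
the naive reading of it, using only the Python standard library. It is not
a collation implementation; the function name strip_marks and the sample
titles are made up for the illustration. A real implementation would
delegate to a UCA-based library such as ICU.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Naive reading of the requirement: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


titles = ["Zebra", "Öl", "Ol"]
naive_order = sorted(titles, key=strip_marks)

# "Öl" and "Ol" collapse to the same key, so their relative order is an
# accident of the input order. A Swedish user would instead expect "Öl"
# after "Zebra", because ö is a separate letter at the end of the alphabet.
print(naive_order)

# Letters such as ø have no canonical decomposition, so stripping marks
# leaves them unchanged and they sort by raw code point.
print(strip_marks("\u00F8") == "\u00F8")
```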

Mark

On Tue, Jan 24, 2017 at 7:43 AM, Eric Muller <eric.muller at efele.net> wrote:

> Suppose you help somebody write requirements for a piece of software and
> you see an item:
>
> Sorting. Diacritic marks need to be stripped when sorting titles
>
>
> You know that sorting is a lot more complicated than removing diacritics,
> and that giving the directive above to a naive developer is going to lead
> to trouble. You know you want to end up with an implementation involving
> the UCA with a tailoring based on the locale. How would you suggest to
> reword the requirement?
>
> Thanks,
> Eric.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/888e4a0c/attachment.html>

From andrea.giammarchi at gmail.com  Tue Jan 24 10:39:11 2017
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Tue, 24 Jan 2017 16:39:11 +0000
Subject: Curly Lips Code Point Proposal
Message-ID: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>

I'd like to bring to your attention a request, about a common emoticon,
that has apparently no equivalent yet in the Emoji standard.

This was a PR to the Twemoji project:
https://github.com/twitter/twemoji/issues/199

The author also created a proper PDF explaining all the reasons:
Proposal for CURLY LIPS Emoji.pdf
<https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>

I hope this can be considered in the near future as a possible extra face.

Thanks in advance for any sort of outcome.

Best Regards
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/984d2804/attachment.html>

From gwalla at gmail.com  Tue Jan 24 11:05:19 2017
From: gwalla at gmail.com (Garth Wallace)
Date: Tue, 24 Jan 2017 09:05:19 -0800
Subject: Curly Lips Code Point Proposal
In-Reply-To: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
References: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
Message-ID: <CA+p4_H2u84FHh41UYLmXxgyGfBDxSdxQ3Jb2W24G8Y6mu=d7Cg@mail.gmail.com>

AIUI that's a "catlike face" smiley. "Homer eating a donut" is not what I
would associate with it at all, IME it's usually used to express something
on the order of "mischievous", "playful", or "acting cute". The closest
kaomoji equivalent, I think, is (???) or (???).

On Tue, Jan 24, 2017 at 8:39 AM, Andrea Giammarchi <
andrea.giammarchi at gmail.com> wrote:

> I'd like to bring to your attention a request, about a common emoticon,
> that has apparently no equivalent yet in the Emoji standard.
>
> This was a PR to the Twemoji project:
> https://github.com/twitter/twemoji/issues/199
>
> The author also created a proper PDF explaining all the reasons:
> Proposal for CURLY LIPS Emoji.pdf
> <https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>
>
> I hope this can be considered in the near future as possible extra face.
>
> Thanks in advance for any sort of outcome.
>
> Best Regards
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/e88552a8/attachment.html>

From leoboiko at namakajiri.net  Tue Jan 24 11:12:03 2017
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Tue, 24 Jan 2017 15:12:03 -0200
Subject: Curly Lips Code Point Proposal
In-Reply-To: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
References: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
Message-ID: <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>

I find it curious that this community defines the ":3" emoji as "mmmm" or
"om nom nom".  In my circles it's quite the frequent emoticon/emoji, but
I've never seen it used this way.  Instead, they usually employ it as "cat
mouth" or "cat face", implying  the mood of cuteness, perkiness or
mischievousness. (This is distinct from U+1F431 CAT FACE in that it
represents a human making a cat-like mouth, not an actual cat.) Here are a
few images found through a web search for "cat face":


Here's the relevant TVTropes article:
http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile

(TVTropes, incidentally, is one of the many web forums which has a :3
textual emoji.)

And the KnowYourMeme page:
http://knowyourmeme.com/memes/3-cat-face


2017-01-24 14:39 GMT-02:00 Andrea Giammarchi <andrea.giammarchi at gmail.com>:

> I'd like to bring to your attention a request, about a common emoticon,
> that has apparently no equivalent yet in the Emoji standard.
>
> This was a PR to the Twemoji project:
> https://github.com/twitter/twemoji/issues/199
>
> The author also created a proper PDF explaining all the reasons:
> Proposal for CURLY LIPS Emoji.pdf
> <https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>
>
> I hope this can be considered in the near future as possible extra face.
>
> Thanks in advance for any sort of outcome.
>
> Best Regards
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/7201b23c/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: :3-0.jpg
Type: image/jpeg
Size: 29144 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/7201b23c/attachment.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: :3-1.jpg
Type: image/jpeg
Size: 15782 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/7201b23c/attachment-0001.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: :3-3.jpg
Type: image/jpeg
Size: 15171 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/7201b23c/attachment-0002.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: :3-5.jpg
Type: image/jpeg
Size: 5179 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20170124/7201b23c/attachment-0003.jpg>

From gwalla at gmail.com  Tue Jan 24 11:37:10 2017
From: gwalla at gmail.com (Garth Wallace)
Date: Tue, 24 Jan 2017 09:37:10 -0800
Subject: Curly Lips Code Point Proposal
In-Reply-To: <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>
References: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
 <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>
Message-ID: <CA+p4_H1yrjS4cRQu3z24qipgSf-AiuV1b2rCavdQ1UBzHY6pbA@mail.gmail.com>

On Tue, Jan 24, 2017 at 9:12 AM, Leonardo Boiko <leoboiko at namakajiri.net>
wrote:

> I find it curious that this community defines the ":3" emoji as "mmmm" or
> "om nom nom".  In my circles it's quite the frequent emoticon/emoji, but
> I've never seen it used this way.
>

I can kind of see how someone might get that impression. For example,
someone writing "om nom nom :3", and somebody who is unfamiliar with the
smiley's usage interpreting the meanings as linked, when the intent was
originally to express "I'm being silly".


>   Instead, they usually employ it as "cat mouth" or "cat face", implying
> the mood of cuteness, perkiness or mischievousness. (This is distinct from
> U+1F431 CAT FACE in that it represents a human making a cat-like mouth, not
> an actual cat.) Here are a few images found through a web search for "cat
> face":
>
> ?
>
>
>
I think the expression from manga and anime is probably the origin of the
smiley. That's consistent with the communities and contexts where it's
found most often. And the "catlike expression" I think is meant to be more
metaphorical, rather than a depiction of an actual facial expression. Like
how 💢 is not meant to be an actual throbbing vein in the forehead.

From andrea.giammarchi at gmail.com  Tue Jan 24 11:39:36 2017
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Tue, 24 Jan 2017 17:39:36 +0000
Subject: Curly Lips Code Point Proposal
In-Reply-To: <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>
References: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
 <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>
Message-ID: <CADA77mgJNr7JUE8nKvzv93gb6QYSPgMEqeQTLY=9mPXjN_RJHQ@mail.gmail.com>

I wouldn't stereotype "this community" already, as it's a single person's
request and maybe a single person's common use case.

However, I have seen :3 used, mostly on Twitter, to indicate
"engagement" in the sense of "interest", or "I'm digging it". If there's
a meaning already widely recognised internationally, I guess there's no
point in using the proposed name. Still, there's no code point to
represent :3, is there?

Whatever it means, do we have a code point for it already?

If we do, maybe that'd be already enough.

There are indeed already many emoji misused here and there due to different
visual meanings in different cultures (the triumph face, for example, the
one with steam from the nose, which is used as "furious face" in some
cultures).

If there's no code point, and it is apparently this popular, should Unicode
consider one?

Regards


On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko <leoboiko at namakajiri.net>
wrote:

> I find it curious that this community defines the ":3" emoji as "mmmm" or
> "om nom nom".  In my circles it's quite the frequent emoticon/emoji, but
> I've never seen it used this way.  Instead, they usually employ it as "cat
> mouth" or "cat face", implying  the mood of cuteness, perkiness or
> mischievousness. (This is distinct from U+1F431 CAT FACE in that it
> represents a human making a cat-like mouth, not an actual cat.) Here are a
> few images found through a web search for "cat face":
>
>
>
> ?
> ?
>
>
>
> ?
> ?
> Here's the relevant TVTropes article:
> http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile
>
> (TVTropes, incidentally, is one of the many web forums which has a :3
> textual emoji.)
>
> And the KnowYourMeme page:
> http://knowyourmeme.com/memes/3-cat-face
>
>
>
>
>
>
> 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi <andrea.giammarchi at gmail.com>
> :
>
>> I'd like to bring to your attention a request, about a common emoticon,
>> that has apparently no equivalent yet in the Emoji standard.
>>
>> This was a PR to the Twemoji project:
>> https://github.com/twitter/twemoji/issues/199
>>
>> The author also created a proper PDF explaining all the reasons:
>> Proposal for CURLY LIPS Emoji.pdf
>> <https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>
>>
>> I hope this can be considered in the near future as possible extra face.
>>
>> Thanks in advance for any sort of outcome.
>>
>> Best Regards
>>
>
>

From leoboiko at namakajiri.net  Tue Jan 24 22:44:20 2017
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Wed, 25 Jan 2017 02:44:20 -0200
Subject: Curly Lips Code Point Proposal
In-Reply-To: <CAEfiL0GbJiL_EHjdvBG8cxxg9Y1+9V=kNnmxTK24EnwhUCNzpw@mail.gmail.com>
References: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
 <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>
 <CADA77mgJNr7JUE8nKvzv93gb6QYSPgMEqeQTLY=9mPXjN_RJHQ@mail.gmail.com>
 <CAEfiL0GbJiL_EHjdvBG8cxxg9Y1+9V=kNnmxTK24EnwhUCNzpw@mail.gmail.com>
Message-ID: <CAJ6uix6Yap6osjfvjrphd=6eOgf9biSKfJxFC4Dh6TvmYYnqSQ@mail.gmail.com>

Undoubtedly so.  That's why U+1F481 INFORMATION DESK PERSON 💁 is listed
with the keyword "sassy" in the Unicode emoji table (besides "tipping
hand").  Which helps a lot, because the keywords are used by input methods
to search characters; if no one bothered to keep track of how people are
using emoji, then people would try looking for the "sassy" gesture and find
nothing, and they'd have to learn that it's called "information desk
person", even though no one uses it with this meaning.

Precisely because language (and symbolic systems like emoji) is in flux,
it's a good idea to try to document how it's used.
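
As a rough illustration of the keyword search described above, here is a
minimal sketch in Python. It is not how any real input method works: the
KEYWORDS table and the search() function are hypothetical, and the keywords
for the rocket entry are made up for the example.

```python
# Hypothetical keyword table: emoji character -> searchable keywords.
# Only the U+1F481 keywords come from the discussion; the rest is invented.
KEYWORDS = {
    "\U0001F481": ["information desk person", "sassy", "tipping hand"],
    "\U0001F680": ["rocket", "space"],
}


def search(query):
    """Return every emoji that has a keyword containing the query text."""
    q = query.casefold()
    return [ch for ch, words in KEYWORDS.items()
            if any(q in word for word in words)]


# Looking for the popular meaning finds the character...
print(search("sassy"))
# ...and so does part of the formal name.
print(search("desk"))
```

Without the "sassy" entry, the first lookup would return nothing, which is
the situation the keyword list is meant to avoid.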


2017-01-25 2:35 GMT-02:00 Fritz Gheen <fgheen at gmail.com>:

> "There are indeed already many emoji misused here and there..."
>
> I'd venture to say most emoji are divorced from their original intent.
> Help Desk Lady is one of the most popular emoji...and I can't recall ever
> seeing someone use it for that reason.  I personally use Rocket emoji
> mostly to mean, "I'm taking-off from home."  And then there's aubergine =)
>
> I'd like to think no emoji is "misused."  People employ emoji outside of
> their original or intended meaning, and that's beautiful: language is
> fluid; it evolves.
>
>
>
>
>
> On Wed, Jan 25, 2017 at 12:39 AM, Andrea Giammarchi <
> andrea.giammarchi at gmail.com> wrote:
>
>> I wouldn't stereotype "this community" already, as it's a single person
>> request and maybe a single person common use case.
>>
>> However, I have seen mostly on Twitter the usage of :3 to indicate
>> "engagement" in the sense of "interest", or "I'm digging it" but if there's
>> a meaning widely recognised already internationally, I guess there's no
>> point in using the proposed name, yet there's no code point to represent :3
>>
>> isn't it?
>>
>> Whatever it means, do we have a code point for it already?
>>
>> If we do, maybe that'd be already enough.
>>
>> There are indeed already many emoji misused here and there due different
>> visual meaning in different cultures (the triumph face, as example, the one
>> with steam from nose which is used as "furious face" in some culture)
>>
>> If there's no code point, being apparently this popular, should Unicode
>> consider one?
>>
>> Regards
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko <leoboiko at namakajiri.net>
>> wrote:
>>
>>> I find it curious that this community defines the ":3" emoji as "mmmm"
>>> or "om nom nom".  In my circles it's quite the frequent emoticon/emoji, but
>>> I've never seen it used this way.  Instead, they usually employ it as "cat
>>> mouth" or "cat face", implying  the mood of cuteness, perkiness or
>>> mischievousness. (This is distinct from U+1F431 CAT FACE in that it
>>> represents a human making a cat-like mouth, not an actual cat.) Here are a
>>> few images found through a web search for "cat face":
>>>
>>>
>>>
>>> ?
>>> ?
>>>
>>>
>>>
>>> ?
>>> ?
>>> Here's the relevant TVTropes article:
>>> http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile
>>>
>>> (TVTropes, incidentally, is one of the many web forums which has a :3
>>> textual emoji.)
>>>
>>> And the KnowYourMeme page:
>>> http://knowyourmeme.com/memes/3-cat-face
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi <
>>> andrea.giammarchi at gmail.com>:
>>>
>>>> I'd like to bring to your attention a request, about a common emoticon,
>>>> that has apparently no equivalent yet in the Emoji standard.
>>>>
>>>> This was a PR to the Twemoji project:
>>>> https://github.com/twitter/twemoji/issues/199
>>>>
>>>> The author also created a proper PDF explaining all the reasons:
>>>> Proposal for CURLY LIPS Emoji.pdf
>>>> <https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>
>>>>
>>>> I hope this can be considered in the near future as possible extra face.
>>>>
>>>> Thanks in advance for any sort of outcome.
>>>>
>>>> Best Regards
>>>>
>>>
>>>
>>
>

From richard.wordingham at ntlworld.com  Wed Jan 25 02:13:10 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 25 Jan 2017 08:13:10 +0000
Subject: Implications of Logical Order Exception Property
Message-ID: <20170125081310.5bfe5ce5@JRWUBU2>

What is the significance of logical_order_exception being true for a
character?

TUS 9.0 Section 4.3 appears to claim that such characters need to be
rearranged 'logically' for searching and sorting.

However, I cannot see how they need to be rearranged for searching.

Is this property a general warning, or does it mean that swapping with
the *next* character gives a less bad sorting experience?  For example,
why doesn't U+17CC KHMER SIGN ROBAT have the property?  Consonant +
ROBAT has to be rearranged to ROBAT + consonant for sorting - ROBAT is
a repha stored after rather than before the *visual* base consonant.
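
To make the "swapping with the *next* character" reading concrete, here is
a minimal sketch in Python. It assumes the rearrangement is a plain pairwise
swap, and it lists only the Thai and Lao prevowels, which is a subset of the
characters that carry the property. The function name is invented for
illustration.

```python
# Subset of Logical_Order_Exception=Yes: Thai U+0E40..U+0E44 and
# Lao U+0EC0..U+0EC4 (other scripts are left out of this sketch).
LOE_SUBSET = ({chr(cp) for cp in range(0x0E40, 0x0E45)}
              | {chr(cp) for cp in range(0x0EC0, 0x0EC5)})


def swap_with_next(text):
    """Swap each listed character with the character that follows it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in LOE_SUBSET and i + 1 < len(text):
            out.append(text[i + 1])  # following character moves in front
            out.append(ch)           # prevowel now comes second
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# <U+0E40 SARA E, U+0E01 KO KAI> becomes <U+0E01, U+0E40>.
thai = swap_with_next("\u0E40\u0E01")
# <U+1780 KHMER LETTER KA, U+17CC ROBAT> is untouched: ROBAT is not listed.
khmer = swap_with_next("\u1780\u17CC")
```

The Khmer case shows the mismatch: ROBAT would need to move in front of the
preceding consonant, which is the opposite direction to this swap.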

Richard.

From fgheen at gmail.com  Tue Jan 24 22:35:29 2017
From: fgheen at gmail.com (Fritz Gheen)
Date: Wed, 25 Jan 2017 11:35:29 +0700
Subject: Curly Lips Code Point Proposal
In-Reply-To: <CADA77mgJNr7JUE8nKvzv93gb6QYSPgMEqeQTLY=9mPXjN_RJHQ@mail.gmail.com>
References: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
 <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>
 <CADA77mgJNr7JUE8nKvzv93gb6QYSPgMEqeQTLY=9mPXjN_RJHQ@mail.gmail.com>
Message-ID: <CAEfiL0GbJiL_EHjdvBG8cxxg9Y1+9V=kNnmxTK24EnwhUCNzpw@mail.gmail.com>

"There are indeed already many emoji misused here and there..."

I'd venture to say most emoji are divorced from their original intent.
Help Desk Lady is one of the most popular emoji...and I can't recall ever
seeing someone use it for that reason.  I personally use the Rocket emoji
mostly to mean "I'm taking off from home."  And then there's aubergine =)

I'd like to think no emoji is "misused."  People employ emoji outside of
their original or intended meaning, and that's beautiful: language is
fluid; it evolves.


On Wed, Jan 25, 2017 at 12:39 AM, Andrea Giammarchi <
andrea.giammarchi at gmail.com> wrote:

> I wouldn't stereotype "this community" already, as it's a single person
> request and maybe a single person common use case.
>
> However, I have seen mostly on Twitter the usage of :3 to indicate
> "engagement" in the sense of "interest", or "I'm digging it" but if there's
> a meaning widely recognised already internationally, I guess there's no
> point in using the proposed name, yet there's no code point to represent :3
>
> isn't it?
>
> Whatever it means, do we have a code point for it already?
>
> If we do, maybe that'd be already enough.
>
> There are indeed already many emoji misused here and there due different
> visual meaning in different cultures (the triumph face, as example, the one
> with steam from nose which is used as "furious face" in some culture)
>
> If there's no code point, being apparently this popular, should Unicode
> consider one?
>
> Regards
>
>
>
>
>
>
>
> On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko <leoboiko at namakajiri.net>
> wrote:
>
>> I find it curious that this community defines the ":3" emoji as "mmmm" or
>> "om nom nom".  In my circles it's quite the frequent emoticon/emoji, but
>> I've never seen it used this way.  Instead, they usually employ it as "cat
>> mouth" or "cat face", implying  the mood of cuteness, perkiness or
>> mischievousness. (This is distinct from U+1F431 CAT FACE in that it
>> represents a human making a cat-like mouth, not an actual cat.) Here are a
>> few images found through a web search for "cat face":
>>
>>
>>
>> ?
>> ?
>>
>>
>>
>> ?
>> ?
>> Here's the relevant TVTropes article:
>> http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile
>>
>> (TVTropes, incidentally, is one of the many web forums which has a :3
>> textual emoji.)
>>
>> And the KnowYourMeme page:
>> http://knowyourmeme.com/memes/3-cat-face
>>
>>
>>
>>
>>
>>
>> 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi <andrea.giammarchi at gmail.com
>> >:
>>
>>> I'd like to bring to your attention a request, about a common emoticon,
>>> that has apparently no equivalent yet in the Emoji standard.
>>>
>>> This was a PR to the Twemoji project:
>>> https://github.com/twitter/twemoji/issues/199
>>>
>>> The author also created a proper PDF explaining all the reasons:
>>> Proposal for CURLY LIPS Emoji.pdf
>>> <https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>
>>>
>>> I hope this can be considered in the near future as possible extra face.
>>>
>>> Thanks in advance for any sort of outcome.
>>>
>>> Best Regards
>>>
>>
>>
>

From verdy_p at wanadoo.fr  Wed Jan 25 10:51:13 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 25 Jan 2017 17:51:13 +0100
Subject: Curly Lips Code Point Proposal
In-Reply-To: <CAJ6uix6Yap6osjfvjrphd=6eOgf9biSKfJxFC4Dh6TvmYYnqSQ@mail.gmail.com>
References: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
 <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>
 <CADA77mgJNr7JUE8nKvzv93gb6QYSPgMEqeQTLY=9mPXjN_RJHQ@mail.gmail.com>
 <CAEfiL0GbJiL_EHjdvBG8cxxg9Y1+9V=kNnmxTK24EnwhUCNzpw@mail.gmail.com>
 <CAJ6uix6Yap6osjfvjrphd=6eOgf9biSKfJxFC4Dh6TvmYYnqSQ@mail.gmail.com>
Message-ID: <CAGa7JC34sfZxFFu9AxcJK9teEoZGny+Xz+zP2ZK1K=pg0sKyHQ@mail.gmail.com>

I'd say that the current icon shown by Google for that character does not
mean anything to me: it looks like a ghost of some grotesque creature,
possibly dancing, unrelated to any information desk, and not even a lady
(maybe Google wanted it to be gender-neutral, as the character is not
encoded and named to mean a woman). So I seriously doubt it will be used
for its intended purpose.

2017-01-25 5:44 GMT+01:00 Leonardo Boiko <leoboiko at namakajiri.net>:

> Undoubtedly so.  That's why U+1F481 INFORMATION DESK PERSON 💁 is listed
> with the keyword "sassy" in the Unicode emoji table (besides "tipping
> hand").  Which helps a lot, because the keywords are used by input methods
> to search characters; if no one bothered to keep track of how people are
> using emoji, then people would try looking for the "sassy" gesture and find
> nothing, and they'd have to learn that it's called "information desk
> person", even though no one uses it with this meaning.
>
> Precisely because language (and symbolic systems like emoji) are in flux,
> it's a good idea trying to document how it's used.
>
>
> 2017-01-25 2:35 GMT-02:00 Fritz Gheen <fgheen at gmail.com>:
>
>> "There are indeed already many emoji misused here and there..."
>>
>> I'd venture to say most emoji are divorced from their original intent.
>> Help Desk Lady is one of the most popular emoji...and I can't recall ever
>> seeing someone use it for that reason.  I personally use Rocket emoji
>> mostly to mean, "I'm taking-off from home."  And then there's aubergine
>> =)
>>
>> I'd like to think no emoji is "misused."  People employ emoji outside of
>> their original or intended meaning, and that's beautiful: language is
>> fluid; it evolves.
>>
>>
>>
>>
>>
>> On Wed, Jan 25, 2017 at 12:39 AM, Andrea Giammarchi <
>> andrea.giammarchi at gmail.com> wrote:
>>
>>> I wouldn't stereotype "this community" already, as it's a single person
>>> request and maybe a single person common use case.
>>>
>>> However, I have seen mostly on Twitter the usage of :3 to indicate
>>> "engagement" in the sense of "interest", or "I'm digging it" but if there's
>>> a meaning widely recognised already internationally, I guess there's no
>>> point in using the proposed name, yet there's no code point to represent :3
>>>
>>> isn't it?
>>>
>>> Whatever it means, do we have a code point for it already?
>>>
>>> If we do, maybe that'd be already enough.
>>>
>>> There are indeed already many emoji misused here and there due different
>>> visual meaning in different cultures (the triumph face, as example, the one
>>> with steam from nose which is used as "furious face" in some culture)
>>>
>>> If there's no code point, being apparently this popular, should Unicode
>>> consider one?
>>>
>>> Regards
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 24, 2017 at 5:12 PM, Leonardo Boiko <leoboiko at namakajiri.net
>>> > wrote:
>>>
>>>> I find it curious that this community defines the ":3" emoji as "mmmm"
>>>> or "om nom nom".  In my circles it's quite the frequent emoticon/emoji, but
>>>> I've never seen it used this way.  Instead, they usually employ it as "cat
>>>> mouth" or "cat face", implying  the mood of cuteness, perkiness or
>>>> mischievousness. (This is distinct from U+1F431 CAT FACE in that it
>>>> represents a human making a cat-like mouth, not an actual cat.) Here are a
>>>> few images found through a web search for "cat face":
>>>>
>>>>
>>>>
>>>> ?
>>>> ?
>>>>
>>>>
>>>>
>>>> ?
>>>> ?
>>>> Here's the relevant TVTropes article:
>>>> http://tvtropes.org/pmwiki/pmwiki.php/Main/CatSmile
>>>>
>>>> (TVTropes, incidentally, is one of the many web forums which has a :3
>>>> textual emoji.)
>>>>
>>>> And the KnowYourMeme page:
>>>> http://knowyourmeme.com/memes/3-cat-face
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2017-01-24 14:39 GMT-02:00 Andrea Giammarchi <
>>>> andrea.giammarchi at gmail.com>:
>>>>
>>>>> I'd like to bring to your attention a request, about a common emoticon,
>>>>> that has apparently no equivalent yet in the Emoji standard.
>>>>>
>>>>> This was a PR to the Twemoji project:
>>>>> https://github.com/twitter/twemoji/issues/199
>>>>>
>>>>> The author also created a proper PDF explaining all the reasons:
>>>>> Proposal for CURLY LIPS Emoji.pdf
>>>>> <https://github.com/twitter/twemoji/files/727077/Proposal.for.CURLY.LIPS.Emoji.pdf>
>>>>>
>>>>> I hope this can be considered in the near future as possible extra
>>>>> face.
>>>>>
>>>>> Thanks in advance for any sort of outcome.
>>>>>
>>>>> Best Regards
>>>>>
>>>>
>>>>
>>>
>>
>

From richard.wordingham at ntlworld.com  Wed Jan 25 13:10:15 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 25 Jan 2017 19:10:15 +0000
Subject: Implications of Logical Order Exception Property
In-Reply-To: <20170125081310.5bfe5ce5@JRWUBU2>
References: <20170125081310.5bfe5ce5@JRWUBU2>
Message-ID: <20170125191015.4a1c4d59@JRWUBU2>

On Wed, 25 Jan 2017 08:13:10 +0000
Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> What is the significance of logical_order_exception being true for a
> character?
> 
> TUS 9.0 Section 4.3 appears to claim that such characters need to be
> rearranged 'logically' for searching and sorting.
> 
> However, I cannot see how they need to be rearranged for searching.
> 
> Is this property a general warning, or does it mean that swapping with
> the *next* character gives a less bad sorting experience?  For
> example, why doesn't U+17CC KHMER SIGN ROBAT have the property?
> Consonant + ROBAT has to be rearranged to ROBAT + consonant for
> sorting - ROBAT is a repha stored after rather than before the
> *visual* base consonant.

After some further research, I see that it is relevant to Revisions 9
and 11 of the Unicode Collation Algorithm.  For earlier and later
revisions, its effects were defined by the collation element table
rather than the Unicode Character Database.  The reordering for ROBAT
is the wrong way round for the property to be applicable in its
original meaning.  I can't find any other formal requirement for the
property.

I now have a clutch of errors to report on Unicode's use of the term
'logical order' and references to logical_order_exception:

1) Claims that Thai is not encoded in logical order in
  Technical Report 10 (Collation)
  UCD file IndicPositionalCategory.txt
2) Claims that logical_order_exception is relevant for searching (TUS,
as above) 

Should I make this one report or three reports?

Richard.

From markus.icu at gmail.com  Wed Jan 25 13:27:52 2017
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 25 Jan 2017 11:27:52 -0800
Subject: Implications of Logical Order Exception Property
In-Reply-To: <20170125191015.4a1c4d59@JRWUBU2>
References: <20170125081310.5bfe5ce5@JRWUBU2> <20170125191015.4a1c4d59@JRWUBU2>
Message-ID: <CAN49p6pfdJAv8O+GNojoq1r6QxMXnDim4tfwZJnsm0XqBNc61A@mail.gmail.com>

On Wed, Jan 25, 2017 at 11:10 AM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> I now have a clutch of errors to report on Unicode's use of the term
> 'logical order' and references to logical_order_exception:
>
> 1) Claims that Thai is not encoded in logical order in
>   Technical Report 10 (Collation)
>   UCD file IndicPositionalCategory.txt
> 2) Claims that logical_order_exception is relevant for searching (TUS,
> as above)
>

It informs the construction of the DUCET and could be used to
suppress_contractions in a search tailoring (see CLDR root collation data
file).

> Should I make this one report or three reports?
>

I think one report would be better.

I would wait a few days to see if there is more feedback here on the list.

markus

From richard.wordingham at ntlworld.com  Wed Jan 25 14:00:38 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 25 Jan 2017 20:00:38 +0000
Subject: Implications of Logical Order Exception Property
In-Reply-To: <CAN49p6pfdJAv8O+GNojoq1r6QxMXnDim4tfwZJnsm0XqBNc61A@mail.gmail.com>
References: <20170125081310.5bfe5ce5@JRWUBU2> <20170125191015.4a1c4d59@JRWUBU2>
 <CAN49p6pfdJAv8O+GNojoq1r6QxMXnDim4tfwZJnsm0XqBNc61A@mail.gmail.com>
Message-ID: <20170125200038.1b2746d8@JRWUBU2>

On Wed, 25 Jan 2017 11:27:52 -0800
Markus Scherer <markus.icu at gmail.com> wrote:

> On Wed, Jan 25, 2017 at 11:10 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:  
 
> > 2) Claims that logical_order_exception is relevant for searching
> > (TUS, as above)

> It informs the construction of the DUCET and could be used to
> suppress_contractions in a search tailoring (see CLDR root collation
> data file).

It is irrelevant for searching.  If one created a collation just for
searching, one wouldn't have to remove the effects of this irrelevant
property.

Richard.

From markus.icu at gmail.com  Wed Jan 25 14:35:33 2017
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 25 Jan 2017 12:35:33 -0800
Subject: Implications of Logical Order Exception Property
In-Reply-To: <20170125200038.1b2746d8@JRWUBU2>
References: <20170125081310.5bfe5ce5@JRWUBU2> <20170125191015.4a1c4d59@JRWUBU2>
 <CAN49p6pfdJAv8O+GNojoq1r6QxMXnDim4tfwZJnsm0XqBNc61A@mail.gmail.com>
 <20170125200038.1b2746d8@JRWUBU2>
Message-ID: <CAN49p6oDLPvb42SqDgG_awX0o9wMAKYJ02K37F8Mm6_Vq=Qy0w@mail.gmail.com>

On Wed, Jan 25, 2017 at 12:00 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> > > 2) Claims that logical_order_exception is relevant for searching
> > > (TUS, as above)
>
> > It informs the construction of the DUCET and could be used to
> > suppress_contractions in a search tailoring (see CLDR root collation
> > data file).
>
> It is irrelevant for searching.  If one created a collation just for
> searching, one wouldn't have to remove the effects of this irrelevant
> property.
>

It narrows match boundaries and improves performance.

markus

From richard.wordingham at ntlworld.com  Wed Jan 25 15:44:49 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 25 Jan 2017 21:44:49 +0000
Subject: Implications of Logical Order Exception Property
In-Reply-To: <CAN49p6oDLPvb42SqDgG_awX0o9wMAKYJ02K37F8Mm6_Vq=Qy0w@mail.gmail.com>
References: <20170125081310.5bfe5ce5@JRWUBU2> <20170125191015.4a1c4d59@JRWUBU2>
 <CAN49p6pfdJAv8O+GNojoq1r6QxMXnDim4tfwZJnsm0XqBNc61A@mail.gmail.com>
 <20170125200038.1b2746d8@JRWUBU2>
 <CAN49p6oDLPvb42SqDgG_awX0o9wMAKYJ02K37F8Mm6_Vq=Qy0w@mail.gmail.com>
Message-ID: <20170125214449.53d3455b@JRWUBU2>

On Wed, 25 Jan 2017 12:35:33 -0800
Markus Scherer <markus.icu at gmail.com> wrote:

> On Wed, Jan 25, 2017 at 12:00 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:  
> 
> > > > 2) Claims that logical_order_exception is relevant for searching
> > > > (TUS, as above)  
> >  
> > > It informs the construction of the DUCET and could be used to
> > > suppress_contractions in a search tailoring (see CLDR root
> > > collation data file).  
> >
> > It is irrelevant for searching.  If one created a collation just for
> > searching, one wouldn't have to remove the effects of this
> > irrelevant property.
> >  
> 
> It narrows match boundaries and improves performance.

What is 'it'?  If 'it' means 'removing the effects...', then it would be
irrelevant in a collation created just for searching.  The effects
wouldn't be there to complicate matters.  A search-only collation for
normalised, correctly spelt Thai would be simple; it would have no
contractions (unless you wanted to claim that words containing <U+0E24
THAI CHARACTER RU, U+0E45 THAI CHARACTER LAKKHANGYAO> did not contain
<U+0E24>).
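
A small sketch of that point in Python, assuming a toy matcher that treats
each listed contraction as one indivisible unit. The helper names are
hypothetical, and this is not how ICU or any real search collator is
implemented.

```python
def tokenize(text, contractions=()):
    """Split text into units, taking the longest listed contraction first."""
    ordered = sorted(contractions, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for c in ordered:
            if text.startswith(c, i):
                tokens.append(c)
                i += len(c)
                break
        else:
            # No contraction starts here: the single character is the unit.
            tokens.append(text[i])
            i += 1
    return tokens


def contains(text, needle, contractions=()):
    """True if the needle's units occur as a run of the text's units."""
    hay = tokenize(text, contractions)
    pat = tokenize(needle, contractions)
    return any(hay[k:k + len(pat)] == pat
               for k in range(len(hay) - len(pat) + 1))


RU = "\u0E24"           # THAI CHARACTER RU
LAKKHANGYAO = "\u0E45"  # THAI CHARACTER LAKKHANGYAO

# With no contractions, RU is found inside <RU, LAKKHANGYAO>.
plain = contains(RU + LAKKHANGYAO, RU)
# With the pair declared a contraction, RU alone no longer matches.
contracted = contains(RU + LAKKHANGYAO, RU, contractions=(RU + LAKKHANGYAO,))
```

Here `plain` is True and `contracted` is False, which is the only difference
the contraction makes in this toy model.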

Richard.

From christoph.paeper at crissov.de  Fri Jan 27 05:16:22 2017
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Fri, 27 Jan 2017 12:16:22 +0100
Subject: Curly Lips Code Point Proposal
In-Reply-To: <CAJ6uix6Yap6osjfvjrphd=6eOgf9biSKfJxFC4Dh6TvmYYnqSQ@mail.gmail.com>
References: <CADA77miPo9gB=OpOOHAdkD5eVr-O2Ndm6uOC4oPbR3x63=Q0iw@mail.gmail.com>
 <CAJ6uix6Fc-Q0MU+=KNBti99RtTQ5TgLahkbtFoCpRTqQmTYVKw@mail.gmail.com>
 <CADA77mgJNr7JUE8nKvzv93gb6QYSPgMEqeQTLY=9mPXjN_RJHQ@mail.gmail.com>
 <CAEfiL0GbJiL_EHjdvBG8cxxg9Y1+9V=kNnmxTK24EnwhUCNzpw@mail.gmail.com>
 <CAJ6uix6Yap6osjfvjrphd=6eOgf9biSKfJxFC4Dh6TvmYYnqSQ@mail.gmail.com>
Message-ID: <3E7FB424-A62F-4782-B421-11686246F2BB@crissov.de>

Leonardo Boiko <leoboiko at namakajiri.net>:
> 
> That's why U+1F481 INFORMATION DESK PERSON 💁 is listed with the keyword "sassy" in the Unicode emoji table (besides "tipping hand").  Which helps a lot, because the keywords are used by input methods to search characters; if no one bothered to keep track of how people are using emoji, then people would try looking for the "sassy" gesture and find nothing, and they'd have to learn that it's called "information desk person", even though no one uses it with this meaning.
> 
> Precisely because language (and symbolic systems like emoji) are in flux, it's a good idea trying to document how it's used.

?? Maybe, but that's not what's actually done.

<http://www.unicode.org/emoji/charts/emoji-annotations.html> s/(butt|boob|phall|penis|vulva|vagina|genital|orgasm|sex|69|fart)/

<http://unicode.org/cldr/charts/latest/annotations/germanic.html>

> 🍆	*eggplant | aubergine | vegetable	?
> 🍑	*peach | fruit	?
> 🌷	*tulip | flower	?
> ♋	*Cancer | crab | zodiac	?
> 💦	*sweat droplets | comic | splashing | sweat	?
> 💨	*dashing away | comic | dash | running	?


From verdy_p at wanadoo.fr  Sat Jan 28 11:24:19 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 28 Jan 2017 18:24:19 +0100
Subject: Pagus symbol
Message-ID: <CAGa7JC2M-+yUnq2ch-H9rFfhY2KSFPJn_v0=MtFkS2ASsMopTg@mail.gmail.com>

See Sample [1]

The symbol that is shown near some villages (Cuce, Cice, Bruts) on this old
map is for "pagus" (plural "pagi"), an old territorial unit grouping
several villages. It would more or less map to today's cantons in
France (or "pays" in today's rural speech), or to counties in England
(though smaller than counties). [2]

It looks like an ideogram in use in Roman or medieval periods (in the
example above it appears later, on a map of the 17th century). I've seen it
several times (not just on maps) with minor variations. It looks like two
stylized bell towers, each with a top platform holding a Christian cross,
flanking the circle that locates the village. It gives higher importance
to these places than to the surrounding villages that are administered from
the pagus.

Are there other examples of symbols used on maps or old judiciary acts that
could be encoded?

[1]
https://commons.wikimedia.org/wiki/File:Tabula_ducatus_britanniae_gallis_-_Sud_Rennes.png
[2] https://en.wikipedia.org/wiki/Pagus

From verdy_p at wanadoo.fr  Sat Jan 28 11:36:50 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 28 Jan 2017 18:36:50 +0100
Subject: Pagus symbol
In-Reply-To: <CAGa7JC2M-+yUnq2ch-H9rFfhY2KSFPJn_v0=MtFkS2ASsMopTg@mail.gmail.com>
References: <CAGa7JC2M-+yUnq2ch-H9rFfhY2KSFPJn_v0=MtFkS2ASsMopTg@mail.gmail.com>
Message-ID: <CAGa7JC13Gsx6vGPY0Z_P=vt9urGEZ17GnErJO0dxo4dvHzOzMA@mail.gmail.com>

Another example, from the same period, in Western Russia: the symbol is less
"ideographic" and colored in red; it clearly shows a church bell tower and
a dependent building:

https://commons.wikimedia.org/wiki/File:Atlas_Van_der_Hagen-KW1049B10_032-RVSSIAE_Vulgo_MOSCOVIA_dictae,_Pars_Occidentalis.jpeg

The same thing in England:

https://commons.wikimedia.org/wiki/File:Atlas_Van_der_Hagen-KW1049B11_004-A_NEW_MAP_OF_THE_KINGDOME_of_ENGLAND,_Representing_the_Princedome_of_WALES,_and_other_PROVINCES,_CITIES,_MARKET_TOWNS,_with_the_ROADS_from_TOWN_to_TOWN.jpeg

Other variant (two towers or high houses):

https://commons.wikimedia.org/wiki/File:Arae_Flaviae_tab_peut.jpg



From asmusf at ix.netcom.com  Sat Jan 28 14:39:58 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 28 Jan 2017 12:39:58 -0800
Subject: Pagus symbol
In-Reply-To: <CAGa7JC13Gsx6vGPY0Z_P=vt9urGEZ17GnErJO0dxo4dvHzOzMA@mail.gmail.com>
References: <CAGa7JC2M-+yUnq2ch-H9rFfhY2KSFPJn_v0=MtFkS2ASsMopTg@mail.gmail.com>
 <CAGa7JC13Gsx6vGPY0Z_P=vt9urGEZ17GnErJO0dxo4dvHzOzMA@mail.gmail.com>
Message-ID: <4021a0fb-44f6-53f1-b05d-d3806adae28b@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170128/af0763de/attachment.html>