From A.Schappo at lboro.ac.uk  Fri Jan  1 08:00:01 2016
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Fri, 1 Jan 2016 14:00:01 +0000
Subject: Unicode in the Curriculum?
In-Reply-To: <slrnn8aun4.f73.jcb@home.stevens-bradfield.com>
References: <567331D2.1000007@gmail.com>
 <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk>
 <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com>
 <B8EEC34E-8FAE-4832-BEEA-2F87E828E7CB@lboro.ac.uk>
 <slrnn8aun4.f73.jcb@home.stevens-bradfield.com>
Message-ID: <16BCDCD6-502F-4382-BA62-109160E18752@lboro.ac.uk>


Julian,

We have very different POVs on this topic. You raise a number of issues which would take me many many thousands of words to properly discuss. I will attempt a summary discussion of some of the issues.

? IT i18n is a huge subject area. Unicode is only one component. My module included: s/w i18n & L10n, character sets, unicode & unicode encodings, fonts, keyboard mappings, input methods, language tags, IDNs, website i18n, adaptive i18n websites, characteristics of human language scripts. One of my problems when putting the module together was deciding what to leave out. Actually every year I have tended to add a bit more teaching material to the module. It was along the lines ? Oh! I cannot leave that out ??  Much of IT i18n has a WOW factor. I have many times seen the WOW!?I didn't know that!?Really! reactions from students when teaching them about IT i18n
? There is also the cultural aspect which adds an extra richness, depth and interest to IT i18n
? IT i18n has many layers of detail. Each layer has concepts and realisations. Using your terminology, each layer has intellectual content
? Technical skills encompass and embody concepts and realisations. A technical skill is not adaptable/flexible unless one understands the concepts and has had the realisations. Computer Science needs to give students such technical skills so that they will be able to function and contribute in the non academic world.
? In my experience, just telling students to read the manuals does not work. Most need to be guided to the realisations and concepts. Many 1st year students have been programming from an early age, like 9/10 years old, and are gifted programmers. They have encountered unicode and hacked solutions to immediate problems. But, they do not have an understanding of Unicode and there are many layers to understand in unicode. 
? My primary aim/goal/passion is to teach/encourage students to code for the World and not just Britain.
? The current situation is that the majority of students (actually, and academic staff) do not even think about i18n of their Apps/Systems/Websites.
 eg I will say to a Final Year Project student: "have you thought about internationalising your s/w". Time and time again the response is, No.
? There is a lot of ongoing development of i18n features in css, html, programming languages, social media. All these developments need to be studied and taught
? Surely one of the purposes of lecturing is to make the complicated, simpler. When I first started (self)studying Unicode I was completely baffled. I was overwhelmed with a mass of data, concepts, techniques, reports, standards. I just kept reading and thinking and experimenting. I read about Unicode from many different points of view. I wrote code to process unicode text. That took a lot of effort and time. Now I consider myself knowledgeable about unicode and am in a position to make unicode simpler for students. A 1 hour lecture from me on Unicode will save a student days of self study. Students have a very heavy workload and do not have time for unguided and unstructured self study.

All for now??

Andr? Schappo

On 31 Dec 2015, at 18:58, Julian Bradfield wrote:

> On 2015-12-31, Andre Schappo <A.Schappo at lboro.ac.uk> wrote:
> 
>> I have been hitting my head against the Academic Brick Wall for
>> years WRT getting IT i18n and Unicode on the curriculum and I am
>> losing. I did teach a final year elective module on IT i18n but a
>> few months ago my University dropped it. I am continually puzzled by
>> the lack of interest University Computer Science departments have in
>> i18n. I appear to be a solitary UK University Computer Science voice
>> when it comes to i18n. 
> 
> Well, I'd say that it's not the business of Computer Science degrees
> to teach specific technical skills. It's our business to help people
> learn about the fundamentals of the subject, so that they can acquire
> any specific skill on demand, and use that skill competently. In those
> areas where we do teach specific skills (e.g. machine learning
> techniques) we teach those that have some intellectual content to
> them.  (This is why we don't teach programming languages as such - we
> teach a programming language as a means of learning a programming
> paradigm.)
> 
> In my experience so far, using Unicode and doing i18n is not very
> interesting (killingly boring, actually) from a purely CS technical
> point of view, unless you happen to be one of the small minority who
> enjoys script and font layout issues - the interesting bits of doing
> i18n are in producing linguistically and culturally appropriate
> messages, and that's where one should bring in experts, not expect
> typical software developers to be able to do it.
> 
> If you still have the materials for your course, it would be
> interesting to see how you managed to get an interesting (and
> examinable!) course out of i18n.
> 
> I do in fact mention Unicode and i18n in my introductory programming
> course (which is not for CS students), but all I say is "you should
> know it's there, and if you become a competent programmer, then you
> can read the manuals and tutorials to learn what you need".
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 


From asmus-inc at ix.netcom.com  Fri Jan  1 14:09:13 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 1 Jan 2016 12:09:13 -0800
Subject: Unicode in the Curriculum?
In-Reply-To: <16BCDCD6-502F-4382-BA62-109160E18752@lboro.ac.uk>
References: <567331D2.1000007@gmail.com>
 <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk>
 <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com>
 <B8EEC34E-8FAE-4832-BEEA-2F87E828E7CB@lboro.ac.uk>
 <slrnn8aun4.f73.jcb@home.stevens-bradfield.com>
 <16BCDCD6-502F-4382-BA62-109160E18752@lboro.ac.uk>
Message-ID: <5686DCE9.1060401@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160101/0ec6faa4/attachment.html>

From scarboroughben at gmail.com  Sun Jan  3 22:54:50 2016
From: scarboroughben at gmail.com (Ben Scarborough)
Date: Sun, 3 Jan 2016 22:54:50 -0600
Subject: Errors in CJK F chart in L2/15-339
Message-ID: <CAKZQS28YSDeU2ZjffjmiZdgS_-F8U20tyMew9J27gPo7=UVqAg@mail.gmail.com>

I've found at least two errors in the CJK F charts in L2/15-339 (and thus
in WG2 N4705 and IRG N2130 as well). Who should I contact about this?

?Ben Scarborough
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160103/0c273dfd/attachment.html>

From mpsuzuki at hiroshima-u.ac.jp  Mon Jan  4 00:05:00 2016
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Mon, 4 Jan 2016 15:05:00 +0900
Subject: [Unicode] Errors in CJK F chart in L2/15-339
In-Reply-To: <2ac9cc08d6334ea0a78759d3ee04a6f9@PS1PR04MB0953.apcprd04.prod.outlook.com>
References: <2ac9cc08d6334ea0a78759d3ee04a6f9@PS1PR04MB0953.apcprd04.prod.outlook.com>
Message-ID: <568A0B8C.9060205@hiroshima-u.ac.jp>

Hi,

I'm interested in the errors you found.

The stabilization of CJK F is very important work of
this year. Usually IRG expects the submissions from
from the members (e.g. UTC), but in this case, you
can submit your individual contribution to IRG, I guess.
Please contact with the chair of IRG, Dr. Lu Qin, at
csluqin at comp.polyu.edu.hk

Regards,
suzuki toshiya, Hiroshima University, Japan

Ben Scarborough wrote:
> I've found at least two errors in the CJK F charts in L2/15-339 (and thus in WG2 N4705 and IRG N2130 as well). Who should I contact about this?
> 
> ?Ben Scarborough
> 

From jknappen at web.de  Mon Jan  4 02:06:03 2016
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Mon, 4 Jan 2016 09:06:03 +0100
Subject: Aw: Symbol for an upside down capital L, pointing to the right?
In-Reply-To: <DM2PR09MB03510AB51C22ECA588F3FB64C8F80@DM2PR09MB0351.namprd09.prod.outlook.com>
References: <DM2PR09MB03510AB51C22ECA588F3FB64C8F80@DM2PR09MB0351.namprd09.prod.outlook.com>
Message-ID: <trinity-87ff45ab-1fe0-479a-bc18-4ce63b445802-1451894763055@3capp-webde-bs39>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/d88fda3c/attachment.html>

From jknappen at web.de  Mon Jan  4 02:15:09 2016
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Mon, 4 Jan 2016 09:15:09 +0100
Subject: Turned Capital letter L (pointing to the left, with serifs)
Message-ID: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/370cbb3f/attachment.html>

From asmus-inc at ix.netcom.com  Mon Jan  4 04:16:59 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 4 Jan 2016 02:16:59 -0800
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
Message-ID: <568A469B.3060401@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/3d626844/attachment.html>

From verdy_p at wanadoo.fr  Mon Jan  4 05:31:23 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 4 Jan 2016 12:31:23 +0100
Subject: Symbol for an upside down capital L, pointing to the right?
In-Reply-To: <trinity-87ff45ab-1fe0-479a-bc18-4ce63b445802-1451894763055@3capp-webde-bs39>
References: <DM2PR09MB03510AB51C22ECA588F3FB64C8F80@DM2PR09MB0351.namprd09.prod.outlook.com>
 <trinity-87ff45ab-1fe0-479a-bc18-4ce63b445802-1451894763055@3capp-webde-bs39>
Message-ID: <CAGa7JC3Hog9V1AA=3PC22V9e4AHTG2JZXk6+7psMQuawqh5+xQ@mail.gmail.com>

May be because in this document, it is a measurement of time in seconds,
and related to the letter T/Tau, rather than G/Gamma.
So the idea of the Tyronian Et is not so stupid, even if the glyph used by
the printer is higher than expected (and most probably borrowed from
another font, possibly using it for the digit 7).


2016-01-04 9:06 GMT+01:00 "J?rg Knappen" <jknappen at web.de>:

> Err... in what respect would this symbol be different from a CAPITAL GREEK
> LETTER GAMMA?
>
> --J?rg Knappen
>
> *Gesendet:* Freitag, 25. Dezember 2015 um 14:43 Uhr
> *Von:* "Costello, Roger L." <costello at mitre.org>
> *An:* "unicode at unicode.org" <unicode at unicode.org>
> *Betreff:* Symbol for an upside down capital L, pointing to the right?
> Hi Folks,
>
> Here is the upside down capital L, pointing to the left:
>
> ? - TURNED SANS-SERIF CAPITAL L (U+2142)
>
> Is there a symbol for an upside down capital L, pointing to the right?
>
> /Roger
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/ba3731de/attachment.html>

From everson at evertype.com  Mon Jan  4 07:20:14 2016
From: everson at evertype.com (Michael Everson)
Date: Mon, 4 Jan 2016 13:20:14 +0000
Subject: Symbol for an upside down capital L, pointing to the right?
In-Reply-To: <trinity-87ff45ab-1fe0-479a-bc18-4ce63b445802-1451894763055@3capp-webde-bs39>
References: <DM2PR09MB03510AB51C22ECA588F3FB64C8F80@DM2PR09MB0351.namprd09.prod.outlook.com>
 <trinity-87ff45ab-1fe0-479a-bc18-4ce63b445802-1451894763055@3capp-webde-bs39>
Message-ID: <051AA876-8469-4A75-AE03-98031D28F09E@evertype.com>

On 4 Jan 2016, at 08:06, J?rg Knappen <jknappen at web.de> wrote:

> Err... in what respect would this symbol be different from a CAPITAL GREEK LETTER GAMMA?

Perhas that is not the right question. Gamma is only one of many right-angle letter characters in the standard. 

The question remains: What would INVERTED SANS-SERIF CAPITAL L to be used for?

Michael Everson * http://www.evertype.com/


From everson at evertype.com  Mon Jan  4 07:25:24 2016
From: everson at evertype.com (Michael Everson)
Date: Mon, 4 Jan 2016 13:25:24 +0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
Message-ID: <5BFE25B1-6487-43AF-B01F-68C3F5C5A1BD@evertype.com>

On 4 Jan 2016, at 08:15, J?rg Knappen <jknappen at web.de> wrote:
> 
> Here is a report of a rather strange beast occurring in historical math printing (work of C. F. Gau?) in thw 19th century:
>  
> http://tex.stackexchange.com/questions/284483/how-do-i-typeset-this-symbol-possibly-astronomical

The image there is clearly a digit 7.  

> images are here:
>  
> http://www.archive.org/stream/abhandlungenmet00gausrich#page/n129/mode/2up

This will not load for me. 

> http://i.stack.imgur.com/57fN3.png

Again, this is a digit 7. From a different font than the other 7?s set there. 

>  It looks like a big digit "7" or like a turned letter "L". In the accepted answer it was identified with the Tironian note et; an identification
> I'd dispute because the Tironian note Et is usually smaller in size than a capital latin letter.

It is not a Tironian et. The Tironian Et typically has a descender and goes to x-height. Also the horizonal stroke would never be written like that 7, and indeed the angle (if less than 90?) of the descender wouldn?t be so small. 

Michael Everson * http://www.evertype.com/


From frederic.grosshans at gmail.com  Mon Jan  4 08:06:24 2016
From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=)
Date: Mon, 04 Jan 2016 14:06:24 +0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
Message-ID: <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>

Le lun. 4 janv. 2016 ? 09:18, "J?rg Knappen" <jknappen at web.de> a ?crit :

> Here is a report of a rather strange beast occurring in historical math
> printing (work of C. F. Gau?) in thw 19th century:
>
>
> http://tex.stackexchange.com/questions/284483/how-do-i-typeset-this-symbol-possibly-astronomical
>
> images are here:
>
> http://www.archive.org/stream/abhandlungenmet00gausrich#page/n129/mode/2up
> http://i.stack.imgur.com/57fN3.png
>
> It looks like a big digit "7" or like a turned letter "L". In the accepted
> answer it was identified with the Tironian note et; an identification
> I'd dispute because the Tironian note Et is usually smaller in size than a
> capital latin letter.
>

I don?t know what the glyph is, but I doubt that a digit or tironian et
makes sense semantically. Since it corresponds to an angular measure (the
daily angular displacement of a celestial body), the unicode character
correspnding to it is likely ? U+29A2 TURNED ANGLE

  Fr?d?ric
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/e1e1ecf0/attachment.html>

From ejp10 at psu.edu  Mon Jan  4 08:44:38 2016
From: ejp10 at psu.edu (Elizabeth J. Pyatt)
Date: Mon, 4 Jan 2016 09:44:38 -0500
Subject: Unicode in the Curriculum? (Julian Bradfield)
In-Reply-To: <mailman.0.1451671201.13788.unicode@unicode.org>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
Message-ID: <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>

Like some others on the list, I believe Unicode should be mentioned at different points in a programming curriculum, particularly at the time when ASCII would be taught. Font design and typography is perhaps a different topic, but if it?s mentioned, why not mention CSS/font options for different scripts?

Any cloud based tool with ambitions to be a force in the global market MUST use Unicode. Tools such as Twitter, Word Press, Wikipedia, Facebook, Google Docs, Apple Mail/Outlook/Thunderbird and others work with multiple languages because of Unicode. And yes we all want to use our emojis! Optimal Unicode support means getting it right the first time to support thousands of languages instead of adding language support one by one.

Even ?English only? pages, particularly educational pages, can include characters outside of Latin 1 such as math and technical symbols, smart curly quotes, long dashes, math symbols, and yes non-English words. I long for the day when I will no longer see phrases with mangled punctuation like: ?They%!re half their size%#Weight Loss Winners?. Thanks to Unicode savvy Web designers, it?s a sight seen much less than 10 years ago.

Elizabeth

=-=-=-=-=-=-=-=-=-=-=-=-=
Elizabeth J. Pyatt, Ph.D.
Instructional Designer
Teaching and Learning with Technology
Penn State University
ejp10 at psu.edu, (814) 865-0805 or (814) 865-2030 (Main Office)

210 Rider Building  (formerly Rider II)
227 W. Beaver Avenue
State College, PA   16801-4819
http://www.personal.psu.edu/ejp10/psu
http://tlt.psu.edu


From raymond at almanach.co.uk  Mon Jan  4 09:38:02 2016
From: raymond at almanach.co.uk (Raymond Mercier)
Date: Mon, 4 Jan 2016 15:38:02 -0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
Message-ID: <36275E73002B46CFAE7F2D3A81917380@UserPC>

The sign described as like 7 is surely a cursive form of ?. The form used by Gauss (Disquisitio de elementis ellipticis Palladis) is much the same as that shown in manuals of Greek Palaeography as a cursive ?. This is given by E.P. Thompson in two works, An Introduction to Greek and Latin Palaeography, Oxford, 1912, p.83, and  A Handbook of Greek and Latin Palaeography, Chicago, 1975, p. 95.
Raymond Mercier
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/f1cdd39c/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Pi_Abbrev.jpg
Type: image/jpeg
Size: 14412 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/f1cdd39c/attachment.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GaussPallas_21.jpg
Type: image/jpeg
Size: 80689 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/f1cdd39c/attachment-0001.jpg>

From everson at evertype.com  Mon Jan  4 09:49:15 2016
From: everson at evertype.com (Michael Everson)
Date: Mon, 4 Jan 2016 15:49:15 +0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <36275E73002B46CFAE7F2D3A81917380@UserPC>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
Message-ID: <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>

Excellent!
Looks like a candidate character for encoding. I?m sure I have some examples of good font designs for the old character in one of my books. 

> On 4 Jan 2016, at 15:38, Raymond Mercier <raymond at almanach.co.uk> wrote:
> 
> The sign described as like 7 is surely a cursive form of ?. The form used by Gauss (Disquisitio de elementis ellipticis Palladis) is much the same as that shown in manuals of Greek Palaeography as a cursive ?. This is given by E.P. Thompson in two works, An Introduction to Greek and Latin Palaeography, Oxford, 1912, p.83, and  A Handbook of Greek and Latin Palaeography, Chicago, 1975, p. 95.
> Raymond Mercier
> <Pi_Abbrev.jpg><GaussPallas_21.jpg>

Michael Everson * http://www.evertype.com/


From asmus-inc at ix.netcom.com  Mon Jan  4 10:54:18 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 4 Jan 2016 08:54:18 -0800
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
Message-ID: <568AA3BA.1030201@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/3e1e1214/attachment.html>

From asmus-inc at ix.netcom.com  Mon Jan  4 10:59:23 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 4 Jan 2016 08:59:23 -0800
Subject: Unicode in the Curriculum? (Julian Bradfield)
In-Reply-To: <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
Message-ID: <568AA4EB.5070606@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/406bf104/attachment.html>

From asmus-inc at ix.netcom.com  Mon Jan  4 10:59:57 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 4 Jan 2016 08:59:57 -0800
Subject: Aw: Symbol for an upside down capital L, pointing to the right?
In-Reply-To: <trinity-87ff45ab-1fe0-479a-bc18-4ce63b445802-1451894763055@3capp-webde-bs39>
References: <DM2PR09MB03510AB51C22ECA588F3FB64C8F80@DM2PR09MB0351.namprd09.prod.outlook.com>
 <trinity-87ff45ab-1fe0-479a-bc18-4ce63b445802-1451894763055@3capp-webde-bs39>
Message-ID: <568AA50D.9020007@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/ec2bedfe/attachment.html>

From dwanders at sonic.net  Mon Jan  4 11:09:55 2016
From: dwanders at sonic.net (Deborah W. Anderson)
Date: Mon, 4 Jan 2016 09:09:55 -0800
Subject: Errors in CJK F chart in L2/15-339
In-Reply-To: <CAKZQS28YSDeU2ZjffjmiZdgS_-F8U20tyMew9J27gPo7=UVqAg@mail.gmail.com>
References: <CAKZQS28YSDeU2ZjffjmiZdgS_-F8U20tyMew9J27gPo7=UVqAg@mail.gmail.com>
Message-ID: <000e01d14712$bc3972c0$34ac5840$@sonic.net>

Dear Ben,

Please send any comments on the proposed additional repertoire for the 5th edition (CD2, L2/15-339) via the contact form http://www.unicode.org/reporting.html.  (Shortly I expect there will be PRI for feedback on L2/15-339, but it isn?t up yet.)

 
Forwarding the feedback to Dr. Lu is also a very good idea, as recommended by Suzuki-san.

 
Thanks,

Debbie Anderson

 
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ben Scarborough
Sent: Sunday, January 03, 2016 8:55 PM
To: unicode at unicode.org
Subject: Errors in CJK F chart in L2/15-339

 
I've found at least two errors in the CJK F charts in L2/15-339 (and thus in WG2 N4705 and IRG N2130 as well). Who should I contact about this?

 
?Ben Scarborough

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/cdf4235d/attachment.html>

From everson at evertype.com  Mon Jan  4 12:41:53 2016
From: everson at evertype.com (Michael Everson)
Date: Mon, 4 Jan 2016 18:41:53 +0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <568AA3BA.1030201@ix.netcom.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
Message-ID: <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>

On 4 Jan 2016, at 16:54, Asmus Freytag (t) <asmus-inc at ix.netcom.com> wrote:
> 
> On 1/4/2016 7:49 AM, Michael Everson wrote:
>> Excellent!
>> Looks like a candidate character for encoding. I?m sure I have some examples of good font designs for the old character in one of my books. 
> 
> Admitting that a Greek letter inherently makes more sense than an "et" as a variable name, I would still need to understand why "pi" would make a sensible mnemonic choice for the variable in Gauss' treatise, before being confident that we've made the correct     identification. The more so, as the use of non-cursive pi for "perihelion" in the same work is clearly mnemonic.

Certainly it does look more like a very common variant of ?tau? than ?pi?

Michael Everson * http://www.evertype.com/


From asmus-inc at ix.netcom.com  Mon Jan  4 13:58:12 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 4 Jan 2016 11:58:12 -0800
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
Message-ID: <568ACED4.3010505@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/86d487da/attachment.html>

From raymond at almanach.co.uk  Mon Jan  4 14:27:44 2016
From: raymond at almanach.co.uk (Raymond Mercier)
Date: Mon, 4 Jan 2016 20:27:44 -0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <568ACED4.3010505@ix.netcom.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com>
Message-ID: <C08C601E4F9644FE859DA75844463B97@UserPC>

On further reflection I can well agree that it is tau. The attached images from R. Barbour, Greek Literary Hands show clearly (scan 3) the large upper case tau in several lines, and in scan 4 in the first and other lines a hooked version of tau. So I withdraw my suggestion of pi.
Raymond

From: Asmus Freytag (t) 
Sent: Monday, January 04, 2016 7:58 PM
To: unicode at unicode.org 
Subject: Re: Turned Capital letter L (pointing to the left, with serifs)

On 1/4/2016 10:41 AM, Michael Everson wrote:

  Certainly it does look more like a very common variant of ?tau? than ?pi?


Variant of uppercase tau?

A./
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/3a1356ad/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scan0003.jpg
Type: image/jpeg
Size: 156740 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/3a1356ad/attachment-0002.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scan0004.jpg
Type: image/jpeg
Size: 182112 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/3a1356ad/attachment-0003.jpg>

From raymond at almanach.co.uk  Mon Jan  4 14:33:27 2016
From: raymond at almanach.co.uk (Raymond Mercier)
Date: Mon, 4 Jan 2016 20:33:27 -0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <568ACED4.3010505@ix.netcom.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com>
Message-ID: <7F2DF0906BEC42219F6825A4D225926A@UserPC>

On further reflection I can well agree that it is tau. The attached images from R. Barbour, Greek Literary Hands, show clearly (scan 3) the large upper case tau in several lines, and in scan 4 in the first and other lines a hooked version of tau. So I withdraw my suggestion of pi.
Raymond

From: Asmus Freytag (t) 
Sent: Monday, January 04, 2016 7:58 PM
To: unicode at unicode.org 
Subject: Re: Turned Capital letter L (pointing to the left, with serifs)

On 1/4/2016 10:41 AM, Michael Everson wrote:

  Certainly it does look more like a very common variant of ?tau? than ?pi?


Variant of uppercase tau?

A./
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/f28650ab/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scan0003_2.jpg
Type: image/jpeg
Size: 19199 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/f28650ab/attachment.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scan0004_1.jpg
Type: image/jpeg
Size: 20048 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/f28650ab/attachment-0001.jpg>

From asmus-inc at ix.netcom.com  Mon Jan  4 13:58:12 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 4 Jan 2016 11:58:12 -0800
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
Message-ID: <568ACED4.3010505@ix.netcom.com>

On 1/4/2016 10:41 AM, Michael Everson wrote:
>> Certainly it does look more like a very common variant of ?tau? than ?pi?
> 
> Variant of uppercase tau?

No, of lowercase.

Michael

From frederic.grosshans at gmail.com  Mon Jan  4 15:33:45 2016
From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=)
Date: Mon, 04 Jan 2016 21:33:45 +0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <7F2DF0906BEC42219F6825A4D225926A@UserPC>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC>
Message-ID: <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>

I looked all the pages of the 1809 edition of _Theoria motus corporum
coelestium in sectionibus conicis solem ambientium_
https://archive.org/stream/bub_gb_ORUOAAAAQAAJ where Gauss used this
notation in pages 80-81. Almost all notations are standard enough to be
familiar to any modern (2015) mathematician or physicist, with two
exceptions : this "7" symbol and ? U+260A ASCENDING NODE (which is still
standard in astronomy). The Greek letters in particular have a pretty
standard shape, and I don't see why this symbol would be the only geek
letter using a fancy cursive shape. Even the Latin letters used standard
shapes ( italic, roman, a few capital fraktur).

That said,  I did not spot a tau in the text, while most of the Greek
alphabet was used. Could "7" be a standard shape for tau in 1809 Hamburg ?

However, I still think it is a ? U+29A2 TURNED ANGLE

  Fr?d?ric

Le lun 4 janv. 2016 21:38, Raymond Mercier <raymond at almanach.co.uk> a
?crit :

> On further reflection I can well agree that it is tau. The attached images
> from R. Barbour, Greek Literary Hands, show clearly (scan 3) the large
> upper case tau in several lines, and in scan 4 in the first and other lines
> a hooked version of tau. So I withdraw my suggestion of pi.
> Raymond
>
> *From:* Asmus Freytag (t) <asmus-inc at ix.netcom.com>
> *Sent:* Monday, January 04, 2016 7:58 PM
> *To:* unicode at unicode.org
> *Subject:* Re: Turned Capital letter L (pointing to the left, with serifs)
>
> On 1/4/2016 10:41 AM, Michael Everson wrote:
>
> Certainly it does look more like a very common variant of ?tau? than ?pi?
>
>
> Variant of uppercase tau?
>
> A./
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/e953cdda/attachment.html>

From everson at evertype.com  Mon Jan  4 16:14:14 2016
From: everson at evertype.com (Michael Everson)
Date: Mon, 4 Jan 2016 22:14:14 +0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC>
 <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>
Message-ID: <08A1BCBE-382D-4547-ADFE-83863C089DBC@evertype.com>

On 4 Jan 2016, at 21:33, Fr?d?ric Grosshans <frederic.grosshans at gmail.com> wrote:
> 
> The Greek letters in particular have a pretty standard shape, and I don't see why this symbol would be the only geek letter using a fancy cursive shape. Even the Latin letters used standard shapes ( italic, roman, a few capital fraktur). 

If he uses a regular tau for anything else that would be the reason.

> That said,  I did not spot a tau in the text, while most of the Greek alphabet was used. Could "7" be a standard shape for tau in 1809 Hamburg?

It?s a standard variant in many older Greek typefaces with ligatures. 

Michael Everson * http://www.evertype.com/


From asmus-inc at ix.netcom.com  Mon Jan  4 23:08:03 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 4 Jan 2016 21:08:03 -0800
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC>
 <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>
Message-ID: <568B4FB3.2080708@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160104/125d5ee1/attachment.html>

From lists+unicode at seantek.com  Mon Jan  4 23:30:32 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Mon, 4 Jan 2016 21:30:32 -0800
Subject: Unicode password mapping for crypto standard
Message-ID: <568B54F8.5000802@seantek.com>

Hi Unicode list, I am looking for feedback on this proposal, 
specifically a standard specification to map between (presumably) 
Unicode text strings and octet strings.

A "password" is defined as an arbitrary octet string in a number of 
protocols and formats. This has worked for basic cases where the 
"password" is just ASCII, but there are interoperability issues when 
characters beyond ASCII get involved. My observation is that a lot of 
security folks get hand-wavy about the Unicode stuff, which is why there 
is little standardization in this area.

Recently in the IETF, application/pkcs8-encrypted is proposed for the PKCS #8 EncryptedPrivateKeyInfo type. For purposes of our discussion, the format takes as input an opaque octet string (any octet in the range 00h-FFh, of any length), and executes various specified algorithms; the result is a decrypted private key. The most common algorithm is PBKDF2, but any algorithm can be used (including, for example, a raw symmetric encryption algorithm such as AES-256).

PKCS #8 punts on the issue of character encoding. It says that ASCII or UTF-8 could be used, but doesn?t enforce anything in particular. PKCS #12 specifies UTF-16LE with a terminating NULL character (00h 00h).

In the application/pkcs8-encrypted registration, I thought it might be wise to allow senders and receivers to specify how input (whether user input or otherwise) gets mapped to the octet string, since it's not part of the format. Originally my concern at that time was to reflect IANA character sets, rather than profiles of Unicode.

These days, however, most user agents are Unicode-enabled and will accept user input in Unicode. Therefore, issue is less about legacy character sets, and more about how to take the Unicode input and get a consistent and reasonable stream of bits out on both ends. For example: should the password be case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE, etc.? Constraining or transforming the input would be helpful for disparate systems to agree on these things.


Thank you,

Sean

PS I read the "Unicode in passwords" thread. It's relevant. An 
alternative or addition to proposing a mapping to/from Unicode, might be 
to have a "keyboard-mapping" or "keyboard-layout" parameter, that 
specifies the suggested layout of the keyboard (or input device) used 
for password input, preferably by deferring to some international 
standard on the topic. Such a parameter could influence the initial user 
input method, but it doesn't answer the question of how to turn the key 
presses into specific bits (Unicode-based or otherwise).

**********
The relevant part of the template (most recent proposal, today) is:
***
Optional parameters:

password-mapping:
When the private key encryption algorithm incorporates a "password" that 
is an octet string, a mapping between user input and the octet string is 
desirable. PKCS #5 [RFC2898] Section 3 recommends "that applications 
follow some common text encoding rules"; it then suggests, but does not 
recommend, ASCII and UTF-8. This parameter specifies the charset that a 
recipient SHOULD attempt first when mapping user input to the octet 
string. It has similar semantics as the charset parameter from 
text/plain, except that it only applies to the user?s input of the 
password. There is no default value.

The following special values are defined:
*pkcs12  = UTF-16LE with U+0000 NULL terminator (PKCS #12-style)
*precis  = PRECIS password profile, i.e., OpaqueString from Section 4 of 
RFC 7613 (always UTF-8)
*precis-XXX = PRECIS profile as named XXX in the IANA PRECIS Profiles 
Registry <https://www.iana.org/assignments/precis-parameters>
*hex     = hexadecimal input: the input is mapped to 0-9, A-F, and then 
converted directly to octets. If there are an odd number of hex digits, 
the final digit 0 is appended, or an error condition may be raised. 
Compare with Annex M.4 of IEEE 802.11-2012.
*dtmf    = The characters "0"-"9", "A"-"D", "*", and "#", which map to 
their corresponding ASCII codes. (This is to support restricted-input 
devices, i.e., telephones and telephone-like equipment.)

Otherwise, the value of this parameter is a charset, from the Character 
Sets Registry <http://www.iana.org/assignments/character-sets>.
***

The relevant part of the original template (proposed 2015-11-04) is:
***
Optional parameters:
charset: When the private key encryption algorithm incorporates a ?password" that is an octet string, a mapping between user input and the octet string is desirable. PKCS #5 [RFC2898] Section 3 recommends "that applications follow some common text encoding rules"; it then suggests, but does not recommend, ASCII and UTF-8. This parameter specifies the charset that a recipient SHOULD attempt first when mapping user input to the octet string. It has the same semantics as the charset parameter from text/plain, except that it only applies to the user?s input of the password. There is no default value.

ualg: When the charset is a Unicode-based encoding, this parameter is a space-delimited list of Unicode algorithms that a recipient SHOULD first attempt to apply to the Unicode user input in succession, in order to derive the octet string. The list of algorithm keywords is defined by [UNICODE]. ?Tailored operations? are operations that are sensitive to language, which must be provided as an input parameter. If a tailored operation is called for, the exclamation mark followed by the [BCP47] language tag specifies the language. For example, "toNFD toNFKC_Casefold!tr" first applies Normalization Form D, followed by Normalization Form KC with Case Folding in the Turkish language, according to [UNICODE] and [UAX31]. The default value of this parameter is empty, and leaves the matter of whether to normalize, case fold, or apply other transformations unspecified.


The latest template is here:

http://mailarchive.ietf.org/arch/msg/precis/Qil9mc5AtqxXp8OXllp0lAwYts4


From c933103 at gmail.com  Tue Jan  5 01:19:25 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Tue, 5 Jan 2016 15:19:25 +0800
Subject: Unicode password mapping for crypto standard
In-Reply-To: <568B54F8.5000802@seantek.com>
References: <568B54F8.5000802@seantek.com>
Message-ID: <CAGHjPPL_DKCKNXgsgCHraX=UtZfnYVCrbfsxaVSAEShh8YZYsw@mail.gmail.com>

Hello, I don't have much knowledge on the topic, but 1. probably something
like the punycode used for internationalized domain name might help? 2. I
don't think keyboard mapping is a good idea, as to some less computer-savvy
Chinese-speaking users, it's often that their only way to write Chinese
into computer is by handwriting and handwriting doesn't seem to be
something supported by keyboard mapping.
2016/01/05 13:33 "Sean Leonard" <lists+unicode at seantek.com>:

> Hi Unicode list, I am looking for feedback on this proposal, specifically
> a standard specification to map between (presumably) Unicode text strings
> and octet strings.
>
> A "password" is defined as an arbitrary octet string in a number of
> protocols and formats. This has worked for basic cases where the "password"
> is just ASCII, but there are interoperability issues when characters beyond
> ASCII get involved. My observation is that a lot of security folks get
> hand-wavy about the Unicode stuff, which is why there is little
> standardization in this area.
>
> Recently in the IETF, application/pkcs8-encrypted is proposed for the PKCS
> #8 EncryptedPrivateKeyInfo type. For purposes of our discussion, the format
> takes as input an opaque octet string (any octet in the range 00h-FFh, of
> any length), and executes various specified algorithms; the result is a
> decrypted private key. The most common algorithm is PBKDF2, but any
> algorithm can be used (including, for example, a raw symmetric encryption
> algorithm such as AES-256).
>
> PKCS #8 punts on the issue of character encoding. It says that ASCII or
> UTF-8 could be used, but doesn?t enforce anything in particular. PKCS #12
> specifies UTF-16LE with a terminating NULL character (00h 00h).
>
> In the application/pkcs8-encrypted registration, I thought it might be
> wise to allow senders and receivers to specify how input (whether user
> input or otherwise) gets mapped to the octet string, since it's not part of
> the format. Originally my concern at that time was to reflect IANA
> character sets, rather than profiles of Unicode.
>
> These days, however, most user agents are Unicode-enabled and will accept
> user input in Unicode. Therefore, issue is less about legacy character
> sets, and more about how to take the Unicode input and get a consistent and
> reasonable stream of bits out on both ends. For example: should the
> password be case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE,
> etc.? Constraining or transforming the input would be helpful for disparate
> systems to agree on these things.
>
>
> Thank you,
>
> Sean
>
> PS I read the "Unicode in passwords" thread. It's relevant. An alternative
> or addition to proposing a mapping to/from Unicode, might be to have a
> "keyboard-mapping" or "keyboard-layout" parameter, that specifies the
> suggested layout of the keyboard (or input device) used for password input,
> preferably by deferring to some international standard on the topic. Such a
> parameter could influence the initial user input method, but it doesn't
> answer the question of how to turn the key presses into specific bits
> (Unicode-based or otherwise).
>
> **********
> The relevant part of the template (most recent proposal, today) is:
> ***
> Optional parameters:
>
> password-mapping:
> When the private key encryption algorithm incorporates a "password" that
> is an octet string, a mapping between user input and the octet string is
> desirable. PKCS #5 [RFC2898] Section 3 recommends "that applications follow
> some common text encoding rules"; it then suggests, but does not recommend,
> ASCII and UTF-8. This parameter specifies the charset that a recipient
> SHOULD attempt first when mapping user input to the octet string. It has
> similar semantics as the charset parameter from text/plain, except that it
> only applies to the user?s input of the password. There is no default value.
>
> The following special values are defined:
> *pkcs12  = UTF-16LE with U+0000 NULL terminator (PKCS #12-style)
> *precis  = PRECIS password profile, i.e., OpaqueString from Section 4 of
> RFC 7613 (always UTF-8)
> *precis-XXX = PRECIS profile as named XXX in the IANA PRECIS Profiles
> Registry <https://www.iana.org/assignments/precis-parameters>
> *hex     = hexadecimal input: the input is mapped to 0-9, A-F, and then
> converted directly to octets. If there are an odd number of hex digits, the
> final digit 0 is appended, or an error condition may be raised. Compare
> with Annex M.4 of IEEE 802.11-2012.
> *dtmf    = The characters "0"-"9", "A"-"D", "*", and "#", which map to
> their corresponding ASCII codes. (This is to support restricted-input
> devices, i.e., telephones and telephone-like equipment.)
>
> Otherwise, the value of this parameter is a charset, from the Character
> Sets Registry <http://www.iana.org/assignments/character-sets>.
> ***
>
> The relevant part of the original template (proposed 2015-11-04) is:
> ***
> Optional parameters:
> charset: When the private key encryption algorithm incorporates a
> ?password" that is an octet string, a mapping between user input and the
> octet string is desirable. PKCS #5 [RFC2898] Section 3 recommends "that
> applications follow some common text encoding rules"; it then suggests, but
> does not recommend, ASCII and UTF-8. This parameter specifies the charset
> that a recipient SHOULD attempt first when mapping user input to the octet
> string. It has the same semantics as the charset parameter from text/plain,
> except that it only applies to the user?s input of the password. There is
> no default value.
>
> ualg: When the charset is a Unicode-based encoding, this parameter is a
> space-delimited list of Unicode algorithms that a recipient SHOULD first
> attempt to apply to the Unicode user input in succession, in order to
> derive the octet string. The list of algorithm keywords is defined by
> [UNICODE]. ?Tailored operations? are operations that are sensitive to
> language, which must be provided as an input parameter. If a tailored
> operation is called for, the exclamation mark followed by the [BCP47]
> language tag specifies the language. For example, "toNFD
> toNFKC_Casefold!tr" first applies Normalization Form D, followed by
> Normalization Form KC with Case Folding in the Turkish language, according
> to [UNICODE] and [UAX31]. The default value of this parameter is empty, and
> leaves the matter of whether to normalize, case fold, or apply other
> transformations unspecified.
>
>
> The latest template is here:
>
> http://mailarchive.ietf.org/arch/msg/precis/Qil9mc5AtqxXp8OXllp0lAwYts4
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/92a4ac9b/attachment.html>

From duerst at it.aoyama.ac.jp  Tue Jan  5 02:26:45 2016
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Tue, 5 Jan 2016 17:26:45 +0900
Subject: Unicode in the Curriculum?
In-Reply-To: <slrnn8aun4.f73.jcb@home.stevens-bradfield.com>
References: <567331D2.1000007@gmail.com>
 <59AD8E6F-F67A-4FFD-B364-6AC9E555D23D@lboro.ac.uk>
 <6370df3a44124747bb580b2760e9d144@EX13D08UWB002.ant.amazon.com>
 <B8EEC34E-8FAE-4832-BEEA-2F87E828E7CB@lboro.ac.uk>
 <slrnn8aun4.f73.jcb@home.stevens-bradfield.com>
Message-ID: <568B7E45.2020401@it.aoyama.ac.jp>

I agree to a certain extent with Julian. There are extremely many 
subjects industry surely would like computer science students to learn 
in college, and internationalization/Unicode is only one of them.

On the other hand, I think that universities teach about integer and 
floating point representation for numbers, and likewise, they should 
teach about ASCII and Unicode for text representation.

I personally have given a full course on internationalization/Unicode 
topics only once, as a guest lecturer at the University of Linz in the 
1990ies. In that same aera, I also once gave a course about computer 
topics for Japanology students, which of course included character 
encodings, but also complete beginner stuff such as use of Web search 
engines.

Otherwise, I integrate Unicode and internationalization subjects in my 
courses where possible. As an example, in my C programming course, 
there's an exercise where students use the same C program with different 
source encodings, execution encodings, and terminal settings, getting 
some understanding for character count vs. byte count, repertoire of 
different encodings, and so on. This kind of stuff is a bit easier to do 
here in Japan, where "ASCII isn't enough" doesn't have to be explained 
at great length, and where multiple encodings (mostly UTF-8 and 
Shift_JIS) are still in use.

Regards,   Martin.

On 2016/01/01 03:58, Julian Bradfield wrote:
> On 2015-12-31, Andre Schappo <A.Schappo at lboro.ac.uk> wrote:
>
>> I have been hitting my head against the Academic Brick Wall for
>> years WRT getting IT i18n and Unicode on the curriculum and I am
>> losing. I did teach a final year elective module on IT i18n but a
>> few months ago my University dropped it. I am continually puzzled by
>> the lack of interest University Computer Science departments have in
>> i18n. I appear to be a solitary UK University Computer Science voice
>> when it comes to i18n.
>
> Well, I'd say that it's not the business of Computer Science degrees
> to teach specific technical skills. It's our business to help people
> learn about the fundamentals of the subject, so that they can acquire
> any specific skill on demand, and use that skill competently. In those
> areas where we do teach specific skills (e.g. machine learning
> techniques) we teach those that have some intellectual content to
> them.  (This is why we don't teach programming languages as such - we
> teach a programming language as a means of learning a programming
> paradigm.)
>
> In my experience so far, using Unicode and doing i18n is not very
> interesting (killingly boring, actually) from a purely CS technical
> point of view, unless you happen to be one of the small minority who
> enjoys script and font layout issues - the interesting bits of doing
> i18n are in producing linguistically and culturally appropriate
> messages, and that's where one should bring in experts, not expect
> typical software developers to be able to do it.
>
> If you still have the materials for your course, it would be
> interesting to see how you managed to get an interesting (and
> examinable!) course out of i18n.
>
> I do in fact mention Unicode and i18n in my introductory programming
> course (which is not for CS students), but all I say is "you should
> know it's there, and if you become a competent programmer, then you
> can read the manuals and tutorials to learn what you need".
>

From jknappen at web.de  Tue Jan  5 03:10:40 2016
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Tue, 5 Jan 2016 10:10:40 +0100
Subject: Aw: Re: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <568B4FB3.2080708@ix.netcom.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC>
 <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>,
 <568B4FB3.2080708@ix.netcom.com>
Message-ID: <trinity-7a6ad70d-746f-439a-ae00-100f9be0c370-1451985040574@3capp-webde-bs43>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/7bae958c/attachment.html>

From frederic.grosshans at gmail.com  Tue Jan  5 03:22:16 2016
From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=)
Date: Tue, 05 Jan 2016 09:22:16 +0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <trinity-7a6ad70d-746f-439a-ae00-100f9be0c370-1451985040574@3capp-webde-bs43>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC>
 <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>
 <568B4FB3.2080708@ix.netcom.com>
 <trinity-7a6ad70d-746f-439a-ae00-100f9be0c370-1451985040574@3capp-webde-bs43>
Message-ID: <CAJAPE_QTrJKt6T+JfW8VoCsD0xSO8uBfT3rc7tkOuLkS5Pf73Q@mail.gmail.com>

Le mar. 5 janv. 2016 10:13, "J?rg Knappen" <jknappen at web.de> a ?crit :

> I have looked up some printed sources and I agree with Michael Everson and
> Fr?d?ric Grosshans that the
> beast in question is a variant of the greek letter tau (capital or
> lowercase).
>

The identification to ? is from Asmus Freytag, not me. I have proposed
another identity  (TURNED ANGLE), and I only start to be convinced by the ?
identification

  Fr?d?ric
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/bd3fd693/attachment.html>

From jknappen at web.de  Tue Jan  5 04:07:23 2016
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Tue, 5 Jan 2016 11:07:23 +0100
Subject: Aw: Re: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <trinity-7a6ad70d-746f-439a-ae00-100f9be0c370-1451985040574@3capp-webde-bs43>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC>
 <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>,
 <568B4FB3.2080708@ix.netcom.com>,
 <trinity-7a6ad70d-746f-439a-ae00-100f9be0c370-1451985040574@3capp-webde-bs43>
Message-ID: <trinity-8f796aff-95f2-4135-bda9-60ebb1841607-1451988443362@3capp-webde-bs43>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/26e94f9f/attachment.html>

From asmus-inc at ix.netcom.com  Tue Jan  5 08:04:48 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Tue, 5 Jan 2016 06:04:48 -0800
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <CAJAPE_QTrJKt6T+JfW8VoCsD0xSO8uBfT3rc7tkOuLkS5Pf73Q@mail.gmail.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC>
 <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>
 <568B4FB3.2080708@ix.netcom.com>
 <trinity-7a6ad70d-746f-439a-ae00-100f9be0c370-1451985040574@3capp-webde-bs43>
 <CAJAPE_QTrJKt6T+JfW8VoCsD0xSO8uBfT3rc7tkOuLkS5Pf73Q@mail.gmail.com>
Message-ID: <568BCD80.20707@ix.netcom.com>

On 1/5/2016 1:22 AM, Fr?d?ric Grosshans wrote:
>
>
> Le mar. 5 janv. 2016 10:13, "J?rg Knappen" <jknappen at web.de 
> <mailto:jknappen at web.de>> a ?crit :
>
>     I have looked up some printed sources and I agree with Michael
>     Everson and Fr?d?ric Grosshans that the
>     beast in question is a variant of the greek letter tau (capital or
>     lowercase).
>
>
> The identification to ? is from Asmus Freytag, not me.

Mine is a concurring opinion based on ME's suggestion, but corroborated, 
in my view, by the systematic notational conventions and not merely 
informed by visual similarity.

A./

> I have proposed another identity  (TURNED ANGLE), and I only start to 
> be convinced by the ? identification
>
>   Fr?d?ric

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/cc4a99b0/attachment.html>

From markus.icu at gmail.com  Tue Jan  5 10:26:42 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 5 Jan 2016 08:26:42 -0800
Subject: Unicode password mapping for crypto standard
In-Reply-To: <568B54F8.5000802@seantek.com>
References: <568B54F8.5000802@seantek.com>
Message-ID: <CAN49p6pj-3EE=okMYZw73r4EfTb1JZ2vc6vPkuqkvOe51PqtqA@mail.gmail.com>

I would specify that UTF-8 must be used, without mapping.
US-ASCII is a proper subset, so need not be mentioned explicitly, nor
distinguished in the protocol.
Mappings would require that all implementations carry relevant data, and
are up to date to recent versions of Unicode, or else previously-unassigned
code points will cause failures.
As long as a user types the same password the same way, or with IMEs that
produce the same output, they are fine. Strange variants might improve
password security.

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/75c1be5f/attachment.html>

From verdy_p at wanadoo.fr  Tue Jan  5 10:26:52 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 5 Jan 2016 17:26:52 +0100
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
Message-ID: <CAGa7JC3y3SVq+p+4qjnCLZD+PqjMq1nnuqVfw=soHKqN_jX_0w@mail.gmail.com>

And given the context of use on the document, where it is a measurement of
time in seconds (it is a mean daily time drift, if you don't read German),
some variants of T/Tau is certainly a best option. The other variables in
the additive formula were also related to time and where also based on "t",
so the formula used various variants of the T/Tau letter.

Intuitively when reading the formula and description I undoubtly pronounced
it "tau" (there was no other occurence of the tau letter in the formula,
but the fact it used a bold capital may be related to the fact that the
mean daily time drift in this formula is nearly constant, with very tiny
variations that the formula wants to take into account in a differential.
Traditionally, consitnats or near constants are using bold capital letters,
and it was made to contrast it with the true time "t" which is obviously
not constant (|dt / dTau| is largely above 1, most of the time except in
very few short periods of time in the year, but the formula is not
interested in finding/predicting those events but to estimate how the
geocentric time evolves over long periods thru years in order to compute
calendars).

The discovery of the cursive variant of pi is interesting but largely too
far graphically : it is is single curved stroke like a turned "J", but here
the "7" shaped letter clearly uses two strokes, like Tau) and semantically
(pi would be related to an angle measurement, not to time, even if the
formula is related to the pseudo-elliptic revolution of Earth around Sun,
it would not be coherent with the additive differential formula cumulating
with time "t".

In summary for me it's just a bold capital Greek letter Tau (in
cursive/itialic style, like "t", because it is a true variable and not a
symbol like the differential operator). The printer however chose to use a
decorative variant of the bold digit 7 to represent it, because it had it
in its collections of metal fonts (e.g. for titling on cover pages, where
titles/headings are customarily using decorative such bold font styles).
May be if you read the rest of the text including the presentation, you
will discover it more completely or even spelled explicitly in sentences.
But we have no audio records to confirm it: the reader has to interpret it
but it is easier to read and understand if you just identify it as "Tau"
rather than "T" or worse as "7".

2016-01-04 19:41 GMT+01:00 Michael Everson <everson at evertype.com>:

> On 4 Jan 2016, at 16:54, Asmus Freytag (t) <asmus-inc at ix.netcom.com>
> wrote:
> >
> > On 1/4/2016 7:49 AM, Michael Everson wrote:
> >> Excellent!
> >> Looks like a candidate character for encoding. I?m sure I have some
> examples of good font designs for the old character in one of my books.
> >
> > Admitting that a Greek letter inherently makes more sense than an "et"
> as a variable name, I would still need to understand why "pi" would make a
> sensible mnemonic choice for the variable in Gauss' treatise, before being
> confident that we've made the correct     identification. The more so, as
> the use of non-cursive pi for "perihelion" in the same work is clearly
> mnemonic.
>
> Certainly it does look more like a very common variant of ?tau? than ?pi?
>
> Michael Everson * http://www.evertype.com/
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/f4720314/attachment.html>

From bortzmeyer at nic.fr  Tue Jan  5 10:37:05 2016
From: bortzmeyer at nic.fr (Stephane Bortzmeyer)
Date: Tue, 5 Jan 2016 17:37:05 +0100
Subject: Unicode password mapping for crypto standard
In-Reply-To: <568B54F8.5000802@seantek.com>
References: <568B54F8.5000802@seantek.com>
Message-ID: <20160105163705.GA4941@nic.fr>

On Mon, Jan 04, 2016 at 09:30:32PM -0800,
 Sean Leonard <lists+unicode at seantek.com> wrote 
 a message of 120 lines which said:

> how to take the Unicode input and get a consistent and reasonable
> stream of bits out on both ends. For example: should the password be
> case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE, etc.?

There is already a standard on that, RFC 7613 "Preparation,
Enforcement, and Comparison of Internationalized Strings Representing
Usernames and Passwords" <http://www.rfc-editor.org/rfc/rfc7613.txt>
and I suggest we use it and do not reinvent the wheel.


From raymond at almanach.co.uk  Tue Jan  5 12:20:07 2016
From: raymond at almanach.co.uk (Raymond Mercier)
Date: Tue, 5 Jan 2016 18:20:07 -0000
Subject: Turned Capital letter L (pointing to the left, with serifs)
In-Reply-To: <568BCD80.20707@ix.netcom.com>
References: <trinity-6bcf6f39-b92b-400a-89df-b2bd9409ad84-1451895309221@3capp-webde-bs39>
 <CAJAPE_Q4UN6QiNgo1U35me_WLTUnKS1Jr0KfVsBc5RvevVc3bQ@mail.gmail.com>
 <36275E73002B46CFAE7F2D3A81917380@UserPC>
 <041CC0E9-0FE5-4A57-A61F-529C9D96B58E@evertype.com>
 <568AA3BA.1030201@ix.netcom.com>
 <ECEEFA15-B085-4D28-AC53-7B89C6CB7E10@evertype.com>
 <568ACED4.3010505@ix.netcom.com> <7F2DF0906BEC42219F6825A4D225926A@UserPC>
 <CAJAPE_S4yXRK9Motak6AyUTgsOZeH2Xkf2jr6byZ=nFFM+rUcA@mail.gmail.com>
 <568B4FB3.2080708@ix.netcom.com>
 <trinity-7a6ad70d-746f-439a-ae00-100f9be0c370-1451985040574@3capp-webde-bs43>
 <CAJAPE_QTrJKt6T+JfW8VoCsD0xSO8uBfT3rc7tkOuLkS5Pf73Q@mail.gmail.com>
 <568BCD80.20707@ix.netcom.com>
Message-ID: <395E59FB896A41FBA00C1E782F83E3B6@UserPC>

I have looked at both the collected works of Gauss and at the English version of the Theoria Motus, in order to see what a later editor made of this symbol.


In the Werke the symbol ?7? continues to be used : C F Gauss, Werke, Vol. 7, ed. E J Schering, Gotha, 1871; ? 77, M = N + n?7?  ?  ?.


In the translation the ?7? is replaced by the lower case tau. 


Theory of the motion of the heavenly bodies moving about the sun in conic sections: a translation of Gauss's "Theoria motus." With an appendix. By Charles Henry Davis, Boston : Little, Brown and company, 1857; ? 77, M = N + n?  ?  ?.


So this seems to settle the matter of the identity, and just leaves one to puzzle over the German use of this sign for tau.

Raymond

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/f33cc808/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: carlfriedrichgau07gaus 100.jpg
Type: image/jpeg
Size: 64903 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160105/f33cc808/attachment.jpg>

From A.Schappo at lboro.ac.uk  Wed Jan  6 06:09:33 2016
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Wed, 6 Jan 2016 12:09:33 +0000
Subject: Unicode in the Curriculum?
In-Reply-To: <568AA4EB.5070606@ix.netcom.com>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
Message-ID: <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk>


On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote:

> On 1/4/2016 6:44 AM, Elizabeth J. Pyatt wrote:
>> Like some others on the list, I believe Unicode should be mentioned at different points in a programming curriculum, particularly at the time when ASCII would be taught. 
> ASCII shouldn't be taught, perhaps?

I really like the idea of questioning whether or not ASCII should even be taught.

Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text.

ASCII, along with, ISO-8859 ISO-2022 GB2312  ?etc? should be consigned to 

?and finally, the legacy character sets/encodings...

Maybe ASCII should now be flagged as deprecated https://twitter.com/andreschappo/status/684706421712228352

Andr? Schappo


From corbett.dav at husky.neu.edu  Wed Jan  6 08:42:21 2016
From: corbett.dav at husky.neu.edu (David Corbett)
Date: Wed, 6 Jan 2016 09:42:21 -0500
Subject: HENTAIGANA LETTER E-1
Message-ID: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>

Is there a difference between HENTAIGANA LETTER E-1 in L2/15-343 and
U+1B001 HIRAGANA LETTER ARCHAIC YE?

From kenwhistler at att.net  Wed Jan  6 09:43:41 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Wed, 6 Jan 2016 07:43:41 -0800
Subject: Unicode in the Curriculum?
In-Reply-To: <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk>
Message-ID: <568D362D.2020409@att.net>

Actually, ASCII should *not* be ignored or deprecated.

We *love* ASCII. The issue is just making sure that students understand
that the *true name* of "ASCII" is "UTF-8". It is just the very first 128
values that open into the entire world of Unicode characters.

It is a mind trick to play on young programmers: when you learn
"ASCII", you are just playing on the bunny slope at the UTF-8
ski resort. Slap on your snowboard and practice -- get out there
onto the 2-, 3- and 4-byte slopes with the experts!

--Ken

On 1/6/2016 4:09 AM, Andre Schappo wrote:
> On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote:
>
>> ASCII shouldn't be taught, perhaps?
> I really like the idea of questioning whether or not ASCII should even be taught.
>
> Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text.
>
> ASCII, along with, ISO-8859 ISO-2022 GB2312  ?etc? should be consigned to
>
> ?and finally, the legacy character sets/encodings...
>
> Maybe ASCII should now be flagged as deprecated https://twitter.com/andreschappo/status/684706421712228352
>
> Andr? Schappo
>
>
>
>


From everson at evertype.com  Wed Jan  6 10:22:12 2016
From: everson at evertype.com (Michael Everson)
Date: Wed, 6 Jan 2016 16:22:12 +0000
Subject: HENTAIGANA LETTER E-1
In-Reply-To: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>
References: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>
Message-ID: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com>

On 6 Jan 2016, at 14:42, David Corbett <corbett.dav at husky.neu.edu> wrote:
> 
> Is there a difference between HENTAIGANA LETTER E-1 in L2/15-343 and
> U+1B001 HIRAGANA LETTER ARCHAIC YE?

No, there is not. The former would be unified with it. 

Michael Everson * http://www.evertype.com/


From Shawn.Steele at microsoft.com  Wed Jan  6 12:59:22 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Wed, 6 Jan 2016 18:59:22 +0000
Subject: Unicode in the Curriculum?
In-Reply-To: <568D362D.2020409@att.net>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk> <568D362D.2020409@att.net>
Message-ID: <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>

+1  :)  

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler
Sent: Wednesday, January 6, 2016 7:44 AM
To: Andre Schappo <A.Schappo at lboro.ac.uk>
Cc: unicode at unicode.org
Subject: Re: Unicode in the Curriculum?

Actually, ASCII should *not* be ignored or deprecated.

We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters.

It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts!

--Ken

On 1/6/2016 4:09 AM, Andre Schappo wrote:
> On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote:
>
>> ASCII shouldn't be taught, perhaps?
> I really like the idea of questioning whether or not ASCII should even be taught.
>
> Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text.
>
> ASCII, along with, ISO-8859 ISO-2022 GB2312  .etc. should be consigned 
> to
>
> .and finally, the legacy character sets/encodings...
>
> Maybe ASCII should now be flagged as deprecated 
> https://twitter.com/andreschappo/status/684706421712228352
>
> Andr? Schappo
>
>
>
>


From gwalla at gmail.com  Wed Jan  6 16:42:42 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 6 Jan 2016 14:42:42 -0800
Subject: HENTAIGANA LETTER E-1
In-Reply-To: <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com>
References: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>
 <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com>
Message-ID: <CA+p4_H0kt=fg-r4W3tweiOEg4H+8BsuhB9p70xu71XbcjXnT+Q@mail.gmail.com>

On Wed, Jan 6, 2016 at 8:22 AM, Michael Everson <everson at evertype.com> wrote:
> On 6 Jan 2016, at 14:42, David Corbett <corbett.dav at husky.neu.edu> wrote:
>>
>> Is there a difference between HENTAIGANA LETTER E-1 in L2/15-343 and
>> U+1B001 HIRAGANA LETTER ARCHAIC YE?
>
> No, there is not. The former would be unified with it.
>
> Michael Everson * http://www.evertype.com/
>

They never took that out? I pointed it out back in July and Ken Lunde
passed it along in his official feedback AIUI:
<http://www.unicode.org/L2/L2015/15277-pubrev.html>. I could have
sworn they took it out after that. It's a very clear duplicate.

From asmus-inc at ix.netcom.com  Wed Jan  6 17:19:09 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Wed, 6 Jan 2016 15:19:09 -0800
Subject: Unicode in the Curriculum?
In-Reply-To: <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk> <568D362D.2020409@att.net>
 <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
Message-ID: <568DA0ED.2050804@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160106/188ca52d/attachment.html>

From Shawn.Steele at microsoft.com  Wed Jan  6 17:27:25 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Wed, 6 Jan 2016 23:27:25 +0000
Subject: Unicode in the Curriculum?
In-Reply-To: <568DA0ED.2050804@ix.netcom.com>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk> <568D362D.2020409@att.net>
 <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <568DA0ED.2050804@ix.netcom.com>
Message-ID: <SN1PR0301MB204526EB041488634BF68DAD82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>

Then it should be UTF-8.  Learning to do something in a non-Unicode code page and then redoing it for UTF-8 or UTF-16 merely leads to conversion problems, incompatibilities, and other nonsense.

If someone ?needs? to not use UTF-16 for whatever reason, then they should use UTF-8.  The ?advanced? training should be the other non-Unicode code pages.

Teach them right the first time.  They?ll never use a code page.

-Shawn

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t)
Sent: January 6, 2016 3:19 PM
To: unicode at unicode.org
Subject: Re: Unicode in the Curriculum?

On 1/6/2016 10:59 AM, Shawn Steele wrote:

+1  :)

I'm not going to join the happy chorus here.

The "bunny" slope for most people is their own native language...

A./


-----Original Message-----

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler

Sent: Wednesday, January 6, 2016 7:44 AM

To: Andre Schappo <A.Schappo at lboro.ac.uk><mailto:A.Schappo at lboro.ac.uk>

Cc: unicode at unicode.org<mailto:unicode at unicode.org>

Subject: Re: Unicode in the Curriculum?


Actually, ASCII should *not* be ignored or deprecated.


We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters.


It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts!


--Ken


On 1/6/2016 4:09 AM, Andre Schappo wrote:

On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote:


ASCII shouldn't be taught, perhaps?

I really like the idea of questioning whether or not ASCII should even be taught.


Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text.


ASCII, along with, ISO-8859 ISO-2022 GB2312  .etc. should be consigned

to


.and finally, the legacy character sets/encodings...


Maybe ASCII should now be flagged as deprecated

https://twitter.com/andreschappo/status/684706421712228352


Andr? Schappo


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160106/0855bc6f/attachment.html>

From asmus-inc at ix.netcom.com  Wed Jan  6 17:32:33 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Wed, 6 Jan 2016 15:32:33 -0800
Subject: Unicode in the Curriculum?
In-Reply-To: <SN1PR0301MB204526EB041488634BF68DAD82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk> <568D362D.2020409@att.net>
 <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <568DA0ED.2050804@ix.netcom.com>
 <SN1PR0301MB204526EB041488634BF68DAD82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
Message-ID: <568DA411.5000506@ix.netcom.com>

On 1/6/2016 3:27 PM, Shawn Steele wrote:
>
> Then it should be UTF-8.  Learning to do something in a non-Unicode 
> code page and then redoing it for UTF-8 or UTF-16 merely leads to 
> conversion problems, incompatibilities, and other nonsense.
>

Agreed.

But so does teaching people that it's OK to use ASCII-fallbacks, because 
a few of their characters are not available on the bunny slope.
>
> If someone ?needs? to not use UTF-16 for whatever reason, then they 
> should use UTF-8.  The ?advanced? training should be the other 
> non-Unicode code pages.
>

I think any training in non-Unicode character sets is beyond a standard 
curriculum, except perhaps History of Computing or Digital Archaeology  :)
>
> Teach them right the first time.  They?ll never use a code page.
>

+1

A./
>
> -Shawn
>
> *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of 
> *Asmus Freytag (t)
> *Sent:* January 6, 2016 3:19 PM
> *To:* unicode at unicode.org
> *Subject:* Re: Unicode in the Curriculum?
>
> On 1/6/2016 10:59 AM, Shawn Steele wrote:
>
>     +1  :)
>
>
> I'm not going to join the happy chorus here.
>
> The "bunny" slope for most people is their own native language...
>
> A./
>
>     -----Original Message-----
>
>     From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler
>
>     Sent: Wednesday, January 6, 2016 7:44 AM
>
>     To: Andre Schappo<A.Schappo at lboro.ac.uk> <mailto:A.Schappo at lboro.ac.uk>
>
>     Cc:unicode at unicode.org <mailto:unicode at unicode.org>
>
>     Subject: Re: Unicode in the Curriculum?
>
>     Actually, ASCII should *not* be ignored or deprecated.
>
>     We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters.
>
>     It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts!
>
>     --Ken
>
>     On 1/6/2016 4:09 AM, Andre Schappo wrote:
>
>         On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote:
>
>             ASCII shouldn't be taught, perhaps?
>
>         I really like the idea of questioning whether or not ASCII should even be taught.
>
>         Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text.
>
>         ASCII, along with, ISO-8859 ISO-2022 GB2312  .etc. should be consigned
>
>         to
>
>         .and finally, the legacy character sets/encodings...
>
>         Maybe ASCII should now be flagged as deprecated
>
>         https://twitter.com/andreschappo/status/684706421712228352
>
>         Andr? Schappo
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160106/5822b5f5/attachment.html>

From Shawn.Steele at microsoft.com  Wed Jan  6 17:36:16 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Wed, 6 Jan 2016 23:36:16 +0000
Subject: Unicode in the Curriculum?
In-Reply-To: <568DA411.5000506@ix.netcom.com>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk> <568D362D.2020409@att.net>
 <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <568DA0ED.2050804@ix.netcom.com>
 <SN1PR0301MB204526EB041488634BF68DAD82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <568DA411.5000506@ix.netcom.com>
Message-ID: <SN1PR0301MB2045D058F50203B01ED0556B82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>

?  I think any training in non-Unicode character sets is beyond a standard curriculum, except perhaps History of Computing or Digital Archaeology  :)

One could only hope.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160106/306819e5/attachment.html>

From eik at iki.fi  Thu Jan  7 00:50:26 2016
From: eik at iki.fi (Erkki I Kolehmainen)
Date: Thu, 7 Jan 2016 08:50:26 +0200
Subject: Unicode in the Curriculum?
In-Reply-To: <568DA0ED.2050804@ix.netcom.com>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk> <568D362D.2020409@att.net>
 <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <568DA0ED.2050804@ix.netcom.com>
Message-ID: <001b01d14917$b0cc75c0$12656140$@fi>

+1

 
I cannot but agree with Asmus.

 
Sincerely, Erkki

 
L?hett?j?: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Asmus Freytag (t)
L?hetetty: 7. tammikuuta 2016 01:19
Vastaanottaja: unicode at unicode.org
Aihe: Re: Unicode in the Curriculum?

 
On 1/6/2016 10:59 AM, Shawn Steele wrote:

+1  :)  


I'm not going to join the happy chorus here.

The "bunny" slope for most people is their own native language...

A./


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler
Sent: Wednesday, January 6, 2016 7:44 AM
To: Andre Schappo  <mailto:A.Schappo at lboro.ac.uk> <A.Schappo at lboro.ac.uk>
Cc: unicode at unicode.org
Subject: Re: Unicode in the Curriculum?
 
Actually, ASCII should *not* be ignored or deprecated.
 
We *love* ASCII. The issue is just making sure that students understand that the *true name* of "ASCII" is "UTF-8". It is just the very first 128 values that open into the entire world of Unicode characters.
 
It is a mind trick to play on young programmers: when you learn "ASCII", you are just playing on the bunny slope at the UTF-8 ski resort. Slap on your snowboard and practice -- get out there onto the 2-, 3- and 4-byte slopes with the experts!
 
--Ken
 
On 1/6/2016 4:09 AM, Andre Schappo wrote:

On 4 Jan 2016, at 16:59, Asmus Freytag (t) wrote:
 

ASCII shouldn't be taught, perhaps?

I really like the idea of questioning whether or not ASCII should even be taught.
 
Wherever in a programming curriculum, text processing/transmission/storage/presentation/encoding is taught, then it should be Unicode text.
 
ASCII, along with, ISO-8859 ISO-2022 GB2312  .etc. should be consigned 
to
 
.and finally, the legacy character sets/encodings...
 
Maybe ASCII should now be flagged as deprecated 
https://twitter.com/andreschappo/status/684706421712228352
 
Andr? Schappo
 
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160107/e1d4e820/attachment.html>

From mpsuzuki at hiroshima-u.ac.jp  Thu Jan  7 09:56:38 2016
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Fri, 8 Jan 2016 00:56:38 +0900
Subject: [Unicode] Re: HENTAIGANA LETTER E-1
In-Reply-To: <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com>
References: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>
 <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com>
 <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com>
Message-ID: <568E8AB6.5010307@hiroshima-u.ac.jp>

Hi,

I'm not a representative of the experts working for the
proposal from Japan NB, but I could explain something.

1) "They never took that out?" I'm not sure who you mean
"they" (UTC? JNB?), but it seems that no official document
asking for the response from JNB is submitted in WG2.
If UTC sends something officially, JNB would response
something, I believe.

2) Difference in HENTAIGANA LETTER E-1 and U+1B001.

U+1B001 is a character designed to note an ancient (and
extinct in modern Japanese language) pronunciation YE.

When standard kana was defined about 100 years ago,
the pronunciation YE was already merged to E.
Some scholars planned to use a few kana-like characters
to note such pronunciation (to discuss about the ancient
Japanese language pronunciation), and used some hentaigana-
like glyphs for such purpose. As far as I know, there is
no wide consensus that the glyph looking like U+1B001 was
historically used to note YE mainly, when YE and E were
distinctively used in Japanese language.

On the other hand, JNB's proposal does not include any
ancient/extinct pronunciation, Their phonetic coverage
is exactly same with modern Japanese language. So,
the glyph looking like U+1B001 is not designed to note
the pronunciation YE. The motivation why JNB proposed
hentaigana would be just because of their shape differences.

Therefore, U+1B001 and HENTAIGANA E-1 could be said as
differently designed, their designed usages are different.
Please do not think JNB hentaigana experts overlooked
U+1B001 and proposed a duplicated encoding. They ought to
have known it but proposed.

However, some WG2 experts suggested to unify them because
of the shape similarity. I'm not sure whether 2 glyphs are
indistinctively similar for hentaigana scholars, but I
accept with that some people are hard to distinguish.
I cannot distinguish some Latin and Greek alphabets when
they are displayed as single isolated character.


Regards,
mpsuzuki

Garth Wallace wrote:
> On Wed, Jan 6, 2016 at 8:22 AM, Michael Everson <everson at evertype.com> wrote:
>> On 6 Jan 2016, at 14:42, David Corbett <corbett.dav at husky.neu.edu> wrote:
>>> Is there a difference between HENTAIGANA LETTER E-1 in L2/15-343 and
>>> U+1B001 HIRAGANA LETTER ARCHAIC YE?
>> No, there is not. The former would be unified with it.
>>
>> Michael Everson * http://www.evertype.com/
>>
> 
> They never took that out? I pointed it out back in July and Ken Lunde
> passed it along in his official feedback AIUI:
> <http://www.unicode.org/L2/L2015/15277-pubrev.html>. I could have
> sworn they took it out after that. It's a very clear duplicate.

From gwalla at gmail.com  Thu Jan  7 14:39:50 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Thu, 7 Jan 2016 12:39:50 -0800
Subject: [Unicode] Re: HENTAIGANA LETTER E-1
In-Reply-To: <568E8AB6.5010307@hiroshima-u.ac.jp>
References: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>
 <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com>
 <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com>
 <568E8AB6.5010307@hiroshima-u.ac.jp>
Message-ID: <CA+p4_H2wJdTAmfL+dxRhMxwV9K92RH_ddwPQ8n5VHLE-6TFODQ@mail.gmail.com>

On Thu, Jan 7, 2016 at 7:56 AM, suzuki toshiya
<mpsuzuki at hiroshima-u.ac.jp> wrote:
> Hi,
>
> I'm not a representative of the experts working for the
> proposal from Japan NB, but I could explain something.
>
> 1) "They never took that out?" I'm not sure who you mean
> "they" (UTC? JNB?), but it seems that no official document
> asking for the response from JNB is submitted in WG2.
> If UTC sends something officially, JNB would response
> something, I believe.

I meant the JNB. I thought they had removed that character from the
later revised proposals that were posted on the UTC document register,
but I checked and I had apparently been mistaken.

The issue is only raised in passing in a footnote in Mr. Lunde's feedback.

> 2) Difference in HENTAIGANA LETTER E-1 and U+1B001.
>
> U+1B001 is a character designed to note an ancient (and
> extinct in modern Japanese language) pronunciation YE.
>
> When standard kana was defined about 100 years ago,
> the pronunciation YE was already merged to E.
> Some scholars planned to use a few kana-like characters
> to note such pronunciation (to discuss about the ancient
> Japanese language pronunciation), and used some hentaigana-
> like glyphs for such purpose. As far as I know, there is
> no wide consensus that the glyph looking like U+1B001 was
> historically used to note YE mainly, when YE and E were
> distinctively used in Japanese language.

AIUI they simply reused an existing hentaigana to make the
distinction, rather than making a new kana that just happened to look
exactly like it.

> On the other hand, JNB's proposal does not include any
> ancient/extinct pronunciation, Their phonetic coverage
> is exactly same with modern Japanese language. So,
> the glyph looking like U+1B001 is not designed to note
> the pronunciation YE. The motivation why JNB proposed
> hentaigana would be just because of their shape differences.
>
> Therefore, U+1B001 and HENTAIGANA E-1 could be said as
> differently designed, their designed usages are different.
> Please do not think JNB hentaigana experts overlooked
> U+1B001 and proposed a duplicated encoding. They ought to
> have known it but proposed.

It's not unknown for a single character to have more than one
pronunciation in different contexts.

> However, some WG2 experts suggested to unify them because
> of the shape similarity. I'm not sure whether 2 glyphs are
> indistinctively similar for hentaigana scholars, but I
> accept with that some people are hard to distinguish.
> I cannot distinguish some Latin and Greek alphabets when
> they are displayed as single isolated character.

We're not talking about about different scripts, though. Hentaigana
are obsolete hiragana (eliminated from modern written Japanese by a
spelling reform) but they are still hiragana. Latin and Greek, on the
other hand, are clearly separate but related scripts.

From mpsuzuki at hiroshima-u.ac.jp  Fri Jan  8 08:55:36 2016
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Fri, 8 Jan 2016 23:55:36 +0900
Subject: [Unicode] Re: HENTAIGANA LETTER E-1
In-Reply-To: <5c73403150964b0a802b31df2ab5fa76@PS1PR04MB0953.apcprd04.prod.outlook.com>
References: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>	<584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com>	<90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com>	<568E8AB6.5010307@hiroshima-u.ac.jp>
 <5c73403150964b0a802b31df2ab5fa76@PS1PR04MB0953.apcprd04.prod.outlook.com>
Message-ID: <568FCDE8.3020307@hiroshima-u.ac.jp>

Garth Wallace wrote:
> On Thu, Jan 7, 2016 at 7:56 AM, suzuki toshiya
> <mpsuzuki at hiroshima-u.ac.jp> wrote:
>> Hi,
>>
>> I'm not a representative of the experts working for the
>> proposal from Japan NB, but I could explain something.
>>
>> 1) "They never took that out?" I'm not sure who you mean
>> "they" (UTC? JNB?), but it seems that no official document
>> asking for the response from JNB is submitted in WG2.
>> If UTC sends something officially, JNB would response
>> something, I believe.
> 
> I meant the JNB. I thought they had removed that character from the
> later revised proposals that were posted on the UTC document register,
> but I checked and I had apparently been mistaken.
> 
> The issue is only raised in passing in a footnote in Mr. Lunde's feedback.

I think HENTAIGANA LETTER E-1 is intentionally proposed
to be coded separately, and no official document is
sent to JNB, so it is still kept as it was before.

>> 2) Difference in HENTAIGANA LETTER E-1 and U+1B001.
>>
>> U+1B001 is a character designed to note an ancient (and
>> extinct in modern Japanese language) pronunciation YE.
>>
>> When standard kana was defined about 100 years ago,
>> the pronunciation YE was already merged to E.
>> Some scholars planned to use a few kana-like characters
>> to note such pronunciation (to discuss about the ancient
>> Japanese language pronunciation), and used some hentaigana-
>> like glyphs for such purpose. As far as I know, there is
>> no wide consensus that the glyph looking like U+1B001 was
>> historically used to note YE mainly, when YE and E were
>> distinctively used in Japanese language.
> 
> AIUI they simply reused an existing hentaigana to make the
> distinction, rather than making a new kana that just happened to look
> exactly like it.

It is difficult (for me) to judge U+1B001 has same identity
with the hentaigana before kana standardization with similar
appearance. The rationale to encode U+1B001 was justified by
its unique phonetic value, so its character name is YE. It
is normative. Some people may think they can identify the
hentaigana by their glyph shapes only, but others may have
different view. As the first proposal (L2/15-193) prioritized
the (modern) phonetic value as the first key to identify the
glyph, I think some user community would want to identify the
glyph by the phonetic value. I don't say it is the best
solution, but I say they have their own rationale.

>> On the other hand, JNB's proposal does not include any
>> ancient/extinct pronunciation, Their phonetic coverage
>> is exactly same with modern Japanese language. So,
>> the glyph looking like U+1B001 is not designed to note
>> the pronunciation YE. The motivation why JNB proposed
>> hentaigana would be just because of their shape differences.
>>
>> Therefore, U+1B001 and HENTAIGANA E-1 could be said as
>> differently designed, their designed usages are different.
>> Please do not think JNB hentaigana experts overlooked
>> U+1B001 and proposed a duplicated encoding. They ought to
>> have known it but proposed.
> 
> It's not unknown for a single character to have more than one
> pronunciation in different contexts.

Is it easy to distinguish the contexts how the "unified U+1B001"
should be pronounced (some case, it must be YE, some case, it
must be E, some case, both of YE/E are acceptable)? I don't have
good connection with the users community of U+1B001, so I cannot
estimate which is easier (less troublesome for existing user
communities) in separation or unification. Do you have any
connection with the user community of U+1B001?

>> However, some WG2 experts suggested to unify them because
>> of the shape similarity. I'm not sure whether 2 glyphs are
>> indistinctively similar for hentaigana scholars, but I
>> accept with that some people are hard to distinguish.
>> I cannot distinguish some Latin and Greek alphabets when
>> they are displayed as single isolated character.
> 
> We're not talking about about different scripts, though. Hentaigana
> are obsolete hiragana (eliminated from modern written Japanese by a
> spelling reform) but they are still hiragana. Latin and Greek, on the
> other hand, are clearly separate but related scripts.

I'm afraid that the counting how many scripts in the set
of modern hiragana, U+1B001 and JNB proposal could depend
on the people. Some people may count only 1, some people
may count 2, some people may count 3. If there is stable
consensus already, it could be used as the rational to unify,
but, I don't think so. Anyway, Latin and Greek were not
good example, I'm sorry.

Regards,
mpsuzuki

From lists+unicode at seantek.com  Sat Jan  9 17:27:04 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sat, 9 Jan 2016 15:27:04 -0800
Subject: Unicode password mapping for crypto standard
In-Reply-To: <20160105163705.GA4941@nic.fr>
References: <568B54F8.5000802@seantek.com> <20160105163705.GA4941@nic.fr>
Message-ID: <56919748.2000906@seantek.com>

On 1/5/2016 8:37 AM, Stephane Bortzmeyer wrote:
> On Mon, Jan 04, 2016 at 09:30:32PM -0800,
>   Sean Leonard <lists+unicode at seantek.com> wrote
>   a message of 120 lines which said:
>
>> how to take the Unicode input and get a consistent and reasonable
>> stream of bits out on both ends. For example: should the password be
>> case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE, etc.?
> There is already a standard on that, RFC 7613 "Preparation,
> Enforcement, and Comparison of Internationalized Strings Representing
> Usernames and Passwords" <http://www.rfc-editor.org/rfc/rfc7613.txt>
> and I suggest we use it and do not reinvent the wheel.
>

Hello (sorry for my delayed response):

Yes, I am aware of PRECIS. I actually asked the PRECIS mailing list a 
couple of months ago but got no feedback.

PRECIS is an overarching framework; it doesn't specify mappings in 
particular. So merely saying "PRECIS!" is not enough.

In my proposal, the parameter "password-mapping" can take two relevant 
PRECIS forms:
*precis
*precis-XXX (where XXX is a registered profile name)

In the first form, the mapping is defined by the OpaqueString profile, 
/as amended from time to time/. This is the PRECIS password profile but 
it doesn't specify a version or anything so additional characters may be 
admitted in the future or treated differently, as the standards get 
updated (including the Unicode standard). It is meant to be "living".

In the second form, it's PRECIS but is fixed to the specific profile 
name. An interesting use case might be the recently registered 
"Nickname" class [RFC7700] and 
<http://www.iana.org/assignments/precis-parameters/profiles/Nickname.txt>. 
In that profile, spaces are stripped and characters are treated 
case-insensitively with Unicode Default Case Folding (among other 
things). In applications where the encryption key is derived from a user 
handle, this might be a relevant profile to name. Compare with 
UsernameCaseMapped, etc.

Sean

From lists+unicode at seantek.com  Sat Jan  9 17:30:45 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Sat, 9 Jan 2016 15:30:45 -0800
Subject: Unicode password mapping for crypto standard
In-Reply-To: <CAN49p6pj-3EE=okMYZw73r4EfTb1JZ2vc6vPkuqkvOe51PqtqA@mail.gmail.com>
References: <568B54F8.5000802@seantek.com>
 <CAN49p6pj-3EE=okMYZw73r4EfTb1JZ2vc6vPkuqkvOe51PqtqA@mail.gmail.com>
Message-ID: <56919825.6080307@seantek.com>

On 1/5/2016 8:26 AM, Markus Scherer wrote:
> I would specify that UTF-8 must be used, without mapping.
> US-ASCII is a proper subset, so need not be mentioned explicitly, nor 
> distinguished in the protocol.
> Mappings would require that all implementations carry relevant data, 
> and are up to date to recent versions of Unicode, or else 
> previously-unassigned code points will cause failures.
> As long as a user types the same password the same way, or with IMEs 
> that produce the same output, they are fine. Strange variants might 
> improve password security.

Right.

In PRECIS, UTF-8 is enforced. However as you point out, the issue is 
that "strange variants" exist, as well as different IMEs and different 
keyboard/keystroke combinations. A case in point is that 0xFF is not a 
valid UTF-8 octet. However, nothing constrains the underlying technology 
not to use 0xFF, so there should be a way for a user (or process) to 
force the use of specific octet strings as inputs. That is why the 
"password-mapping" parameter is proposed as a hint rather than a strict 
rule.

Also as pointed out, PKCS#8 encrypted blobs are used within PKCS #12, 
which has its own Unicode mapping (based on UTF-16LE).

Sean

From albrecht.dreiheller at siemens.com  Mon Jan 11 07:22:53 2016
From: albrecht.dreiheller at siemens.com (Dreiheller, Albrecht)
Date: Mon, 11 Jan 2016 13:22:53 +0000
Subject: AW: Unicode in the Curriculum?
In-Reply-To: <SN1PR0301MB204526EB041488634BF68DAD82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk>
 <568D362D.2020409@att.net>
 <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <568DA0ED.2050804@ix.netcom.com>
 <SN1PR0301MB204526EB041488634BF68DAD82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
Message-ID: <3E10480FE4510343914E4312AB46E74212B75C4A@DEFTHW99EH5MSX.ww902.siemens.net>

From: Unicode [mailto:unicode-bounces at unicode.org] Im Auftrag von Shawn Steele
Date: Donnerstag, 7. Januar 2016 00:27
To: Asmus Freytag (t); unicode at unicode.org
Subject: RE: Unicode in the Curriculum?

Then it should be UTF-8.  Learning to do something in a non-Unicode code page and then redoing it for UTF-8 or UTF-16 merely leads to conversion problems, incompatibilities, and other nonsense.

If someone ?needs? to not use UTF-16 for whatever reason, then they should use UTF-8.  The ?advanced? training should be the other non-Unicode code pages.

Teach them right the first time.  They?ll never use a code page.

-Shawn

They'll never use a code page for encoding, I agree, but ?

When setting up a requirement specification for a font manufacturer for a new font for Chinese (both simplified and traditional), Japanese or  Korean,
there is no easy  way to define the character repertoire without refering to the code pages like GB2312, Big-5, JIS, etc.

A.D.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160111/aff91c08/attachment.html>

From public at khwilliamson.com  Mon Jan 11 16:42:37 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 11 Jan 2016 15:42:37 -0700
Subject: Trying to understand Line_Break property apparent discrepancy
Message-ID: <56942FDD.9070805@khwilliamson.com>

It appears that 
http://www.unicode.org/Public/8.0.0/ucd/auxiliary/LineBreakTest.txt is 
testing a tailoring rather than the default line break algorithm, 
contrary to its heading "# Default Line Break Test".  And 
http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.html follows 
along.

For example, the default algorithm as shown in 
http://www.unicode.org/reports/tr14/#Table2 follows LB25, which is an 
approximation of the desired behavior.  But the test and html don't 
follow this.  I suspect they are looking for the tailoring described in 
http://www.unicode.org/reports/tr14/#Examples example 7.

For example, the test file tests for, and the html says that a class CL 
code point followed by a class PO one is an unconditional line break 
opportunity, based on rule 999. (which is the same as LB31 in TR14)

Whereas, http://www.unicode.org/reports/tr14/#Table2 says that a class 
CL code point followed by a class PO one is an

	 "indirect break opportunity 	B % A is equivalent to B ? A and B SP+ ? 
A; in other words, do not break before A, unless one or more spaces 
follow B."  This is by LB25 and LB18.

There is a discrepancy here, which could be resolved either by changing 
the tests and html to follow LB25, or documenting that these are for 
something above and beyond the default algorithm.  (There may also be 
other discrepancies that I haven't stumbled against)


From public at khwilliamson.com  Mon Jan 11 17:32:47 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 11 Jan 2016 16:32:47 -0700
Subject: Redundancy in TR14
Message-ID: <56943B9F.90701@khwilliamson.com>

Example 7 in http://www.unicode.org/reports/tr14/#Examples

has these two rules

NU ? (NU | SY | IS)

NU (NU | SY | IS)* ? (NU | SY | IS | CL | CP )

It appears to me that the first rule generates a subset of what the 2nd 
rule generates, and so is useless.  It could be hence removed for 
simplicity, unless I'm missing something or there is a typo and it is 
meant to generate something else.


From public at khwilliamson.com  Mon Jan 11 18:16:56 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 11 Jan 2016 17:16:56 -0700
Subject: Trying to understand Line_Break property apparent discrepancy
In-Reply-To: <56942FDD.9070805@khwilliamson.com>
References: <56942FDD.9070805@khwilliamson.com>
Message-ID: <569445F8.5090000@khwilliamson.com>

On 01/11/2016 03:42 PM, Karl Williamson wrote:
> It appears that
> http://www.unicode.org/Public/8.0.0/ucd/auxiliary/LineBreakTest.txt is
> testing a tailoring rather than the default line break algorithm,
> contrary to its heading "# Default Line Break Test".  And
> http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.html follows
> along.
>
> For example, the default algorithm as shown in
> http://www.unicode.org/reports/tr14/#Table2 follows LB25, which is an
> approximation of the desired behavior.  But the test and html don't
> follow this.  I suspect they are looking for the tailoring described in
> http://www.unicode.org/reports/tr14/#Examples example 7.
>
> For example, the test file tests for, and the html says that a class CL
> code point followed by a class PO one is an unconditional line break
> opportunity, based on rule 999. (which is the same as LB31 in TR14)
>
> Whereas, http://www.unicode.org/reports/tr14/#Table2 says that a class
> CL code point followed by a class PO one is an
>
>       "indirect break opportunity     B % A is equivalent to B ? A and B
> SP+ ? A; in other words, do not break before A, unless one or more
> spaces follow B."  This is by LB25 and LB18.
>
> There is a discrepancy here, which could be resolved either by changing
> the tests and html to follow LB25, or documenting that these are for
> something above and beyond the default algorithm.  (There may also be
> other discrepancies that I haven't stumbled against)
>
>
>
>

Ooops.  I didn't see this statement in the html file:
"The Line Break tests use tailoring of numbers described in Example 7 of 
Section 8.2 Examples of Customization. They also differ from the results 
produced by a pair table implementation in sequences like: ZW SP CL."

This explains everything.  Please disregard the earlier email from me.


From charupdate at orange.fr  Mon Jan 11 21:54:31 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 12 Jan 2016 04:54:31 +0100 (CET)
Subject: Unicode in the Curriculum?
Message-ID: <1906649992.303.1452570871573.JavaMail.www@wwinf1e33>

On 1/6/2016 3:27 PM, Shawn Steele wrote: 

[?]

> Teach them right the first time. They?ll never use a code page. 

On 6 Jan 2016 15:32:33, Asmus Freytag (t) wrote:

[?]

> +1 

On 11 Jan 2016 14:22:53, Albrecht Dreiheller wrote:

[?]

> When setting up a requirement specification for a font manufacturer for a new font for Chinese [?], there is no easy way to define the character repertoire without refering to the code pages [?].

Among the many uses of code pages, this thread was focusing on training for computer scientists. If enlarging the subject to cover font design and possibly keyboard input as well is really useful, then from a German POV it might be interesting to look up the discussion at http://www.typografie.info/3/topic/26274-liste-unbedingt-notwendiger-zeichen/ Retrieved January 7, 2016.

For *IT students* (and other people as well), the day they encounter their first ?U+?, it is straightforward either to look up some pieces of information about Unicode, since they have already a strong experience of the internet; or at least if they don?t (and anyway), to use the Contact form to submit their questions. While the interest *on the whole* won?t be missing, the actual problem is oversolliciting and misdirecting the interest through the entertainment and advertising industries. The attention as a limited resource is even uselessly threatened through the side-effects of consumption (food, ?). 

Checking these problems is a matter of on-going efforts. I just would like to complete the discussion on that side.

[By this occasion I apologize for my last and previous e-mails; I hope I got some skill to stop bothering uselessly, and to hopefully focus on the topics I?m able to do some useful work in. Soon I should send a link FWIW.]

Best regards,

Marcel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160112/fb149db6/attachment.html>

From mark at macchiato.com  Mon Jan 11 23:55:04 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 12 Jan 2016 06:55:04 +0100
Subject: Redundancy in TR14
In-Reply-To: <56943B9F.90701@khwilliamson.com>
References: <56943B9F.90701@khwilliamson.com>
Message-ID: <CAJ2xs_HzU+ZEReO=VhC5_VxbAyYZ7GOX8Afev5k_KqTi61_0Hw@mail.gmail.com>

Looks that way to me too. Can you submit this as feedback?

{phone}
On Jan 12, 2016 00:39, "Karl Williamson" <public at khwilliamson.com> wrote:

> Example 7 in http://www.unicode.org/reports/tr14/#Examples
>
> has these two rules
>
> NU ? (NU | SY | IS)
>
> NU (NU | SY | IS)* ? (NU | SY | IS | CL | CP )
>
> It appears to me that the first rule generates a subset of what the 2nd
> rule generates, and so is useless.  It could be hence removed for
> simplicity, unless I'm missing something or there is a typo and it is meant
> to generate something else.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160112/225148ae/attachment.html>

From public at khwilliamson.com  Tue Jan 12 00:25:46 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 11 Jan 2016 23:25:46 -0700
Subject: Redundancy in TR14
In-Reply-To: <CAJ2xs_HzU+ZEReO=VhC5_VxbAyYZ7GOX8Afev5k_KqTi61_0Hw@mail.gmail.com>
References: <56943B9F.90701@khwilliamson.com>
 <CAJ2xs_HzU+ZEReO=VhC5_VxbAyYZ7GOX8Afev5k_KqTi61_0Hw@mail.gmail.com>
Message-ID: <56949C6A.4000306@khwilliamson.com>

On 01/11/2016 10:55 PM, Mark Davis ?? wrote:
> Looks that way to me too. Can you submit this as feedback?

will do

>
> {phone}
>
> On Jan 12, 2016 00:39, "Karl Williamson" <public at khwilliamson.com
> <mailto:public at khwilliamson.com>> wrote:
>
>     Example 7 in http://www.unicode.org/reports/tr14/#Examples
>
>     has these two rules
>
>     NU ? (NU | SY | IS)
>
>     NU (NU | SY | IS)* ? (NU | SY | IS | CL | CP )
>
>     It appears to me that the first rule generates a subset of what the
>     2nd rule generates, and so is useless.  It could be hence removed
>     for simplicity, unless I'm missing something or there is a typo and
>     it is meant to generate something else.
>


From drott at google.com  Wed Jan 13 05:25:56 2016
From: drott at google.com (=?UTF-8?Q?Dominik_R=C3=B6ttsches?=)
Date: Wed, 13 Jan 2016 13:25:56 +0200
Subject: Additional ZWJ prefixes in ZWJ emoji sequences page
Message-ID: <CAN6muBtHgRqxmA-NQk84pcN1r1zPdmSdSmteGxT5N6JLwaP-ag@mail.gmail.com>

Hi,

if I am not mistaken, there are a couple of additional, probably
unintentional ZWJ prefixes in field count 1,2,3 and 4,5,6 in

http://www.unicode.org/emoji/charts/emoji-zwj-sequences.html

>From a hexdump of the page:

00008dd0  74 72 3e 0a 3c 74 72 3e  0a 3c 74 64 20 63 6c 61  |tr>.<tr>.<td cla|

00008de0  73 73 3d 27 72 63 68 61  72 73 27 3e 31 3c 2f 74  |ss='rchars'>1</t|

00008df0  64 3e 0a 3c 74 64 3e 55  2b 31 46 34 36 39 20 55  |d>.<td>U+1F469 U|

00008e00  2b 32 30 30 44 20 55 2b  32 37 36 34 20 55 2b 46  |+200D U+2764 U+F|

00008e10  45 30 46 20 55 2b 32 30  30 44 20 55 2b 31 46 34  |E0F U+200D U+1F4|

00008e20  38 42 20 55 2b 32 30 30  44 20 55 2b 31 46 34 36  |8B U+200D U+1F46|

00008e30  38 3c 2f 74 64 3e 0a 3c  74 64 20 63 6c 61 73 73  |8</td>.<td class|

00008e40  3d 27 63 68 61 72 73 27  3e e2 80 8d f0 9f 91 a9  |='chars'>.......|


So, after the U+003E '>', there is the e2 80 8d sequence of a ZWJ
there in field 1.

Perhaps someone could fix that.

Thanks,

Dominik

From mark at macchiato.com  Wed Jan 13 09:51:38 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Wed, 13 Jan 2016 16:51:38 +0100
Subject: Additional ZWJ prefixes in ZWJ emoji sequences page
In-Reply-To: <CAN6muBtHgRqxmA-NQk84pcN1r1zPdmSdSmteGxT5N6JLwaP-ag@mail.gmail.com>
References: <CAN6muBtHgRqxmA-NQk84pcN1r1zPdmSdSmteGxT5N6JLwaP-ag@mail.gmail.com>
Message-ID: <CAJ2xs_H7QYPh4Z5uCNA+S3V9MRA1bOLtvbyOYxy2DYAXT-c5jw@mail.gmail.com>

You're right. It's between the closing > and the following ???  character

\u003e *\u200d* \U0001f469

We'll see why that spurious character is there in the HTML.

Mark

On Wed, Jan 13, 2016 at 12:25 PM, Dominik R?ttsches <drott at google.com>
wrote:

> Hi,
>
> if I am not mistaken, there are a couple of additional, probably
> unintentional ZWJ prefixes in field count 1,2,3 and 4,5,6 in
>
> http://www.unicode.org/emoji/charts/emoji-zwj-sequences.html
>
> From a hexdump of the page:
>
> 00008dd0  74 72 3e 0a 3c 74 72 3e  0a 3c 74 64 20 63 6c 61  |tr>.<tr>.<td
> cla|
>
> 00008de0  73 73 3d 27 72 63 68 61  72 73 27 3e 31 3c 2f 74
> |ss='rchars'>1</t|
>
> 00008df0  64 3e 0a 3c 74 64 3e 55  2b 31 46 34 36 39 20 55
> |d>.<td>U+1F469 U|
>
> 00008e00  2b 32 30 30 44 20 55 2b  32 37 36 34 20 55 2b 46  |+200D U+2764
> U+F|
>
> 00008e10  45 30 46 20 55 2b 32 30  30 44 20 55 2b 31 46 34  |E0F U+200D
> U+1F4|
>
> 00008e20  38 42 20 55 2b 32 30 30  44 20 55 2b 31 46 34 36  |8B U+200D
> U+1F46|
>
> 00008e30  38 3c 2f 74 64 3e 0a 3c  74 64 20 63 6c 61 73 73  |8</td>.<td
> class|
>
> 00008e40  3d 27 63 68 61 72 73 27  3e e2 80 8d f0 9f 91 a9
> |='chars'>.......|
>
>
> So, after the U+003E '>', there is the e2 80 8d sequence of a ZWJ
> there in field 1.
>
> Perhaps someone could fix that.
>
> Thanks,
>
> Dominik
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160113/cec492db/attachment.html>

From gwalla at gmail.com  Wed Jan 13 15:39:26 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 13 Jan 2016 13:39:26 -0800
Subject: [Unicode] Re: HENTAIGANA LETTER E-1
In-Reply-To: <568FCDE8.3020307@hiroshima-u.ac.jp>
References: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>
 <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com>
 <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com>
 <568E8AB6.5010307@hiroshima-u.ac.jp>
 <5c73403150964b0a802b31df2ab5fa76@PS1PR04MB0953.apcprd04.prod.outlook.com>
 <568FCDE8.3020307@hiroshima-u.ac.jp>
Message-ID: <CA+p4_H14sAYcEOmXmi1bvkq+7Xk2ReVwv7A37Es1+=bFChoGUw@mail.gmail.com>

On Fri, Jan 8, 2016 at 6:55 AM, suzuki toshiya
<mpsuzuki at hiroshima-u.ac.jp> wrote:
> Garth Wallace wrote:
>> On Thu, Jan 7, 2016 at 7:56 AM, suzuki toshiya
>> <mpsuzuki at hiroshima-u.ac.jp> wrote:
>>> Hi,
>>>
>>> I'm not a representative of the experts working for the
>>> proposal from Japan NB, but I could explain something.
>>>
>>> 1) "They never took that out?" I'm not sure who you mean
>>> "they" (UTC? JNB?), but it seems that no official document
>>> asking for the response from JNB is submitted in WG2.
>>> If UTC sends something officially, JNB would response
>>> something, I believe.
>>
>> I meant the JNB. I thought they had removed that character from the
>> later revised proposals that were posted on the UTC document register,
>> but I checked and I had apparently been mistaken.
>>
>> The issue is only raised in passing in a footnote in Mr. Lunde's feedback.
>
> I think HENTAIGANA LETTER E-1 is intentionally proposed
> to be coded separately, and no official document is
> sent to JNB, so it is still kept as it was before.
>
>>> 2) Difference in HENTAIGANA LETTER E-1 and U+1B001.
>>>
>>> U+1B001 is a character designed to note an ancient (and
>>> extinct in modern Japanese language) pronunciation YE.
>>>
>>> When standard kana was defined about 100 years ago,
>>> the pronunciation YE was already merged to E.
>>> Some scholars planned to use a few kana-like characters
>>> to note such pronunciation (to discuss about the ancient
>>> Japanese language pronunciation), and used some hentaigana-
>>> like glyphs for such purpose. As far as I know, there is
>>> no wide consensus that the glyph looking like U+1B001 was
>>> historically used to note YE mainly, when YE and E were
>>> distinctively used in Japanese language.
>>
>> AIUI they simply reused an existing hentaigana to make the
>> distinction, rather than making a new kana that just happened to look
>> exactly like it.
>
> It is difficult (for me) to judge U+1B001 has same identity
> with the hentaigana before kana standardization with similar
> appearance. The rationale to encode U+1B001 was justified by
> its unique phonetic value, so its character name is YE. It
> is normative. Some people may think they can identify the
> hentaigana by their glyph shapes only, but others may have
> different view. As the first proposal (L2/15-193) prioritized
> the (modern) phonetic value as the first key to identify the
> glyph, I think some user community would want to identify the
> glyph by the phonetic value. I don't say it is the best
> solution, but I say they have their own rationale.

The rationale for U+1B001, AIUI, was that it was used in some modern
scholarly works about the history of the Japanese language to
distinguish between /e/ and /je/ before they merged in the modern
language. I don't know if historically that distinction existed in
writing.

The character name is normative. But the pronunciation is not, and I
don't think the Unicode name should be taken to mean that it can only
be used when a particular pronunciation is intended. Spelling and
pronunciation are outside of Unicode's scope.

>>> On the other hand, JNB's proposal does not include any
>>> ancient/extinct pronunciation, Their phonetic coverage
>>> is exactly same with modern Japanese language. So,
>>> the glyph looking like U+1B001 is not designed to note
>>> the pronunciation YE. The motivation why JNB proposed
>>> hentaigana would be just because of their shape differences.
>>>
>>> Therefore, U+1B001 and HENTAIGANA E-1 could be said as
>>> differently designed, their designed usages are different.
>>> Please do not think JNB hentaigana experts overlooked
>>> U+1B001 and proposed a duplicated encoding. They ought to
>>> have known it but proposed.
>>
>> It's not unknown for a single character to have more than one
>> pronunciation in different contexts.
>
> Is it easy to distinguish the contexts how the "unified U+1B001"
> should be pronounced (some case, it must be YE, some case, it
> must be E, some case, both of YE/E are acceptable)? I don't have
> good connection with the users community of U+1B001, so I cannot
> estimate which is easier (less troublesome for existing user
> communities) in separation or unification. Do you have any
> connection with the user community of U+1B001?

I do not. For that matter, I'm not a member of the UTC. I've only read
Nozomu Kat?'s original proposal
<http://www.unicode.org/L2/L2007/07421-e-ye.pdf> and some of the
documents that followed.

>>> However, some WG2 experts suggested to unify them because
>>> of the shape similarity. I'm not sure whether 2 glyphs are
>>> indistinctively similar for hentaigana scholars, but I
>>> accept with that some people are hard to distinguish.
>>> I cannot distinguish some Latin and Greek alphabets when
>>> they are displayed as single isolated character.
>>
>> We're not talking about about different scripts, though. Hentaigana
>> are obsolete hiragana (eliminated from modern written Japanese by a
>> spelling reform) but they are still hiragana. Latin and Greek, on the
>> other hand, are clearly separate but related scripts.
>
> I'm afraid that the counting how many scripts in the set
> of modern hiragana, U+1B001 and JNB proposal could depend
> on the people. Some people may count only 1, some people
> may count 2, some people may count 3. If there is stable
> consensus already, it could be used as the rational to unify,
> but, I don't think so. Anyway, Latin and Greek were not
> good example, I'm sorry.

You're right, it's unclear, though at least in Unicode terms I don't
think you can really count 3. U+1B001 has the script property
"hiragana", but that still leaves the question of whether hentaigana
should be considered a separate script from hiragana. The proposal
summary for L2/15-239 does say it's for a new script, named
"hentaigana". However, elsewhere in that document it says "In year
1900, Japanese government selected one phonogram for each phonetic
value and announced not to use other phonograms in elementary
education. Afterward, the selected phonograms are called ?HIRAGANA?
and others are called ?HENTAIGANA?, the meaning is variants of a
HIRAGANA." Also, the original proposal was to encode them as Standard
Variation Sequences of hiragana, which I think implies that the JNB,
at least at that time, considered them to be variants of hiragana and
not something other than hiragana. AIUI, and correct me if I'm wrong,
hentaigana is a retronym; at the time they were in regular use they
were used in combination with and interchangeably with the modern set
of hiragana, and did not have an identity as a distinct set until the
spelling reform of 1900.

I believe that in Unicode, characters that were once used in a script
but were later made obsolete are usually still considered part of the
same script as the surviving set. That has been the case for Latin, at
least.


From frederic.grosshans at gmail.com  Thu Jan 14 06:04:49 2016
From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=)
Date: Thu, 14 Jan 2016 13:04:49 +0100
Subject: [Unicode] Re: HENTAIGANA LETTER E-1
In-Reply-To: <CA+p4_H14sAYcEOmXmi1bvkq+7Xk2ReVwv7A37Es1+=bFChoGUw@mail.gmail.com>
References: <CAKQz=Z6FoXjriAEf+7hp_sZPgqpBuG9T3yEMYxGUDc0bWdVyfg@mail.gmail.com>
 <584B230B-BFBF-4A93-95B3-6F1E463E6181@evertype.com>
 <90baabfa798243e6b41e4f3d6502ed3c@PS1PR04MB0953.apcprd04.prod.outlook.com>
 <568E8AB6.5010307@hiroshima-u.ac.jp>
 <5c73403150964b0a802b31df2ab5fa76@PS1PR04MB0953.apcprd04.prod.outlook.com>
 <568FCDE8.3020307@hiroshima-u.ac.jp>
 <CA+p4_H14sAYcEOmXmi1bvkq+7Xk2ReVwv7A37Es1+=bFChoGUw@mail.gmail.com>
Message-ID: <56978EE1.9070206@gmail.com>

Le 13/01/2016 22:39, Garth Wallace a ?crit :
> The rationale for U+1B001, AIUI, was that it was used in some modern
> scholarly works about the history of the Japanese language to
> distinguish between/e/  and/je/  before they merged in the modern
> language. I don't know if historically that distinction existed in
> writing.
>
> The character name is normative. But the pronunciation is not, and I
> don't think the Unicode name should be taken to mean that it can only
> be used when a particular pronunciation is intended. Spelling and
> pronunciation are outside of Unicode's scope.

Let us suppose *HENTAIGANA LETTER E-1 is to be unified with ?? U+1B001 
HIRAGANA LETTER ARCHAIC YE.
An annotation can be added to U+1B001 description to confirm this usage. 
If it is not enough, and an official HENTAIGANA name is desired for 
consistency, I think it is conceivable to add the following line in 
http://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt

    1B001;HENTAIGANA LETTER E-1;alternate

Fr?d?ric

From charupdate at orange.fr  Thu Jan 14 07:38:52 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 14 Jan 2016 14:38:52 +0100 (CET)
Subject: Unicode in the Curriculum?
In-Reply-To: <SN1PR0301MB2045D058F50203B01ED0556B82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
References: <mailman.0.1451671201.13788.unicode@unicode.org>
 <B9696509-8692-4898-8040-C8C516B42CF4@psu.edu>
 <568AA4EB.5070606@ix.netcom.com>
 <C85D17C6-70B0-4A94-B7A3-49120C2D16D9@lboro.ac.uk>
 <568D362D.2020409@att.net>
 <SN1PR0301MB204575062E07C4EA5F95B11882F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <568DA0ED.2050804@ix.netcom.com>
 <SN1PR0301MB204526EB041488634BF68DAD82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <568DA411.5000506@ix.netcom.com>
 <SN1PR0301MB2045D058F50203B01ED0556B82F40@SN1PR0301MB2045.namprd03.prod.outlook.com>
Message-ID: <1262344245.11703.1452778732578.JavaMail.www@wwinf1n02>

On January 7, 2016, at 00:39, Shawn Steele wrote:

>> ? I think any training in non-Unicode character sets is beyond a standard curriculum, except perhaps History of Computing or Digital Archaeology :)

> One could only hope.


Since the topic widened to font design, one easily agrees that also in these curricula, Unicode is taught, and code pages are replaced with Unicode collections. Even the Multilingual European Subsets were originally declared to be an intermediate stage on the road towards the implementation of the whole UCS. I fully agree that code pages are to be relegated into the archives. *If* there is an exception for CJK fonts, it merely confirms the rule. Last fall we?ve seen the side effects of remnant code page use in the recognition of native languages in Northwest Territories. I apologize to all persons I?ve hurt.

E.g. one may teach that Latin script is covered by the Unicode collections Basic Latin ? Latin-1 Supplement ? Latin Extended-A ? Latin Extended-B ? IPA Extensions ? Spacing Modifier Letters ? Combining Diacritical Marks ? Combining Diacritical Marks Extended ? Phonetic Extensions ? Phonetic Extensions Supplement ? Combining Diacritical Marks Supplement ? Latin Extended Additional ? General Punctuation ? Superscripts and Subscripts ? [most of] Currency Symbols ? Letterlike Symbols ? Number Forms ? Enclosed Alphanumerics ? Latin Extended-C ? Supplemental Punctuation ? Modifier Tone Letters ? Latin Extended-D ? Latin Extended-E ? Combining Half Marks ? Mathematical Alphanumeric Symbols ? Enclosed Alphanumeric Supplement, AFAIK.

The more we add to the cart, the more the specified font will be useful?but the more it will be costly. Therefore cheaper fonts may restrict themselves to less collections or subsets of them, at risk of not covering e.g. U+2010 HYPHEN and U+02BC LETTER APOSTROPHE. I apologize again to all persons I?ve hurt in that other thread. In fact I felt that something is wrong, but above all I?was wrong myself. Looking for defaults on Unicode?s side was a big mistake. You are heroes.

BTW, for keyboard input, there is strictly no problem on Windows. Typing the ??1,600?Latin characters +?punctuation is straightforward since we know how keyboard layout drivers work. There is mainly *one* long dead trans list, and almost every keyboard can have Kana on Right Alt, and Compose on Kana?+?Space. ISO/IEC?9995 should soon be revised to become ultimately fit for real mainstream computers. Microsoft didn?t wait for ISO/IEC?9995-2 to provide performative APIs, nor did Tavultesoft wait for ISO/IEC?9995-11 to provide performative UIs.

Would it be possible to teach them too how a Unicode keyboard is made, and how KbdUTool works? And Keyman Developer? Perhaps in a lecture on C, or in a workshop on compilers, or in a lecture on UI design?

Marcel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160114/156bd365/attachment.html>

From eric.muller at efele.net  Sat Jan 16 09:00:17 2016
From: eric.muller at efele.net (Eric Muller)
Date: Sat, 16 Jan 2016 07:00:17 -0800
Subject: The Chinese Typewriter: The Design and Science of East Asian
 Information Technology
Message-ID: <569A5B01.9070701@efele.net>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160116/ac8ba3c3/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: typewriter.jpg
Type: image/jpeg
Size: 28061 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160116/ac8ba3c3/attachment.jpg>

From joe at unicode.org  Sat Jan 16 12:30:51 2016
From: joe at unicode.org (Joe Becker)
Date: Sat, 16 Jan 2016 10:30:51 -0800
Subject: The Chinese Typewriter: The Design and Science of East Asian
 Information Technology
In-Reply-To: <569A5B01.9070701@efele.net>
References: <569A5B01.9070701@efele.net>
Message-ID: <569A8C5B.2080305@unicode.org>


See, for starters ...

     https://www.youtube.com/watch?v=tdT-oFxc-C0  -- A Chinese 
Typewriter in Silicon Valley, Thomas S. Mullaney, Google Tech Talk, 
December 5, 2011

     http://thechinesetypewriter.wordpress.com/

     http://tsmullaney.com/

Zhou


From c933103 at gmail.com  Wed Jan 20 00:07:37 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Wed, 20 Jan 2016 14:07:37 +0800
Subject: Is it possible to choose rotational direction of vertical script if I
 want to force them to display horizontally?
In-Reply-To: <CAGHjPPLeAx9t58eVaiNnyxwLNNrUgNgCAXpfau8ggXpttKTbKw@mail.gmail.com>
References: <CAGHjPPLeAx9t58eVaiNnyxwLNNrUgNgCAXpfau8ggXpttKTbKw@mail.gmail.com>
Message-ID: <CAGHjPPL_MM54W8g1SdK0POC8+6_y-DSki8xiJvkdYGHo3zuOHQ@mail.gmail.com>

For instance, traditional Mongolian script write in vertical-lr mode (text
run vertically from top to bottom, first line start on left), if you use
css writing mode horizontal-tb (default) then you can force it horizontal
by rotating each line of the text by 90 degree anticlockwise, and the
resultant text would be ltr. However, I just read on a Chinese webpage
http://www.zhihu.com/question/30727581 which claim there're a "traditional
way" of writing Mongolian horizontally by rotating it 90 degree clockwise
(despite I am not sure about what kind of tradition the webpage is
referring to nor do i know is it legit.), is this achievable within current
computer system like on webpage via css or in different word
processors/office suite software?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160120/16915308/attachment.html>

From everson at evertype.com  Wed Jan 20 08:49:43 2016
From: everson at evertype.com (Michael Everson)
Date: Wed, 20 Jan 2016 14:49:43 +0000
Subject: ISO 15924 updated
Message-ID: <9D4E073B-B424-4747-9FCD-6428155C97BD@evertype.com>

Two aliases have been added for the Jamo subset of Hangul, and for Han + Bopomofo.

See http://www.unicode.org/iso15924/codechanges.html

Michael Everson
Registrar

From quanxunzhen at gmail.com  Wed Jan 20 01:08:06 2016
From: quanxunzhen at gmail.com (Xidorn Quan)
Date: Wed, 20 Jan 2016 18:08:06 +1100
Subject: Is it possible to choose rotational direction of vertical script
 if I want to force them to display horizontally?
In-Reply-To: <CAGHjPPL_MM54W8g1SdK0POC8+6_y-DSki8xiJvkdYGHo3zuOHQ@mail.gmail.com>
References: <CAGHjPPLeAx9t58eVaiNnyxwLNNrUgNgCAXpfau8ggXpttKTbKw@mail.gmail.com>
 <CAGHjPPL_MM54W8g1SdK0POC8+6_y-DSki8xiJvkdYGHo3zuOHQ@mail.gmail.com>
Message-ID: <CAMdq699FEG00LKnm1mQwPxyiUzUgSvpd42-f1UuuS9yi_BxcdA@mail.gmail.com>

On Wed, Jan 20, 2016 at 5:07 PM, gfb hjjhjh <c933103 at gmail.com> wrote:
> However, I just read on a Chinese webpage
> http://www.zhihu.com/question/30727581 which claim there're a "traditional
> way" of writing Mongolian horizontally by rotating it 90 degree clockwise
> (despite I am not sure about what kind of tradition the webpage is referring
> to nor do i know is it legit.),

It doesn't seem to me the post claims anything. AFAICS, it is just a
question that, whether there exists any valid method to write
Mongolian horizontally.

> is this achievable within current computer
> system like on webpage via css or in different word processors/office suite
> software?

It seems to be a question for www-style at w3.org, instead of the Unicode
mailing list.

- Xidorn

From rwhlk142 at gmail.com  Fri Jan 22 16:56:42 2016
From: rwhlk142 at gmail.com (Robert Wheelock)
Date: Fri, 22 Jan 2016 17:56:42 -0500
Subject: Fwd: Drawing Souvenir-style letters for Hebrew script
In-Reply-To: <ae187f4e3e0ee35388205ac81a74d66d@opensiddur.org>
References: <ae187f4e3e0ee35388205ac81a74d66d@opensiddur.org>
Message-ID: <CAPKujtTLEoXuKXH6aA1OhAsDTF=Rqc4O0hpdX9L35QruqAtdeA@mail.gmail.com>

---------- Forwarded message ----------
From: Robert Lloyd Wheelock <RWhlk142 at gmail.com>
Date: Wed, Jan 20, 2016 at 5:30 AM
Subject: RE: Drawing Souvenir-style letters for Hebrew script
To: Robert Lloyd Wheelock <RWhlk142 at gmail.com>


From: Robert Lloyd Wheelock <RWhlk142 at gmail.com>
Subject: RE:  Drawing Souvenir-style letters for Hebrew script

Hello!

I tried my hands to design Hebrew letters using similarly-looking
Latin/Roman ones in the font *Souvenir*, a playful retro round serif
typeface.  Souvenir's somewhat bending strokes lend nicely to create
characters for Latin-Roman/Greek/Cyrillic, but present quite a challenge
when designing characters for Hebrew/Arabic/Syro-Aramaic/Devanagari... .

How would you employ softer, more rounded strokes to construct the 22
Hebrew letters with the 5 *sofith* (final forms for
kaf/mem/nun/pe?/?adheh), so that they would harmonize well into the
*Souvenir* font family?!?!

Designing the vowel points and cantillation signs would be much easier.

Shalom!  Thank You!

--
This mail was sent by Robert Lloyd Wheelock via The Open Siddur Project's
contact form http://opensiddur.org/contact/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160122/c654a086/attachment.html>

From as at signographie.de  Sat Jan 23 05:58:31 2016
From: as at signographie.de (=?iso-8859-1?Q?Andreas_St=F6tzner?=)
Date: Sat, 23 Jan 2016 12:58:31 +0100
Subject: Drawing Souvenir-style letters for Hebrew script
In-Reply-To: <CAPKujtTLEoXuKXH6aA1OhAsDTF=Rqc4O0hpdX9L35QruqAtdeA@mail.gmail.com>
References: <ae187f4e3e0ee35388205ac81a74d66d@opensiddur.org>
 <CAPKujtTLEoXuKXH6aA1OhAsDTF=Rqc4O0hpdX9L35QruqAtdeA@mail.gmail.com>
Message-ID: <29C476FD-048C-45FC-B77F-54D49667EFBD@signographie.de>


Am 22.01.2016 um 23:56 schrieb Robert Wheelock:

> I tried my hands to design Hebrew letters using similarly-looking Latin/Roman ones in the font *Souvenir*, a playful retro round serif typeface.  Souvenir's somewhat bending strokes lend nicely to create characters for Latin-Roman/Greek/Cyrillic, but present quite a challenge when designing characters for Hebrew/Arabic/Syro-Aramaic/Devanagari... .
> 
> How would you employ softer, more rounded strokes to construct the 22 Hebrew letters with the 5 *sofith* (final forms for kaf/mem/nun/pe?/?adheh), so that they would harmonize well into the *Souvenir* font family?!?!


this is surely an interesting matter but no one within the scope of the Unicode disc. list. You may wish to turn to the Typedrawers.com forum, where font design issues can get forwarded to a well-prepared audience.

With kind regards,
			Andreas St?tzner


_______________________________________________________________________________

Andreas St?tzner  Gestaltung Signographie Fontentwicklung
 
Haus des Buches 
Gerichtsweg 28, Raum 434
04103 Leipzig
0176-86823396


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160123/80a5a904/attachment.html>

From d3ck0r at gmail.com  Sat Jan 30 08:40:23 2016
From: d3ck0r at gmail.com (J Decker)
Date: Sat, 30 Jan 2016 06:40:23 -0800
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
Message-ID: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>

I do see that the code points D800-DFFF should not be encoded in any
UTF format (UTF8/32)...

UTF8 has a way to define any byte that might otherwise be used as an
encoding byte.

UTF16 has no way to define a code point that is D800-DFFF; this is an
issue if I want to apply some sort of encryption algorithm and still
have the result treated as text for transmission and encoding to other
string systems.

http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
private areas Area-A which is U-F0000:U-FFFFD and Area-B which is
U-100000:U-10FFFD which will suffice for a workaround for my
purposes....

For my purposes I will implement F0000-F0800 to be (code point minus
D800 and then add F0000 (or vice versa)) and then encoded as a
surrogate pair... it would have been super nice of unicode standards
included a way to specify code point even if there isn't a language
character assigned to that point.

http://unicode.org/faq/utf_bom.html
does say: "Q: Are there any 16-bit values that are invalid?

A: Unpaired surrogates are invalid in UTFs. These include any value in
the range D800 to DBFF not followed by a value in the range DC00 to
DFFF, or any value in the range DC00 to DFFF not preceded by a value
in the range D800 to DBFF
"

and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?

A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By represented such an unpaired
surrogate on its own as a 3-byte sequence, the resulting UTF-8 data
stream would become ill-formed. While it faithfully reflects the
nature of the input, Unicode conformance requires that encoding form
conversion always results in valid data stream. Therefore a converter
must treat this as an error. "


I did see these older messages... (not that they talk about this much
just more info)
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html

From doug at ewellic.org  Sat Jan 30 15:05:52 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 30 Jan 2016 14:05:52 -0700
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
Message-ID: <5E72B7B357714F38A8B16215EFB36319@DougEwell>

J Decker wrote:

> UTF16 has no way to define a code point that is D800-DFFF; this is an
> issue if I want to apply some sort of encryption algorithm and still
> have the result treated as text for transmission and encoding to other
> string systems.

Unpaired surrogates are not valid Unicode text. If you want to encrypt 
data into 16-bit code units and have them treated as valid Unicode text, 
the encryption algorithm must not generate unpaired surrogates.

This is not negotiable and not something you can be "partially" 
compliant on. See Unicode Conformance Requirement C1: "A process shall 
not interpret a high-surrogate code point or a low-surrogate code point 
as an abstract character."

There's a reason this is "C1" and not farther down the list. It is 
fundamental to Unicode.

> For my purposes I will implement F0000-F0800 to be (code point minus
> D800 and then add F0000 (or vice versa)) and then encoded as a
> surrogate pair...

This is fine for a private implementation where you are sure no input 
will contain these PUA code points. Keep in mind that some people do use 
them -- for example, they are assigned in the ConScript Unicode 
Registry, which is unofficial and not affiliated with Unicode.

> it would have been super nice of unicode standards
> included a way to specify code point even if there isn't a language
> character assigned to that point.

It's not a question of whether a code point is assigned to a "language 
character." There are hundreds of thousands of unassigned code points 
that can be represented in any UTF, such as this one: ??, U+77777. But 
unpaired surrogates can *never* be assigned to a character. If they 
could, they would have failed in their basic purpose of extending 
UTF-16.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????


From chris.jacobs at xs4all.nl  Sat Jan 30 15:29:35 2016
From: chris.jacobs at xs4all.nl (Chris Jacobs)
Date: Sat, 30 Jan 2016 22:29:35 +0100
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <5E72B7B357714F38A8B16215EFB36319@DougEwell>
References: <5E72B7B357714F38A8B16215EFB36319@DougEwell>
Message-ID: <be3cdc420ff817e6c28f27e22cb95c78@xs4all.nl>


Doug Ewell schreef op 2016-01-30 22:05:
> J Decker wrote:
> 
>> UTF16 has no way to define a code point that is D800-DFFF; this is an
>> issue if I want to apply some sort of encryption algorithm and still
>> have the result treated as text for transmission and encoding to other
>> string systems.

This is not an issue at all. You don't have to restrict the input to 
text to be able to generate an output that can be treated as text.

Just, as a last step, apply e.g. UUENCODE or Base64.
Look how PGP solves this.

Chris

From doug at ewellic.org  Sat Jan 30 15:46:39 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 30 Jan 2016 14:46:39 -0700
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <be3cdc420ff817e6c28f27e22cb95c78@xs4all.nl>
References: <5E72B7B357714F38A8B16215EFB36319@DougEwell>
 <be3cdc420ff817e6c28f27e22cb95c78@xs4all.nl>
Message-ID: <04EF7083723A40ACAADA05B703602249@DougEwell>

Chris Jacobs wrote:

>>> UTF16 has no way to define a code point that is D800-DFFF; this is
>>> an issue if I want to apply some sort of encryption algorithm and
>>> still have the result treated as text for transmission and encoding
>>> to other string systems.
>
> This is not an issue at all. You don't have to restrict the input to
> text to be able to generate an output that can be treated as text.

I gathered that J wanted to generate arbitrary output that could be 
interpreted as UTF-16 code units. I admit to being less than 100% sure 
of this.

Certainly there is no shortage of algorithms to map arbitrary byte input 
to text output, usually limited to some subset of ASCII. One interesting 
approach for the Unicode era was Markus Scherer's "Base16k" concept, at 
https://sites.google.com/site/markusicu/unicode/base16k .

--
Doug Ewell | http://ewellic.org | Thornton, CO ???? 


From Shawn.Steele at microsoft.com  Sat Jan 30 18:45:18 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sun, 31 Jan 2016 00:45:18 +0000
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
Message-ID: <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>

Why do you need illegal unicode code points?

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker
Sent: Saturday, January 30, 2016 6:40 AM
To: unicode at unicode.org
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers

I do see that the code points D800-DFFF should not be encoded in any UTF format (UTF8/32)...

UTF8 has a way to define any byte that might otherwise be used as an encoding byte.

UTF16 has no way to define a code point that is D800-DFFF; this is an issue if I want to apply some sort of encryption algorithm and still have the result treated as text for transmission and encoding to other string systems.

http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
private areas Area-A which is U-F0000:U-FFFFD and Area-B which is U-100000:U-10FFFD which will suffice for a workaround for my purposes....

For my purposes I will implement F0000-F0800 to be (code point minus
D800 and then add F0000 (or vice versa)) and then encoded as a surrogate pair... it would have been super nice of unicode standards included a way to specify code point even if there isn't a language character assigned to that point.

http://unicode.org/faq/utf_bom.html
does say: "Q: Are there any 16-bit values that are invalid?

A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF "

and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?

A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By represented such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error. "


I did see these older messages... (not that they talk about this much just more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html


From d3ck0r at gmail.com  Sat Jan 30 20:28:03 2016
From: d3ck0r at gmail.com (J Decker)
Date: Sat, 30 Jan 2016 18:28:03 -0800
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
Message-ID: <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>

On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
<Shawn.Steele at microsoft.com> wrote:
> Why do you need illegal unicode code points?

This originated from learning Javascript; which is internally UTF-16.
Playing with localStorage, some browsers use a sqlite3 database to
store values.  The database is UTF-8 so there must be a valid
conversion between the internal UTF-16 and UTF-8 localStorage (and
reverse).  I wanted to obfuscate the data stored for a certain
application; and cover all content that someone might send.  Having
slept on this, I realized that even if hieroglyphics were stored, if I
pulled out the character using codePointAt() and applied a 20 bit
random value to it using XOR it could end up as a normal character,
and I wouldn't know I had to use a 20 bit value... so every character
would have to use a 20 bit mask (which could end up with a value
that's D800-DFFF).

I've reconsidered and think for ease of implementation to just mask
every UTF-16 character (not  codepoint) with a 10 bit value, This will
result in no character changing from BMP space to surrogate-pair or
vice-versa.

Thanks for the feedback.
(sorry if I've used some terms inaccurately)

>
> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker
> Sent: Saturday, January 30, 2016 6:40 AM
> To: unicode at unicode.org
> Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
>
> I do see that the code points D800-DFFF should not be encoded in any UTF format (UTF8/32)...
>
> UTF8 has a way to define any byte that might otherwise be used as an encoding byte.
>
> UTF16 has no way to define a code point that is D800-DFFF; this is an issue if I want to apply some sort of encryption algorithm and still have the result treated as text for transmission and encoding to other string systems.
>
> http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
> private areas Area-A which is U-F0000:U-FFFFD and Area-B which is U-100000:U-10FFFD which will suffice for a workaround for my purposes....
>
> For my purposes I will implement F0000-F0800 to be (code point minus
> D800 and then add F0000 (or vice versa)) and then encoded as a surrogate pair... it would have been super nice of unicode standards included a way to specify code point even if there isn't a language character assigned to that point.
>
> http://unicode.org/faq/utf_bom.html
> does say: "Q: Are there any 16-bit values that are invalid?
>
> A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF "
>
> and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
>
> A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By represented such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error. "
>
>
>
> I did see these older messages... (not that they talk about this much just more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html


From prosfilaes at gmail.com  Sat Jan 30 21:20:14 2016
From: prosfilaes at gmail.com (David Starner)
Date: Sun, 31 Jan 2016 03:20:14 +0000
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
Message-ID: <CAMZ=zj58554qRUKv2NH-EON=ZA+fyeqJ0WRoU8nu3g3KPRWKNw@mail.gmail.com>

Obfuscate is right. It might conceivably be better than nothing, but at its
best it will stop someone for an hour or so. Why not run it through a
standard encryption protocol and if necessary use one of the options
mentioned before to turn it into valid text?

On Sat, Jan 30, 2016, 6:31 PM J Decker <d3ck0r at gmail.com> wrote:

> On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
> <Shawn.Steele at microsoft.com> wrote:
> > Why do you need illegal unicode code points?
>
> This originated from learning Javascript; which is internally UTF-16.
> Playing with localStorage, some browsers use a sqlite3 database to
> store values.  The database is UTF-8 so there must be a valid
> conversion between the internal UTF-16 and UTF-8 localStorage (and
> reverse).  I wanted to obfuscate the data stored for a certain
> application; and cover all content that someone might send.  Having
> slept on this, I realized that even if hieroglyphics were stored, if I
> pulled out the character using codePointAt() and applied a 20 bit
> random value to it using XOR it could end up as a normal character,
> and I wouldn't know I had to use a 20 bit value... so every character
> would have to use a 20 bit mask (which could end up with a value
> that's D800-DFFF).
>
> I've reconsidered and think for ease of implementation to just mask
> every UTF-16 character (not  codepoint) with a 10 bit value, This will
> result in no character changing from BMP space to surrogate-pair or
> vice-versa.
>
> Thanks for the feedback.
> (sorry if I've used some terms inaccurately)
>
> >
> > -----Original Message-----
> > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker
> > Sent: Saturday, January 30, 2016 6:40 AM
> > To: unicode at unicode.org
> > Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair
> specifiers
> >
> > I do see that the code points D800-DFFF should not be encoded in any UTF
> format (UTF8/32)...
> >
> > UTF8 has a way to define any byte that might otherwise be used as an
> encoding byte.
> >
> > UTF16 has no way to define a code point that is D800-DFFF; this is an
> issue if I want to apply some sort of encryption algorithm and still have
> the result treated as text for transmission and encoding to other string
> systems.
> >
> > http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
> > private areas Area-A which is U-F0000:U-FFFFD and Area-B which is
> U-100000:U-10FFFD which will suffice for a workaround for my purposes....
> >
> > For my purposes I will implement F0000-F0800 to be (code point minus
> > D800 and then add F0000 (or vice versa)) and then encoded as a surrogate
> pair... it would have been super nice of unicode standards included a way
> to specify code point even if there isn't a language character assigned to
> that point.
> >
> > http://unicode.org/faq/utf_bom.html
> > does say: "Q: Are there any 16-bit values that are invalid?
> >
> > A: Unpaired surrogates are invalid in UTFs. These include any value in
> the range D800 to DBFF not followed by a value in the range DC00 to DFFF,
> or any value in the range DC00 to DFFF not preceded by a value in the range
> D800 to DBFF "
> >
> > and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
> >
> > A different issue arises if an unpaired surrogate is encountered when
> converting ill-formed UTF-16 data. By represented such an unpaired
> surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream
> would become ill-formed. While it faithfully reflects the nature of the
> input, Unicode conformance requires that encoding form conversion always
> results in valid data stream. Therefore a converter must treat this as an
> error. "
> >
> >
> >
> > I did see these older messages... (not that they talk about this much
> just more info)
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160131/72a08694/attachment.html>

From Shawn.Steele at microsoft.com  Sun Jan 31 02:21:01 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sun, 31 Jan 2016 08:21:01 +0000
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <CAMZ=zj58554qRUKv2NH-EON=ZA+fyeqJ0WRoU8nu3g3KPRWKNw@mail.gmail.com>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
 <CAMZ=zj58554qRUKv2NH-EON=ZA+fyeqJ0WRoU8nu3g3KPRWKNw@mail.gmail.com>
Message-ID: <SN1PR0301MB204555730BB416DD91F15F0982DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>

Typically XOR?ing a constant isn?t really considered worth messing with.  It?s somewhat trivial to figure out the key to un-XOR.

On Sat, Jan 30, 2016, 6:31 PM J Decker <d3ck0r at gmail.com<mailto:d3ck0r at gmail.com>> wrote:
On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
<Shawn.Steele at microsoft.com<mailto:Shawn.Steele at microsoft.com>> wrote:
> Why do you need illegal unicode code points?

This originated from learning Javascript; which is internally UTF-16.
Playing with localStorage, some browsers use a sqlite3 database to
store values.  The database is UTF-8 so there must be a valid
conversion between the internal UTF-16 and UTF-8 localStorage (and
reverse).  I wanted to obfuscate the data stored for a certain
application; and cover all content that someone might send.  Having
slept on this, I realized that even if hieroglyphics were stored, if I
pulled out the character using codePointAt() and applied a 20 bit
random value to it using XOR it could end up as a normal character,
and I wouldn't know I had to use a 20 bit value... so every character
would have to use a 20 bit mask (which could end up with a value
that's D800-DFFF).

I've reconsidered and think for ease of implementation to just mask
every UTF-16 character (not  codepoint) with a 10 bit value, This will
result in no character changing from BMP space to surrogate-pair or
vice-versa.

Thanks for the feedback.
(sorry if I've used some terms inaccurately)

>
> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org<mailto:unicode-bounces at unicode.org>] On Behalf Of J Decker
> Sent: Saturday, January 30, 2016 6:40 AM
> To: unicode at unicode.org<mailto:unicode at unicode.org>
> Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
>
> I do see that the code points D800-DFFF should not be encoded in any UTF format (UTF8/32)...
>
> UTF8 has a way to define any byte that might otherwise be used as an encoding byte.
>
> UTF16 has no way to define a code point that is D800-DFFF; this is an issue if I want to apply some sort of encryption algorithm and still have the result treated as text for transmission and encoding to other string systems.
>
> http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
> private areas Area-A which is U-F0000:U-FFFFD and Area-B which is U-100000:U-10FFFD which will suffice for a workaround for my purposes....
>
> For my purposes I will implement F0000-F0800 to be (code point minus
> D800 and then add F0000 (or vice versa)) and then encoded as a surrogate pair... it would have been super nice of unicode standards included a way to specify code point even if there isn't a language character assigned to that point.
>
> http://unicode.org/faq/utf_bom.html
> does say: "Q: Are there any 16-bit values that are invalid?
>
> A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF "
>
> and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
>
> A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By represented such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in valid data stream. Therefore a converter must treat this as an error. "
>
>
>
> I did see these older messages... (not that they talk about this much just more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160131/3a79fb37/attachment.html>

From d3ck0r at gmail.com  Sun Jan 31 03:27:19 2016
From: d3ck0r at gmail.com (J Decker)
Date: Sun, 31 Jan 2016 01:27:19 -0800
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <SN1PR0301MB204555730BB416DD91F15F0982DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
 <CAMZ=zj58554qRUKv2NH-EON=ZA+fyeqJ0WRoU8nu3g3KPRWKNw@mail.gmail.com>
 <SN1PR0301MB204555730BB416DD91F15F0982DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
Message-ID: <CAA2GJqX1rox1wF=C5TY_=zpC3fPSU_FFGagdfk2hhSQLgcOoPw@mail.gmail.com>

On Sun, Jan 31, 2016 at 12:21 AM, Shawn Steele
<Shawn.Steele at microsoft.com> wrote:
> Typically XOR?ing a constant isn?t really considered worth messing with.
> It?s somewhat trivial to figure out the key to un-XOR.
>
obviously.  It's not constant, nor is it stored anywhere in the code or data.
>
>
> On Sat, Jan 30, 2016, 6:31 PM J Decker <d3ck0r at gmail.com> wrote:
>
> On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
> <Shawn.Steele at microsoft.com> wrote:
>> Why do you need illegal unicode code points?
>
> This originated from learning Javascript; which is internally UTF-16.
> Playing with localStorage, some browsers use a sqlite3 database to
> store values.  The database is UTF-8 so there must be a valid
> conversion between the internal UTF-16 and UTF-8 localStorage (and
> reverse).  I wanted to obfuscate the data stored for a certain
> application; and cover all content that someone might send.  Having
> slept on this, I realized that even if hieroglyphics were stored, if I
> pulled out the character using codePointAt() and applied a 20 bit
> random value to it using XOR it could end up as a normal character,
> and I wouldn't know I had to use a 20 bit value... so every character
> would have to use a 20 bit mask (which could end up with a value
> that's D800-DFFF).
>
> I've reconsidered and think for ease of implementation to just mask
> every UTF-16 character (not  codepoint) with a 10 bit value, This will
> result in no character changing from BMP space to surrogate-pair or
> vice-versa.
>
> Thanks for the feedback.
> (sorry if I've used some terms inaccurately)
>
>>
>> -----Original Message-----
>> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker
>> Sent: Saturday, January 30, 2016 6:40 AM
>> To: unicode at unicode.org
>> Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
>>
>> I do see that the code points D800-DFFF should not be encoded in any UTF
>> format (UTF8/32)...
>>
>> UTF8 has a way to define any byte that might otherwise be used as an
>> encoding byte.
>>
>> UTF16 has no way to define a code point that is D800-DFFF; this is an
>> issue if I want to apply some sort of encryption algorithm and still have
>> the result treated as text for transmission and encoding to other string
>> systems.
>>
>> http://www.azillionmonkeys.com/qed/unicode.html   lists Unicode
>> private areas Area-A which is U-F0000:U-FFFFD and Area-B which is
>> U-100000:U-10FFFD which will suffice for a workaround for my purposes....
>>
>> For my purposes I will implement F0000-F0800 to be (code point minus
>> D800 and then add F0000 (or vice versa)) and then encoded as a surrogate
>> pair... it would have been super nice of unicode standards included a way to
>> specify code point even if there isn't a language character assigned to that
>> point.
>>
>> http://unicode.org/faq/utf_bom.html
>> does say: "Q: Are there any 16-bit values that are invalid?
>>
>> A: Unpaired surrogates are invalid in UTFs. These include any value in the
>> range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any
>> value in the range DC00 to DFFF not preceded by a value in the range D800 to
>> DBFF "
>>
>> and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
>>
>> A different issue arises if an unpaired surrogate is encountered when
>> converting ill-formed UTF-16 data. By represented such an unpaired surrogate
>> on its own as a 3-byte sequence, the resulting UTF-8 data stream would
>> become ill-formed. While it faithfully reflects the nature of the input,
>> Unicode conformance requires that encoding form conversion always results in
>> valid data stream. Therefore a converter must treat this as an error. "
>>
>>
>>
>> I did see these older messages... (not that they talk about this much just
>> more info) http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
>> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
>> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
>> http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html


From chris.jacobs at xs4all.nl  Sun Jan 31 10:31:45 2016
From: chris.jacobs at xs4all.nl (Chris Jacobs)
Date: Sun, 31 Jan 2016 17:31:45 +0100
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
Message-ID: <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl>


J Decker schreef op 2016-01-31 03:28:
> I've reconsidered and think for ease of implementation to just mask
> every UTF-16 character (not  codepoint) with a 10 bit value, This will
> result in no character changing from BMP space to surrogate-pair or
> vice-versa.
> 
> Thanks for the feedback.

So you are still trying to handle the unarmed output as plaintext.
Do you realize that if a string in the output is replaced by a canonical 
equivalent
one this may mess up things because the originals are not canonical 
equivalent?


From d3ck0r at gmail.com  Sun Jan 31 11:56:11 2016
From: d3ck0r at gmail.com (J Decker)
Date: Sun, 31 Jan 2016 09:56:11 -0800
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
 <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl>
Message-ID: <CAA2GJqW=pNjF-MMaPEPHKbcgBXVwdsvHwFJ_bdcK14HP=yjBkA@mail.gmail.com>

On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs <chris.jacobs at xs4all.nl> wrote:
>
>
> J Decker schreef op 2016-01-31 03:28:
>>
>> I've reconsidered and think for ease of implementation to just mask
>> every UTF-16 character (not  codepoint) with a 10 bit value, This will
>> result in no character changing from BMP space to surrogate-pair or
>> vice-versa.
>>
>> Thanks for the feedback.
>
>
> So you are still trying to handle the unarmed output as plaintext.
> Do you realize that if a string in the output is replaced by a canonical
> equivalent
> one this may mess up things because the originals are not canonical
> equivalent?
>
I see ... things like mentioned here
http://websec.github.io/unicode-security-guide/character-transformations/

From chris.jacobs at xs4all.nl  Sun Jan 31 12:07:57 2016
From: chris.jacobs at xs4all.nl (Chris Jacobs)
Date: Sun, 31 Jan 2016 19:07:57 +0100
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <CAA2GJqW=pNjF-MMaPEPHKbcgBXVwdsvHwFJ_bdcK14HP=yjBkA@mail.gmail.com>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
 <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl>
 <CAA2GJqW=pNjF-MMaPEPHKbcgBXVwdsvHwFJ_bdcK14HP=yjBkA@mail.gmail.com>
Message-ID: <41a4f348541db2b8a0ad5b191c3a1c64@xs4all.nl>


J Decker schreef op 2016-01-31 18:56:
> On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs <chris.jacobs at xs4all.nl> 
> wrote:
>> 
>> 
>> J Decker schreef op 2016-01-31 03:28:
>>> 
>>> I've reconsidered and think for ease of implementation to just mask
>>> every UTF-16 character (not  codepoint) with a 10 bit value, This 
>>> will
>>> result in no character changing from BMP space to surrogate-pair or
>>> vice-versa.
>>> 
>>> Thanks for the feedback.
>> 
>> 
>> So you are still trying to handle the unarmed output as plaintext.
>> Do you realize that if a string in the output is replaced by a 
>> canonical
>> equivalent
>> one this may mess up things because the originals are not canonical
>> equivalent?
>> 
> I see ... things like mentioned here
> http://websec.github.io/unicode-security-guide/character-transformations/

Yes especially the part about normalization.
This would not only spoil the normalized string, but also, as the string 
can have a different length,
for anything after that your ever-changing xor-values may go out of 
sync.


From Shawn.Steele at microsoft.com  Sun Jan 31 13:52:32 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Sun, 31 Jan 2016 19:52:32 +0000
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <41a4f348541db2b8a0ad5b191c3a1c64@xs4all.nl>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
 <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl>
 <CAA2GJqW=pNjF-MMaPEPHKbcgBXVwdsvHwFJ_bdcK14HP=yjBkA@mail.gmail.com>
 <41a4f348541db2b8a0ad5b191c3a1c64@xs4all.nl>
Message-ID: <SN1PR0301MB2045F317D3E86219D5DEDC4A82DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>

It should be understood that any algorithm that changes the Unicode character data to non-character data is therefore binary, and not Unicode.  It's inappropriate to shove binary data into unicode streams because stuff will break.
https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Chris Jacobs
Sent: Sunday, January 31, 2016 10:08 AM
To: J Decker <d3ck0r at gmail.com>
Cc: unicode at unicode.org
Subject: Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers


J Decker schreef op 2016-01-31 18:56:
> On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs <chris.jacobs at xs4all.nl>
> wrote:
>> 
>> 
>> J Decker schreef op 2016-01-31 03:28:
>>> 
>>> I've reconsidered and think for ease of implementation to just mask 
>>> every UTF-16 character (not  codepoint) with a 10 bit value, This 
>>> will result in no character changing from BMP space to 
>>> surrogate-pair or vice-versa.
>>> 
>>> Thanks for the feedback.
>> 
>> 
>> So you are still trying to handle the unarmed output as plaintext.
>> Do you realize that if a string in the output is replaced by a 
>> canonical equivalent one this may mess up things because the 
>> originals are not canonical equivalent?
>> 
> I see ... things like mentioned here
> http://websec.github.io/unicode-security-guide/character-transformatio
> ns/

Yes especially the part about normalization.
This would not only spoil the normalized string, but also, as the string can have a different length, for anything after that your ever-changing xor-values may go out of sync.


From verdy_p at wanadoo.fr  Sun Jan 31 15:49:26 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 31 Jan 2016 22:49:26 +0100
Subject: Encoding/Use of pontial unpaired UTF-16 surrogate pair specifiers
In-Reply-To: <SN1PR0301MB2045F317D3E86219D5DEDC4A82DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
References: <CAA2GJqW4YwFX3VbkWR_8OQ8oMwZiHSYE2+eaFuVR_9MZ_Tw64g@mail.gmail.com>
 <SN1PR0301MB2045F4F5D30140B4165F62B582DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
 <CAA2GJqWJhwmXKD6KTimRCbGpOvD_oo3Vava-FQFce01YvUomwg@mail.gmail.com>
 <645d8c72efe9658e2e6580f95600fdcb@xs4all.nl>
 <CAA2GJqW=pNjF-MMaPEPHKbcgBXVwdsvHwFJ_bdcK14HP=yjBkA@mail.gmail.com>
 <41a4f348541db2b8a0ad5b191c3a1c64@xs4all.nl>
 <SN1PR0301MB2045F317D3E86219D5DEDC4A82DD0@SN1PR0301MB2045.namprd03.prod.outlook.com>
Message-ID: <CAGa7JC0N2_SVkNrHynNDwRsQs3Lw9r6mTjku4obOsw6v7H-TsA@mail.gmail.com>

I also agree.

To transport binary data over a plain-text format there are other common
types, including Base64, Quoted-Printable (and you can also compress the
binary data before this transformation, using Gzip, deflate... for example
in MIME for emails; or compress it after this transformation only over the
transport channel like in HTTP which natively supports transparent 8-bit
streams, this solution being generally more performant).

There's no reliable way to preserve the exact binary encoding of texts
using invalid UTF sequences (including unpaired surrogates in UTF-16, or
isolated surrogate code points and other non-characters in other UTFs, or
forbidden byte values or restricted byte sequence in UTF-8) without using a
binary envelope (which cannot preserve the same encoding of valid UTF
sequences).

Even by using another encoding scheme/encoding form or legacy charset
mapped with Unicode (including GB and HKCS charsets), you will fail each
time due to the canonical equivalences and the existing conforming
conversions between all UTFs which are made to preserve the identity of
characters, not the equality of their binary encodings.

In summary, what you need is:
- a transport-syntax (see HTTP for example) to allow decoding your
envelope, and
- a separate media-type (see HTTP and MIME for example, don't choose any
one in "text/*", but in "binary/*" or possibly "application/*") or some
filesystem convention or standards for file types (such as file name
extensions in common Unix/Linux filesystems or FTP, or external metadata
streams for file attributes such as in MacOS, or VMS, or even in NTFS and
almost all HTTP-based filesystems) for your chosen binary encoding
encapsulated in a text-compatible format.

If your encoded document does not match exactly the strict text encoding
conformances, it cannot be declared and handled at all as if it was valid
text. You have to handle it as an opaque BLOB (as if they were data for a
bitmap image or executable code, or a PKI encryption key, or a data
signature such as SHA or an encrypted stream such as DES).

Basic filesystems for Unix/Linux or FAT treat all their files as
unrestricted blobs (that's why they use a separate data to represent its
actual type to decode it with specific algorithms, the most common being
filename extensions to determine the envelope format, then using internal
data structures in this envelope such as MPEG, OGG, or XML with schemas
validation, or ZIP archives embedding mutiple structured streams with some
conventions)

All these options are out of scope of the Unicode standard which is not
made to transport and preserve the binary encodings, but is made purposely
to allow transparent conversions between all conforming UTFs of valid text
only (nothing else) and to support canonical equivalences as much as
possible in "Unicode-conforming process", so that they'll be able to choose
between these wellknown and standardized text representations.

2016-01-31 20:52 GMT+01:00 Shawn Steele <Shawn.Steele at microsoft.com>:

> It should be understood that any algorithm that changes the Unicode
> character data to non-character data is therefore binary, and not Unicode.
> It's inappropriate to shove binary data into unicode streams because stuff
> will break.
>
> https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/
>
>
> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Chris
> Jacobs
> Sent: Sunday, January 31, 2016 10:08 AM
> To: J Decker <d3ck0r at gmail.com>
> Cc: unicode at unicode.org
> Subject: Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair
> specifiers
>
>
>
> J Decker schreef op 2016-01-31 18:56:
> > On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs <chris.jacobs at xs4all.nl>
> > wrote:
> >>
> >>
> >> J Decker schreef op 2016-01-31 03:28:
> >>>
> >>> I've reconsidered and think for ease of implementation to just mask
> >>> every UTF-16 character (not  codepoint) with a 10 bit value, This
> >>> will result in no character changing from BMP space to
> >>> surrogate-pair or vice-versa.
> >>>
> >>> Thanks for the feedback.
> >>
> >>
> >> So you are still trying to handle the unarmed output as plaintext.
> >> Do you realize that if a string in the output is replaced by a
> >> canonical equivalent one this may mess up things because the
> >> originals are not canonical equivalent?
> >>
> > I see ... things like mentioned here
> > http://websec.github.io/unicode-security-guide/character-transformatio
> > ns/
>
> Yes especially the part about normalization.
> This would not only spoil the normalized string, but also, as the string
> can have a different length, for anything after that your ever-changing
> xor-values may go out of sync.
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160131/2e999d2f/attachment.html>