From steffen at sdaoden.eu  Wed Aug  5 09:40:25 2020
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Wed, 05 Aug 2020 16:40:25 +0200
Subject: [Gossip] unicode@unicode.org no longer archived ..
In-Reply-To: <20200725174622.QD5fB%steffen@sdaoden.eu>
References: <20200720152252.KyZcL%steffen@sdaoden.eu>
 <CAHjiUbqFmzAhdABSYx_P_W7YPMQ2XW4iST_5T04XzF5W_2VBDw@mail.gmail.com>
 <20200725174622.QD5fB%steffen@sdaoden.eu>
Message-ID: <20200805144025.S48Uc%steffen@sdaoden.eu>

Hello Jeff.

Steffen Nurpmeso wrote in
 <20200725174622.QD5fB%steffen at sdaoden.eu>:
 |Jeff Breidenbach wrote in
 |<CAHjiUbqFmzAhdABSYx_P_W7YPMQ2XW4iST_5T04XzF5W_2VBDw at mail.gmail.com>:
 ||I checked and for whatever reason, they simply aren't sending email to
 ||archive at mail-archive.com. No idea why. If you can help make that happen,
 ||archiving will work again.
 |
 |I will forward it.
 |
 ||While we are talking, I wants folks to know that the Mail Archive \
 ||itself is
 ||running fine (knock on wood). But I can't keep up with the support \
 ||requests
 ||though, so sorry to everyone affected by that.
 |
 |It's a pity i do not have more money to spend also for "hobby"
 |projects like MailArchive.
 |Thank You!

I am sorry, i did not get any response from Unicode officials, it
seems the industry backing Unicode is no longer interested, and
they left for web forums etc., leaving behind rag rugs of the
past. 

Sorry for the noise, and thanks again for the MailArchive!!

Ciao from Germany,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From doug at ewellic.org  Wed Aug  5 10:25:44 2020
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 5 Aug 2020 09:25:44 -0600
Subject: [Gossip] unicode@unicode.org no longer archived ..
In-Reply-To: <20200805144025.S48Uc%steffen@sdaoden.eu>
References: <20200720152252.KyZcL%steffen@sdaoden.eu>
 <CAHjiUbqFmzAhdABSYx_P_W7YPMQ2XW4iST_5T04XzF5W_2VBDw@mail.gmail.com>
 <20200725174622.QD5fB%steffen@sdaoden.eu>
 <20200805144025.S48Uc%steffen@sdaoden.eu>
Message-ID: <000001d66b3c$b0862740$119275c0$@ewellic.org>

Steffen Nurpmeso wrote:

> I am sorry, i did not get any response from Unicode officials, it
> seems the industry backing Unicode is no longer interested, and they
> left for web forums etc., leaving behind rag rugs of the past.

Sometimes the magic happens without a lot of fanfare.

Between these three links:

1) https://corp.unicode.org/pipermail/unicode/
2) https://www.unicode.org/mail-arch/unicode-ml/
3) https://www.unicode.org/mail-arch/unicode-ml/Archives-Old/

you should now have login-free access to web archives for the entire history of the Unicode public mailing list.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From naa.ganesan at gmail.com  Sun Aug  9 13:30:44 2020
From: naa.ganesan at gmail.com (N. Ganesan)
Date: Sun, 9 Aug 2020 13:30:44 -0500
Subject: Tamil Brahmi Virama at U+11070
Message-ID: <CAA+QEUc2-ehe6TsdDHpfd7FU2yP-jB1rx-HBXhmtx9xiB-E+xg@mail.gmail.com>

I read in this month's UTC meeting minutes,
*>Consensus:* Accept U+11070 BRAHMI SIGN OLD TAMIL VIRAMA for encoding
>in a future version of the standard.

We thank UTC for this decision. It will be far easier to know which is
Tamil Brahmi Virama (from other diacritics) when a plain-text message is
posted in social media etc., in the future.

Just last week, a Shiva Linga with the words, "ekan aatan kOTTam" in Tamil
Brahmi inscription has been found in a Shiva temple at Kinnimangalam, near
Madurai, Tamil Nadu. Because of the paleography with a fully developed
PuLLi system (5 puLLis!), this Lingam ( https://en.wikipedia.org/wiki/Lingam )
can be dated to being ~1800 years old. There are 3 Lingas with Tamil Brahmi
inscribed on them at (1) Netrambakkam near Madras (2) Kinnimangalam near
Madurai and (3) Inuvil in Sri Lanka island and they form the source for all
later PaLLippadai memorial temples in Pallava period and in South East
Asia, called as Devaraja cult temples such as in Cambodia.

http://nganesan.blogspot.com/2020/07/ekamukha-linga-with-tamil-brahmi.html
http://nganesan.blogspot.com/2020/07/kinnimangalam-linga-brahmi-pulli.html

N. Ganesan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20200809/17976d5d/attachment.htm>

From jameskasskrv at gmail.com  Fri Aug 14 18:23:30 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Fri, 14 Aug 2020 23:23:30 +0000
Subject: [off topic] Code2003 is a rip-off
Message-ID: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>


There's a font called Code2003 which is available for download on 
various web sites.? Most of its glyphs were stolen from my fonts 
Code2000 and Code2001.? Several ranges included in the font which were 
not covered by my fonts were likely stolen from elsewhere.? For example, 
for the range "Miscellaneous Symbols and Pictographs", its "developer" 
simply stole the glyphs from the Unicode chart for that range as found 
on the Unicode web site.? (Although some of those glyphs were modified 
by mirroring or slight rotation.? Please see attached graphic.)

The extended font information contained within Code2003 lists me as its 
developer and contains broken links to my old web site and e-mail 
address.? I am not affiliated with "St. Gigafont", I do not steal glyphs 
from the Unicode web site charts, and Code2003 is being distributed 
without my permission or authorization.

Some download web sites request donations.? Any donations are going to 
"St. Gigafont", not to me.

This e-mail is a "heads-up" both to other font developers whose work may 
have been stolen and to The Unicode Consortium itself because the PDF 
charts are copyrighted and may use copyrighted fonts. Please forward 
this e-mail to interested parties.

Best regards,

James Kass

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20200814_5_Capture.jpg
Type: image/jpeg
Size: 58025 bytes
Desc: not available
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20200814/b41e44c6/attachment-0001.jpg>

From richard.wordingham at ntlworld.com  Sat Aug 15 06:33:29 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 15 Aug 2020 12:33:29 +0100
Subject: Code2003 is a rip-off
In-Reply-To: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
Message-ID: <20200815123329.2b1c6f38@JRWUBU2>

On Fri, 14 Aug 2020 23:23:30 +0000
James Kass via Unicode <unicode at unicode.org> wrote:

> There's a font called Code2003 which is available for download on 
> various web sites.? Most of its glyphs were stolen from my fonts 
> Code2000 and Code2001.? Several ranges included in the font which
> were not covered by my fonts were likely stolen from elsewhere.? For
> example, for the range "Miscellaneous Symbols and Pictographs", its
> "developer" simply stole the glyphs from the Unicode chart for that
> range as found on the Unicode web site.? (Although some of those
> glyphs were modified by mirroring or slight rotation.? Please see
> attached graphic.)

Have you read the legal defence at
https://digiex.net/threads/hello-to-all-i-wish-to-introduce-myself.15144/ ?
I think the editor's arguments are wrong, but I think 'rip-off' is too
strong a word.

He may have made what would be legal use of other fonts if they were
themselves legal.

> The extended font information contained within Code2003 lists me as
> its developer and contains broken links to my old web site and e-mail 
> address.? I am not affiliated with "St. Gigafont", I do not steal
> glyphs from the Unicode web site charts, and Code2003 is being
> distributed without my permission or authorization.

The allocation of plaudits and brickbats is a tricky task.  Where do we
stand on getting Code2000 licenses?  Have you made representations to
St. Gigafont about his implicit accusation of copying?

Artistically, you could be aggrieved by the removal of shaping -
Devanagari shaping is completely gone.

> Some download web sites request donations.? Any donations are going
> to "St. Gigafont", not to me.

I hope that at least they are being channelled to St. Gigafont.  I
found a copy of the font that said, in its name table, both that it
was licensed under the SIL Open Font Licence, and that it was
shareware, to be licensed from you for US$5.  Does my licence from you
cover me for Code2003 so far as your rights are concerned? 

Have you managed to contact "St. Gigafont"?  It's conceivable that some
of the donations have been set aside for you.  It would seem that "St.
Gigafont" has been hosting Code2000 and Code2001. You might even
recover an income trickle.

> This e-mail is a "heads-up" both to other font developers whose work
> may have been stolen and to The Unicode Consortium itself because the
> PDF charts are copyrighted and may use copyrighted fonts.

I once found that one of my fonts released under the SIL Open font
Licence was being redistributed under the same name but with
modifications and no hint of them in the name table.  I have wondered
whether that constituted a donation of the changes to me.

Bringing the matter more clearly into the scope of this list, is the
original goal of Code2000 still achievable?  Is it achievable without
horrendous artistic compromises?  I was recently horrified by how many
ligatures are needed just to write Pali in the Sinhala script, let
alone Sanskrit.

Richard.


From jk at koremail.com  Sat Aug 15 08:39:10 2020
From: jk at koremail.com (jk at koremail.com)
Date: Sat, 15 Aug 2020 21:39:10 +0800
Subject: Code2003 is a rip-off
In-Reply-To: <20200815123329.2b1c6f38@JRWUBU2>
References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
 <20200815123329.2b1c6f38@JRWUBU2>
Message-ID: <20195fd5a2c9e90ebeac2f92e7a0e2f3@koremail.com>

On 2020-08-15 19:33, Richard Wordingham via Unicode wrote:
> On Fri, 14 Aug 2020 23:23:30 +0000
> James Kass via Unicode <unicode at unicode.org> wrote:
> 
>> There's a font called Code2003 which is available for download on
>> various web sites.? Most of its glyphs were stolen from my fonts
>> Code2000 and Code2001.? Several ranges included in the font which
>> were not covered by my fonts were likely stolen from elsewhere.? For
>> example, for the range "Miscellaneous Symbols and Pictographs", its
>> "developer" simply stole the glyphs from the Unicode chart for that
>> range as found on the Unicode web site.? (Although some of those
>> glyphs were modified by mirroring or slight rotation.? Please see
>> attached graphic.)
> 
> Have you read the legal defence at
> https://digiex.net/threads/hello-to-all-i-wish-to-introduce-myself.15144/ 
> ?
> I think the editor's arguments are wrong, but I think 'rip-off' is too
> strong a word.
> 

There are many stronger words that one could use, James has be very 
restrained here considering the thousands of hours invested. It is far 
to common that people ignore the licenses of software. Many of us can 
recount similar events. Such behaviour is very disheartening for 
independent developers.

John Knightley

From richard.wordingham at ntlworld.com  Sat Aug 15 10:13:09 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 15 Aug 2020 16:13:09 +0100
Subject: Code2003 is a rip-off
In-Reply-To: <20195fd5a2c9e90ebeac2f92e7a0e2f3@koremail.com>
References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
 <20200815123329.2b1c6f38@JRWUBU2>
 <20195fd5a2c9e90ebeac2f92e7a0e2f3@koremail.com>
Message-ID: <20200815161309.6dc2b9ed@JRWUBU2>

On Sat, 15 Aug 2020 21:39:10 +0800
John Knightley via Unicode <unicode at unicode.org> wrote:

> On 2020-08-15 19:33, Richard Wordingham via Unicode wrote:
> > On Fri, 14 Aug 2020 23:23:30 +0000
> > James Kass via Unicode <unicode at unicode.org> wrote:
> >   
> >> There's a font called Code2003 which is available for download on
> >> various web sites.? Most of its glyphs were stolen from my fonts
> >> Code2000 and Code2001.? Several ranges included in the font which
> >> were not covered by my fonts were likely stolen from elsewhere.
> >> For example, for the range "Miscellaneous Symbols and
> >> Pictographs", its "developer" simply stole the glyphs from the
> >> Unicode chart for that range as found on the Unicode web site.
> >> (Although some of those glyphs were modified by mirroring or
> >> slight rotation.? Please see attached graphic.)  
> > 
> > Have you read the legal defence at
> > https://digiex.net/threads/hello-to-all-i-wish-to-introduce-myself.15144/ 
> > ?
> > I think the editor's arguments are wrong, but I think 'rip-off' is
> > too strong a word.
> >   
> 
> There are many stronger words that one could use, James has be very 
> restrained here considering the thousands of hours invested. It is
> far to common that people ignore the licenses of software. Many of us
> can recount similar events. Such behaviour is very disheartening for 
> independent developers.

The first point here is that James is not being robbed of income.
There does not seem to be any way for the general public to licence the
font(s) from him.

Now, James is still being given credit for the glyphs.  That doesn't
seem to be true for other glyph creators, e.g. Alif Silpachai for the
Tai Tham glyphs.  Alif's font is free as in free beer, not as in free
speech.

There is a potential loss of reputation to James as a font can be a lot
more than just glyphs.  The shaping has mostly gone missing, and he may
be criticised for the various consequent shortcomings, whereas there was
a time when Code2000 was the best Devanagari font I had available on
my machine.

The final issue is that he has been robbed of artistic and technical
control.  That is indeed an issue if it was not James Kass who put the
fonts on SourceForge.  Now, if the font had been released under the SIL
Open Font Licence, he would also have lost control.  One may see the
name 'Code2003' as impertinent, but at least the font is not being
paraded as Code2000, Code 2001 or Code2002.

Richard.


From jameskasskrv at gmail.com  Sat Aug 15 17:57:29 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 15 Aug 2020 22:57:29 +0000
Subject: Code2003 is a rip-off
In-Reply-To: <20200815161309.6dc2b9ed@JRWUBU2>
References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
 <20200815123329.2b1c6f38@JRWUBU2>
 <20195fd5a2c9e90ebeac2f92e7a0e2f3@koremail.com>
 <20200815161309.6dc2b9ed@JRWUBU2>
Message-ID: <9c1603f4-0de1-b18c-cee9-4480bc53507c@gmail.com>


On 2020-08-15 3:13 PM, Richard Wordingham via Unicode wrote:
> The first point here is that James is not being robbed of income.
> There does not seem to be any way for the general public to licence the
> font(s) from him.
I'm putting my old web site back on line.? Code2001 will remain freeware.

> Now, James is still being given credit for the glyphs.  That doesn't
> seem to be true for other glyph creators, e.g. Alif Silpachai for the
> Tai Tham glyphs.  Alif's font is free as in free beer, not as in free
> speech.
I'm being given credit for stealing Alif Silpachai's (among others') 
work along with that credit of designing those glyphs which I actually 
designed.? It's not flattering to be given credit for being a thief if 
I'm not one.

>
> There is a potential loss of reputation to James as a font can be a lot
> more than just glyphs.  The shaping has mostly gone missing, and he may
> be criticised for the various consequent shortcomings, whereas there was
> a time when Code2000 was the best Devanagari font I had available on
> my machine.
Thank you!? I hope you will also like the Grantha in the new version of 
Code2001.

> The final issue is that he has been robbed of artistic and technical
> control.  That is indeed an issue if it was not James Kass who put the
> fonts on SourceForge.  Now, if the font had been released under the SIL
> Open Font Licence, he would also have lost control.  One may see the
> name 'Code2003' as impertinent, but at least the font is not being
> paraded as Code2000, Code 2001 or Code2002.
>
It's the last digit in the font name which signifies:
Code2000 is for plane 0
Code2001 is for plane 1
Code2002 is for plane 2
Code2003 should have been for plane 3

Best regards,

James Kass


From jameskasskrv at gmail.com  Sat Aug 15 18:24:44 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 15 Aug 2020 23:24:44 +0000
Subject: Code2003 is a rip-off
In-Reply-To: <20200815123329.2b1c6f38@JRWUBU2>
References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
 <20200815123329.2b1c6f38@JRWUBU2>
Message-ID: <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com>


On 2020-08-15 11:33 AM, Richard Wordingham via Unicode wrote:
> Have you read the legal defence at
> https://digiex.net/threads/hello-to-all-i-wish-to-introduce-myself.15144/ ?
> I think the editor's arguments are wrong, but I think 'rip-off' is too
> strong a word.
No, I hadn't seen this and need to read it carefully.? A quick Google 
search for St. Gigafont and so forth failed to get me any contact 
information.? Thank you for the link.?? (Heh, heh, since I was "royally 
miffed" at the time, I may have searched for "St. Gigabyte", or something.)
>
> Artistically, you could be aggrieved by the removal of shaping -
> Devanagari shaping is completely gone.
Probably because of the limitation on the number of glyphs possible in a 
font.? Once you start adding stuff beyond Plane Zero, you run out of room.

>> Some download web sites request donations.? Any donations are going
>> to "St. Gigafont", not to me.
> I hope that at least they are being channelled to St. Gigafont.  I
> found a copy of the font that said, in its name table, both that it
> was licensed under the SIL Open Font Licence, and that it was
> shareware, to be licensed from you for US$5.  Does my licence from you
> cover me for Code2003 so far as your rights are concerned?
That question should probably be directed to the developer of Code2003.

>
> Have you managed to contact "St. Gigafont"?  It's conceivable that some
> of the donations have been set aside for you.  It would seem that "St.
> Gigafont" has been hosting Code2000 and Code2001. You might even
> recover an income trickle.
It's not about the money.? Based on the above link and additional 
information, I will certainly attempt to make the acquaintance of the 
Code2003 "developer".
>
> Bringing the matter more clearly into the scope of this list, is the
> original goal of Code2000 still achievable?  Is it achievable without
> horrendous artistic compromises?  I was recently horrified by how many
> ligatures are needed just to write Pali in the Sinhala script, let
> alone Sanskrit.
>
I believe it is possible.? Not in a single font, of course, unless the 
specs change.? But with the font family / collection I think it can be 
done.? Wish I'd started on this fifty years ago instead of 22 years ago, 
though.? But I didn't even have a computer in 1970.

Best regards,

James Kass


From richard.wordingham at ntlworld.com  Sat Aug 15 20:11:01 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 16 Aug 2020 02:11:01 +0100
Subject: Code2003 is a rip-off
In-Reply-To: <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com>
References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
 <20200815123329.2b1c6f38@JRWUBU2>
 <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com>
Message-ID: <20200816021101.5d79eb3f@JRWUBU2>

On Sat, 15 Aug 2020 23:24:44 +0000
James Kass via Unicode <unicode at unicode.org> wrote:

> On 2020-08-15 11:33 AM, Richard Wordingham via Unicode wrote:

>> Does my licence from
>> you cover me for Code2003 so far as your rights are concerned?  

> That question should probably be directed to the developer of
> Code2003.

As to his IPR, the SIL Open Font Licence applies.

> Based on the above link and additional 
> information, I will certainly attempt to make the acquaintance of the 
> Code2003 "developer".

I have confirmed that Code2003's glyphs are unlicensed.  St. Gigafont
clearly doesn't understand the different types of 'free'.

> > Bringing the matter more clearly into the scope of this list, is the
> > original goal of Code2000 still achievable?  Is it achievable
> > without horrendous artistic compromises?

> I believe it is possible.? Not in a single font, of course, unless
> the specs change.

I meant coverage of the assigned BMP in a single font.

Richard.


From jameskasskrv at gmail.com  Sat Aug 15 21:32:28 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Sun, 16 Aug 2020 02:32:28 +0000
Subject: Code2003 is a rip-off
In-Reply-To: <76f4f1dc-0194-3850-bb6b-881661fb2579@gmail.com>
References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
 <20200815123329.2b1c6f38@JRWUBU2>
 <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com>
 <20200816021101.5d79eb3f@JRWUBU2>
 <76f4f1dc-0194-3850-bb6b-881661fb2579@gmail.com>
Message-ID: <c1a46a6f-a94c-ee89-422f-0f410408f623@gmail.com>

(This was sent off-list to Richard Wordingham but I'd intended to reply 
to the list.)

On 2020-08-16 1:21 AM, James Kass wrote:
>
> On 2020-08-16 1:11 AM, Richard Wordingham via Unicode wrote:
>>>> Bringing the matter more clearly into the scope of this list, is the
>>>> original goal of Code2000 still achievable?? Is it achievable
>>>> without horrendous artistic compromises?
>>> I believe it is possible.? Not in a single font, of course, unless
>>> the specs change.
>> I meant coverage of the assigned BMP in a single font.
> Ahh.? Yes, I think so.? Haven't done the math, though.? Such a font 
> would be well suited for populating charts but complex shaping 
> wouldn't be happening.? So running text in complex scripts would 
> render poorly.? But not supporting any BMP PUA characters in the font 
> might leave enough room for unmapped glyphs such as ligatures to make 
> complex shaping possible for at least some of the BMP scripts.


From hsivonen at hsivonen.fi  Mon Aug 17 01:38:54 2020
From: hsivonen at hsivonen.fi (Henri Sivonen)
Date: Mon, 17 Aug 2020 09:38:54 +0300
Subject: Generating U+FFFD when there's no content between ISO-2022-JP
 escape sequences
In-Reply-To: <CAJ2xs_G5-7a+tMaa+VUcHSiSVBDqdGECCREmMVS1g_qz8Ce1og@mail.gmail.com>
References: <CAJQvAue1NOu-nY4scb4P52k8Qi01Yp=vYUX9CBj8LBJEY4XY9w@mail.gmail.com>
 <CAJQvAuc1_o27EGPSo7=VqK1mT+3LFrPx3x8Z1-4MoDJN74NODA@mail.gmail.com>
 <CAJ2xs_G5-7a+tMaa+VUcHSiSVBDqdGECCREmMVS1g_qz8Ce1og@mail.gmail.com>
Message-ID: <CAJQvAud2WYr0eKzbc++9SZ4JaN27p96gTMyBmnhMKkE08Smp5g@mail.gmail.com>

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ?? <mark at macchiato.com> wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivonen at hsivonen.fi> wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( ? ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text  (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to a ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP escape
>> > sequence. Chrome and Safari do, but their implementations of
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
>> > decoder implementations generally are informed by the Encoding
>> > Standard (though the ISO-2022-JP decoder specifically might not be
>> > yet), and I suspect that Safari's implementation (ICU) is either
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations,
>> > obfuscating the ASCII string "delete", could be accomplished by
>> > alternating between the ASCII and Roman states to that every other
>> > character is in the ASCII state and the rest of the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content between
>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
>> > transitions or useless transitions between ASCII and Roman are not
>> > also required to generate U+FFFD? Would it even be feasible (in terms
>> > of interop with legacy encoders) to make useless transitions between
>> > ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivonen at hsivonen.fi
>> > https://hsivonen.fi/
>>
>>
>>
>> --
>> Henri Sivonen
>> hsivonen at hsivonen.fi
>> https://hsivonen.fi/
>>


-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


From harjitmoe at outlook.com  Mon Aug 17 01:59:30 2020
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 17 Aug 2020 06:59:30 +0000
Subject: Generating U+FFFD when there's no content between ISO-2022-JP
 escape sequences
In-Reply-To: <CAJQvAud2WYr0eKzbc++9SZ4JaN27p96gTMyBmnhMKkE08Smp5g@mail.gmail.com>
References: <CAJQvAue1NOu-nY4scb4P52k8Qi01Yp=vYUX9CBj8LBJEY4XY9w@mail.gmail.com>
 <CAJQvAuc1_o27EGPSo7=VqK1mT+3LFrPx3x8Z1-4MoDJN74NODA@mail.gmail.com>
 <CAJ2xs_G5-7a+tMaa+VUcHSiSVBDqdGECCREmMVS1g_qz8Ce1og@mail.gmail.com>,
 <CAJQvAud2WYr0eKzbc++9SZ4JaN27p96gTMyBmnhMKkE08Smp5g@mail.gmail.com>
Message-ID: <AM6PR0702MB3671D886BB19ACCD8304AEF7B75F0@AM6PR0702MB3671.eurprd07.prod.outlook.com>

In terms of deployed ISO-2022-JP encoders which don't follow WHATWG behaviour, here's Python's (apparently contributed to Python by one Hye-Shik Chang):

>>> "a?bc~?d".encode("iso-2022-jp")
b'a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd'

This is so far as I can tell valid per the RFC (and of course ECMA-35 itself), but not per the WHATWG, whose output would be (to use another bytestring literal) b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'. The difference being that Python's encoder appears to be using a preference order of codesets, with ASCII being before JIS-Roman, while the WHATWG logic is to encode the next character in the current codeset if possible, and switch to another if it is not.

-- Har

________________________________
From: Unicode <unicode-bounces at unicode.org> on behalf of Henri Sivonen via Unicode <unicode at unicode.org>
Sent: 17 August 2020 08:38
To: Mark Davis ?? <mark at macchiato.com>
Cc: Unicode Public <unicode at unicode.org>
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ?? <mark at macchiato.com> wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivonen at hsivonen.fi> wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( ? ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text  (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to a ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP escape
>> > sequence. Chrome and Safari do, but their implementations of
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
>> > decoder implementations generally are informed by the Encoding
>> > Standard (though the ISO-2022-JP decoder specifically might not be
>> > yet), and I suspect that Safari's implementation (ICU) is either
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations,
>> > obfuscating the ASCII string "delete", could be accomplished by
>> > alternating between the ASCII and Roman states to that every other
>> > character is in the ASCII state and the rest of the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content between
>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
>> > transitions or useless transitions between ASCII and Roman are not
>> > also required to generate U+FFFD? Would it even be feasible (in terms
>> > of interop with legacy encoders) to make useless transitions between
>> > ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivonen at hsivonen.fi
>> > https://hsivonen.fi/
>>
>>
>>
>> --
>> Henri Sivonen
>> hsivonen at hsivonen.fi
>> https://hsivonen.fi/
>>


--
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20200817/58b2045b/attachment.htm>

From Shawn.Steele at microsoft.com  Mon Aug 17 02:17:38 2020
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Mon, 17 Aug 2020 07:17:38 +0000
Subject: Generating U+FFFD when there's no content between ISO-2022-JP
 escape sequences
In-Reply-To: <CAJQvAud2WYr0eKzbc++9SZ4JaN27p96gTMyBmnhMKkE08Smp5g@mail.gmail.com>
References: <CAJQvAue1NOu-nY4scb4P52k8Qi01Yp=vYUX9CBj8LBJEY4XY9w@mail.gmail.com>
 <CAJQvAuc1_o27EGPSo7=VqK1mT+3LFrPx3x8Z1-4MoDJN74NODA@mail.gmail.com>
 <CAJ2xs_G5-7a+tMaa+VUcHSiSVBDqdGECCREmMVS1g_qz8Ce1og@mail.gmail.com>
 <CAJQvAud2WYr0eKzbc++9SZ4JaN27p96gTMyBmnhMKkE08Smp5g@mail.gmail.com>
Message-ID: <MWHPR21MB08472C41FCABC702B209ADFF825F0@MWHPR21MB0847.namprd21.prod.outlook.com>

IMO, encodings, particularly ones depending on state such as this, may have multiple ways to output the same, or similar, sequences.  When means that pretty much any time an encoding transforms data any previous security or other validation style checks are no longer valid and any security/validation must be checked for again.  I've seen numerous mistakes due to people expecting encodings to play nicely, particularly if there are different endpoints that may use different implementations with slightly different behaviors.

-Shawn

-----Original Message-----
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Henri Sivonen via Unicode
Sent: Sunday, August 16, 2020 11:39 PM
To: Mark Davis ?? <mark at macchiato.com>
Cc: Unicode Public <unicode at unicode.org>
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ?? <mark at macchiato.com> wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there 
>> is no content between two ISO-2022-JP escape sequences from the 
>> WHATWG Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in 
>> that case is not a useful security measure when unnecessary 
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivonen at hsivonen.fi> wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal 
>> > input byte sequence. Instead, it must stop with an error or 
>> > substitute a replacement character (such as U+FFFD (   ) 
>> > REPLACEMENT CHARACTER) or an escape sequence in the output. (See 
>> > also Section 3.5 Deletion of Code Points.) It is important to do 
>> > this not only for byte sequences that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next 
>> > shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 
>> > variants require at least one character in a text segment between 
>> > shift sequences. Security software written to the formal 
>> > specification may not detect malicious text  (for example, "delete" 
>> > with a shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of 
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into 
>> > its ISO-2022-JP decoder algorithm 
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements 
>> > the WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that 
>> > didn't implement this U+FFFD generation behavior (uconv), a bug has 
>> > been logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-44066140
>> > 3
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP 
>> > the unusual and surprising property that concatenating two 
>> > ISO-2022-JP outputs from a conforming encoder can result in a byte 
>> > sequence that is non-conforming as input to a ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP 
>> > escape sequence is immediately followed by another ISO-2022-JP 
>> > escape sequence. Chrome and Safari do, but their implementations of 
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's 
>> > decoder implementations generally are informed by the Encoding 
>> > Standard (though the ISO-2022-JP decoder specifically might not be 
>> > yet), and I suspect that Safari's implementation (ICU) is either 
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations, 
>> > obfuscating the ASCII string "delete", could be accomplished by 
>> > alternating between the ASCII and Roman states to that every other 
>> > character is in the ASCII state and the rest of the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content 
>> > between ISO-2022-JP escape sequences useful if useless 
>> > ASCII-to-ASCII transitions or useless transitions between ASCII and 
>> > Roman are not also required to generate U+FFFD? Would it even be 
>> > feasible (in terms of interop with legacy encoders) to make useless 
>> > transitions between ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivonen at hsivonen.fi
>> > https://hsivonen.fi/
>>
>>
>>
>> --
>> Henri Sivonen
>> hsivonen at hsivonen.fi
>> https://hsivonen.fi/
>>


--
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


From richard.wordingham at ntlworld.com  Mon Aug 17 03:12:54 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 17 Aug 2020 09:12:54 +0100
Subject: Code2003 is a rip-off
In-Reply-To: <c1a46a6f-a94c-ee89-422f-0f410408f623@gmail.com>
References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com>
 <20200815123329.2b1c6f38@JRWUBU2>
 <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com>
 <20200816021101.5d79eb3f@JRWUBU2>
 <76f4f1dc-0194-3850-bb6b-881661fb2579@gmail.com>
 <c1a46a6f-a94c-ee89-422f-0f410408f623@gmail.com>
Message-ID: <20200817091254.7d5201c1@JRWUBU2>

On Sun, 16 Aug 2020 02:32:28 +0000
James Kass via Unicode <unicode at unicode.org> wrote:

> (This was sent off-list to Richard Wordingham but I'd intended to
> reply to the list.)

> On 2020-08-16 1:21 AM, James Kass wrote:

> > On 2020-08-16 1:11 AM, Richard Wordingham via Unicode wrote:  

>>>> Bringing the matter more clearly into the scope of this list, is
>>>> the original goal of Code2000 still achievable?? Is it achievable
>>>> without horrendous artistic compromises?  

>>> I believe it is possible.? Not in a single font, of course, unless
>>> the specs change.  

>> I meant coverage of the assigned BMP in a single font.  

> Ahh.? Yes, I think so.? Haven't done the math, though.? Such a font 
> would be well suited for populating charts but complex shaping 
> wouldn't be happening.? So running text in complex scripts would 
> render poorly.? But not supporting any BMP PUA characters in the
> font might leave enough room for unmapped glyphs such as ligatures
> to make complex shaping possible for at least some of the BMP
> scripts.  

Chopping out most complex script support was one instance of unethical
behaviour in Code2003!  I'm not sure how far one can cut Devanagari
support back, but I think one has to support at least repha for an
honest claim to support Devanagari. Other Indic scripts are less
forgiving - an 'invisible stacker' (of which there are five in the BMP)
generally compels a change of shape, though a ghastly font might be
able to do tricks for some characters by positioning base glyphs below
instead of having to have a subscript glyph.  The visible glyphs of
invisible stackers are meant to be reminders that character input is
still in progress.

Of course, shaping in 'simple' scripts can need extra glyphs as well -
the 5 IPA tone characters in the Spacing Modifer Letters need at
least an extra 20 glyph IDs.  

Richard.


From richard.wordingham at ntlworld.com  Mon Aug 17 03:37:47 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 17 Aug 2020 09:37:47 +0100
Subject: Dedotted I and dotlessi
Message-ID: <20200817093747.0a89417a@JRWUBU2>

There is a recommendation around that fonts should generate different
glyph ID sequences for canonically inequivalent character sequences.
Is this still a reasonable requirement?

The most obvious reason for this is that in simple scripts, the glyphs
in the glyph stream follow the order of characters in the character
stream, and therefore processes might hope to convert the glyph stream
back to the character stream.  Now, <i, U+0302 COMBINING CIRCUMFLEX
ACCENT> and <U+0131 LATIN SMALL LETTER DOTLESS I, U+0302> should render
the same, and one shaping trick is to convert both base characters to
the same glyph, commonly called dotlessi.  Glyph stream to character
stream conversions were used in the generation of PDFs and the logic
for extracting text from them.

Is the recommendation still valid, or have things moved on?  Is the
recommendation applicable to Indic scripts, where glyph stream to
character stream conversion may be as complicated as the
reverse direction and there is a natural tendency for distinctions to
be lost.  (In Devanagari, the distinction between mandated and
fallback half-forms is one example.)

Richard.

From dr.khaled.hosny at gmail.com  Mon Aug 17 09:58:40 2020
From: dr.khaled.hosny at gmail.com (Khaled Hosny)
Date: Mon, 17 Aug 2020 16:58:40 +0200
Subject: Dedotted I and dotlessi
In-Reply-To: <20200817093747.0a89417a@JRWUBU2>
References: <20200817093747.0a89417a@JRWUBU2>
Message-ID: <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>


> On Aug 17, 2020, at 10:37 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> There is a recommendation around that fonts should generate different
> glyph ID sequences for canonically inequivalent character sequences.
> Is this still a reasonable requirement?
> 
> The most obvious reason for this is that in simple scripts, the glyphs
> in the glyph stream follow the order of characters in the character
> stream, and therefore processes might hope to convert the glyph stream
> back to the character stream.  Now, <i, U+0302 COMBINING CIRCUMFLEX
> ACCENT> and <U+0131 LATIN SMALL LETTER DOTLESS I, U+0302> should render
> the same, and one shaping trick is to convert both base characters to
> the same glyph, commonly called dotlessi.  Glyph stream to character
> stream conversions were used in the generation of PDFs and the logic
> for extracting text from them.
> 
> Is the recommendation still valid, or have things moved on?

For some PDF work flows, yes.

>  Is the
> recommendation applicable to Indic scripts, where glyph stream to
> character stream conversion may be as complicated as the
> reverse direction and there is a natural tendency for distinctions to
> be lost.  (In Devanagari, the distinction between mandated and
> fallback half-forms is one example.)

Same workflows can?t handle one to many substitution, or reordering, so when I?m doing fonts that need these I usually just give up on the ?unique glyph per code point? requirement. I also forget about it when making Arabic fonts, because extracting Arabic text reliably from PDFs generated with such workflows is a lost cause already.

Regards,
Khaled 

From bobby_devos at sil.org  Mon Aug 17 12:59:14 2020
From: bobby_devos at sil.org (Bobby de Vos)
Date: Mon, 17 Aug 2020 11:59:14 -0600
Subject: Dedotted I and dotlessi
In-Reply-To: <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
Message-ID: <8d950082-380a-8a9c-553b-0d1cf96d4b19@sil.org>

On 2020-08-17 8:58 a.m., Khaled Hosny via Unicode wrote:

>> On Aug 17, 2020, at 10:37 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>>
>>  Is the
>> recommendation applicable to Indic scripts, where glyph stream to
>> character stream conversion may be as complicated as the
>> reverse direction and there is a natural tendency for distinctions to
>> be lost.  (In Devanagari, the distinction between mandated and
>> fallback half-forms is one example.)
> Same workflows can?t handle one to many substitution, or reordering, so when I?m doing fonts that need these I usually just give up on the ?unique glyph per code point? requirement. I also forget about it when making Arabic fonts, because extracting Arabic text reliably from PDFs generated with such workflows is a lost cause already.

A particular workflow might be enhanced to handle, for example, U+093F
DEVANAGARI VOWEL SIGN I where the glyph for this character is re-ordered
compared to the codepoints. I don't see how a workflow would be able to
handle [1] where in Kannada script, codepoints are re-ordered to handle
changing conventions in encoding. That is, the codepoints are re-ordered
before mapping to glyphs, so two different sequences of codepoints will
produce the same glyph stream, IIUC.

[1] https://github.com/harfbuzz/harfbuzz/issues/435#issuecomment-335560167

Regards, Bobby

-- 
Bobby de Vos
/bobby_devos at sil.org/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20200817/4d1e7a67/attachment.htm>

From markus.icu at gmail.com  Mon Aug 17 13:00:31 2020
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 17 Aug 2020 11:00:31 -0700
Subject: Dedotted I and dotlessi
In-Reply-To: <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
Message-ID: <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>

PDFs *should* be generated with Unicode strings, so that copy-and-paste
etc. need not try to map back from glyphs.
Of course, that's optional, and some tools don't bother.
markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20200817/17cd6e7d/attachment.htm>

From dr.khaled.hosny at gmail.com  Mon Aug 17 13:53:01 2020
From: dr.khaled.hosny at gmail.com (Khaled Hosny)
Date: Mon, 17 Aug 2020 20:53:01 +0200
Subject: Dedotted I and dotlessi
In-Reply-To: <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
 <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>
Message-ID: <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>

Easier said than done. Even for tools that want to do this, the only reliable way is tagging with /ActualText, but this has to be done per grapheme cluster as PDF viewers can?t select or highlight parts of text tagged with /ActualText, so Arabic excluded since PDF stores glyphs in visual order and you don?t want to tag full paragraphs. In case of reordering, you will also need to tag the whole reordered sequence as one unit since you can?t tell which glyphs belongs to which character any more. People will also complain about increased file size, so you will have to do tagging selectively for cases than can?t be handled in a different way.

In short, text extraction from PDF is a mess. 

> On Aug 17, 2020, at 8:00 PM, Markus Scherer <markus.icu at gmail.com> wrote:
> 
> PDFs *should* be generated with Unicode strings, so that copy-and-paste etc. need not try to map back from glyphs.
> Of course, that's optional, and some tools don't bother.
> markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20200817/d0f1da37/attachment.htm>

From richard.wordingham at ntlworld.com  Mon Aug 17 17:15:50 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 17 Aug 2020 23:15:50 +0100
Subject: Dedotted I and dotlessi
In-Reply-To: <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
 <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>
 <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>
Message-ID: <20200817231550.21c17200@JRWUBU2>

On Mon, 17 Aug 2020 20:53:01 +0200
Khaled Hosny via Unicode <unicode at unicode.org> wrote:

> Easier said than done. Even for tools that want to do this, the only
> reliable way is tagging with /ActualText, but this has to be done per
> grapheme cluster as PDF viewers can?t select or highlight parts of
> text tagged with /ActualText, so Arabic excluded since PDF stores
> glyphs in visual order and you don?t want to tag full paragraphs.

That's a nasty bug.  Has it been established that negative
(advance)widths are "inconsistent" TrueType and CFF fonts?  I woud have
said that a PDF width of -573 was entirely consistent with a TrueType
width of 573.

> In
> case of reordering, you will also need to tag the whole reordered
> sequence as one unit since you can?t tell which glyphs belongs to
> which character any more. People will also complain about increased
> file size, so you will have to do tagging selectively for cases than
> can?t be handled in a different way.

I don't know if it's due to another feature (or even merely a bug), but
I did notice that LibreOffice-exported PDFs swell enormously if one uses
PDF/A to make Indic text extractable.  This was with a series of
documents that were at least 90% English (in the Latin script).  Zipping
was ineffective.

Richard.


From jameskasskrv at gmail.com  Mon Aug 17 17:31:53 2020
From: jameskasskrv at gmail.com (James Kass)
Date: Mon, 17 Aug 2020 22:31:53 +0000
Subject: Dedotted I and dotlessi
In-Reply-To: <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
 <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>
 <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>
Message-ID: <0cd79dc8-3ba5-5351-2044-eb6ce2325766@gmail.com>


On 2020-08-17 6:53 PM, Khaled Hosny via Unicode wrote:
> In short, text extraction from PDF is a mess.

Search engines such as Google index text from PDFs and offer PDF links 
in the search results.? I wonder how Google handles Arabic (and other 
complex scripts) PDFs.? Have they worked out some kind of method, or are 
such PDFs considered non-indexable?? Maybe OCR from the display?


From richard.wordingham at ntlworld.com  Mon Aug 17 17:59:46 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 17 Aug 2020 23:59:46 +0100
Subject: Dedotted I and dotlessi
In-Reply-To: <8d950082-380a-8a9c-553b-0d1cf96d4b19@sil.org>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
 <8d950082-380a-8a9c-553b-0d1cf96d4b19@sil.org>
Message-ID: <20200817235946.039c64bd@JRWUBU2>

On Mon, 17 Aug 2020 11:59:14 -0600
Bobby de Vos via Unicode <unicode at unicode.org> wrote:

> A particular workflow might be enhanced to handle, for example, U+093F
> DEVANAGARI VOWEL SIGN I where the glyph for this character is
> re-ordered compared to the codepoints. I don't see how a workflow
> would be able to handle [1] where in Kannada script, codepoints are
> re-ordered to handle changing conventions in encoding. That is, the
> codepoints are re-ordered before mapping to glyphs, so two different
> sequences of codepoints will produce the same glyph stream, IIUC.

> [1]
> https://github.com/harfbuzz/harfbuzz/issues/435#issuecomment-335560167

Well, if the reordering is done by the shaping engine, it would be
difficult.  (There are moves afoot to move Indian Indic rendering to
the USE, in which case they might reach the font.)  However, in this
case I would view the rearrangement as akin to canonical equivalence,
where there is no guarantee that copying a string won't change its
encoding.

However, I suspect a Graphite font could leave no trace of virama and
ZWJ in Sinhala script touching conjuncts.  In Graphite, glyphs can
have state, so touching conjunction could be implemented as a type of
kerning.

Richard.

From dr.khaled.hosny at gmail.com  Mon Aug 17 18:33:52 2020
From: dr.khaled.hosny at gmail.com (Khaled Hosny)
Date: Tue, 18 Aug 2020 01:33:52 +0200
Subject: Dedotted I and dotlessi
In-Reply-To: <20200817231550.21c17200@JRWUBU2>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
 <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>
 <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>
 <20200817231550.21c17200@JRWUBU2>
Message-ID: <E2D8EC2A-C12B-4764-8ED8-C4CF07F04E96@gmail.com>


> On Aug 18, 2020, at 12:15 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Mon, 17 Aug 2020 20:53:01 +0200
> Khaled Hosny via Unicode <unicode at unicode.org> wrote:
> 
>> Easier said than done. Even for tools that want to do this, the only
>> reliable way is tagging with /ActualText, but this has to be done per
>> grapheme cluster as PDF viewers can?t select or highlight parts of
>> text tagged with /ActualText, so Arabic excluded since PDF stores
>> glyphs in visual order and you don?t want to tag full paragraphs.
> 
> That's a nasty bug.  Has it been established that negative
> (advance)widths are "inconsistent" TrueType and CFF fonts?  I woud have
> said that a PDF width of -573 was entirely consistent with a TrueType
> width of 573.

It is possible to store glyphs in logical order and adjust their positions so they appear in visual order, but this all break in PDF readers that expect Arabic to be in visual order (since this is what almost all PDF creators do) and try to reverse the Arabic string again to get the logical string (which is not always reliable since there is no standard reverse BiDi algorithm).

>> In
>> case of reordering, you will also need to tag the whole reordered
>> sequence as one unit since you can?t tell which glyphs belongs to
>> which character any more. People will also complain about increased
>> file size, so you will have to do tagging selectively for cases than
>> can?t be handled in a different way.
> 
> I don't know if it's due to another feature (or even merely a bug), but
> I did notice that LibreOffice-exported PDFs swell enormously if one uses
> PDF/A to make Indic text extractable.  This was with a series of
> documents that were at least 90% English (in the Latin script).  Zipping
> was ineffective.

LibreOffice does exactly the selective handling I described: unique one to one and many to one mappings use the font?s /ToUnicode, everything else uses /ActualText tasing per cluster (HarfBuzz cluster which is not always the same as grapheme clusters). As it happens, I wrote that code in LibreOffice.


From dr.khaled.hosny at gmail.com  Mon Aug 17 18:36:34 2020
From: dr.khaled.hosny at gmail.com (Khaled Hosny)
Date: Tue, 18 Aug 2020 01:36:34 +0200
Subject: Dedotted I and dotlessi
In-Reply-To: <0cd79dc8-3ba5-5351-2044-eb6ce2325766@gmail.com>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
 <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>
 <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>
 <0cd79dc8-3ba5-5351-2044-eb6ce2325766@gmail.com>
Message-ID: <52940223-4680-4058-B4DB-043CC15C9FFD@gmail.com>


> On Aug 18, 2020, at 12:31 AM, James Kass via Unicode <unicode at unicode.org> wrote:
> 
> 
> 
> On 2020-08-17 6:53 PM, Khaled Hosny via Unicode wrote:
>> In short, text extraction from PDF is a mess.
> 
> Search engines such as Google index text from PDFs and offer PDF links in the search results.  I wonder how Google handles Arabic (and other complex scripts) PDFs.  Have they worked out some kind of method, or are such PDFs considered non-indexable?  Maybe OCR from the display?

I don?t know what Google does, but the result is often just a garbage of meaningless characters. What tools I have seen their code do is try to recognize runs of Arabic text and reverse the strings to get an approximation of the original logical text that is completely loss.

From dr.khaled.hosny at gmail.com  Mon Aug 17 18:39:10 2020
From: dr.khaled.hosny at gmail.com (Khaled Hosny)
Date: Tue, 18 Aug 2020 01:39:10 +0200
Subject: Dedotted I and dotlessi
In-Reply-To: <E2D8EC2A-C12B-4764-8ED8-C4CF07F04E96@gmail.com>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
 <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>
 <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>
 <20200817231550.21c17200@JRWUBU2>
 <E2D8EC2A-C12B-4764-8ED8-C4CF07F04E96@gmail.com>
Message-ID: <F81F3C4A-ECFC-4170-B8D9-DB7ACB8234C0@gmail.com>


> On Aug 18, 2020, at 1:33 AM, Khaled Hosny <dr.khaled.hosny at gmail.com> wrote:
> 
> 
> 
>> On Aug 18, 2020, at 12:15 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>> 
>> On Mon, 17 Aug 2020 20:53:01 +0200 Khaled Hosny via Unicode <unicode at unicode.org> wrote:
>> 
>>> In
>>> case of reordering, you will also need to tag the whole reordered
>>> sequence as one unit since you can?t tell which glyphs belongs to
>>> which character any more. People will also complain about increased
>>> file size, so you will have to do tagging selectively for cases than
>>> can?t be handled in a different way.
>> 
>> I don't know if it's due to another feature (or even merely a bug), but
>> I did notice that LibreOffice-exported PDFs swell enormously if one uses
>> PDF/A to make Indic text extractable.  This was with a series of
>> documents that were at least 90% English (in the Latin script).  Zipping
>> was ineffective.
> 
> LibreOffice does exactly the selective handling I described: unique one to one and many to one mappings use the font?s /ToUnicode, everything else uses /ActualText tasing per cluster (HarfBuzz cluster which is not always the same as grapheme clusters). As it happens, I wrote that code in LibreOffice.

The PDF/A issue is probably unrelated, since what I?m describing above happens with any PDF export profile.

From richard.wordingham at ntlworld.com  Tue Aug 18 04:24:28 2020
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 18 Aug 2020 10:24:28 +0100
Subject: Dedotted I and dotlessi
In-Reply-To: <F81F3C4A-ECFC-4170-B8D9-DB7ACB8234C0@gmail.com>
References: <20200817093747.0a89417a@JRWUBU2>
 <EF8B7E64-B619-42A9-8811-1FD1DECD46B5@gmail.com>
 <CAN49p6oG7ybGWPUgr13R9jCr1FWMHwsbEEmQL4mCtF+yuYPPFg@mail.gmail.com>
 <AC30B683-803F-4CF0-8EE0-D5DBF42F73BE@gmail.com>
 <20200817231550.21c17200@JRWUBU2>
 <E2D8EC2A-C12B-4764-8ED8-C4CF07F04E96@gmail.com>
 <F81F3C4A-ECFC-4170-B8D9-DB7ACB8234C0@gmail.com>
Message-ID: <20200818102428.668d012e@JRWUBU2>

On Tue, 18 Aug 2020 01:39:10 +0200
Khaled Hosny via Unicode <unicode at unicode.org> wrote:

> >> On Aug 18, 2020, at 12:15 AM, Richard Wordingham via Unicode
> >> <unicode at unicode.org> wrote:

> >> I don't know if it's due to another feature (or even merely a
> >> bug), but I did notice that LibreOffice-exported PDFs swell
> >> enormously if one uses PDF/A to make Indic text extractable.  This
> >> was with a series of documents that were at least 90% English (in
> >> the Latin script).  Zipping was ineffective.  

> The PDF/A issue is probably unrelated, since what I?m describing
> above happens with any PDF export profile.

Indeed, it turns out that Indic text extraction had improved
dramatically since I has last tried it out, and using PDF/A made no
difference to lurking bugs.  (In case it be relevant, I'm using
HarfBuzz 1.2.7 as the system HarfBuzz library, which is the latest I
can get on my Ubuntu 16.04.3 machine using the Debian build system on
the Ubuntu distribution system.  It's probably time to risk an upgrade.)

Richard.


From naa.ganesan at gmail.com  Tue Aug 25 23:19:57 2020
From: naa.ganesan at gmail.com (N. Ganesan)
Date: Tue, 25 Aug 2020 23:19:57 -0500
Subject: 8th century Nagari script in gold coin
In-Reply-To: <CAA+QEUd+iH63wd9PUKds2mshQrSnKha4_fcm8Meh_wrdDwSU7g@mail.gmail.com>
References: <CAA+QEUd+iH63wd9PUKds2mshQrSnKha4_fcm8Meh_wrdDwSU7g@mail.gmail.com>
Message-ID: <CAA+QEUeST9=QtrnqJnz-EsRS241F13oLM+FSU_6beZy-qpZfKw@mail.gmail.com>

Namaste. Can you please read the script on this gold coin?
It is from Bappa Rawal, the founder of Rajput kingdoms (Mewar) in
Rajasthan, India.
About 1250 years old & rare. In the market for $ 10-20 K.

"Sri Moghara" ?
or, "Sri Voppa" ?

Thanks,
N. Ganesan
[image: WhatsApp Image 2020-08-24 at 11.21.22 PM.jpeg]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20200825/5cf5ec36/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: WhatsApp Image 2020-08-24 at 11.21.22 PM.jpeg
Type: image/jpeg
Size: 62179 bytes
Desc: not available
URL: <https://corp.unicode.org/pipermail/unicode/attachments/20200825/5cf5ec36/attachment-0001.jpeg>