From steffen at sdaoden.eu Wed Aug 5 09:40:25 2020 From: steffen at sdaoden.eu (Steffen Nurpmeso) Date: Wed, 05 Aug 2020 16:40:25 +0200 Subject: [Gossip] unicode@unicode.org no longer archived .. In-Reply-To: <20200725174622.QD5fB%steffen@sdaoden.eu> References: <20200720152252.KyZcL%steffen@sdaoden.eu> <20200725174622.QD5fB%steffen@sdaoden.eu> Message-ID: <20200805144025.S48Uc%steffen@sdaoden.eu> Hello Jeff. Steffen Nurpmeso wrote in <20200725174622.QD5fB%steffen at sdaoden.eu>: |Jeff Breidenbach wrote in |: ||I checked and for whatever reason, they simply aren't sending email to ||archive at mail-archive.com. No idea why. If you can help make that happen, ||archiving will work again. | |I will forward it. | ||While we are talking, I wants folks to know that the Mail Archive \ ||itself is ||running fine (knock on wood). But I can't keep up with the support \ ||requests ||though, so sorry to everyone affected by that. | |It's a pity i do not have more money to spend also for "hobby" |projects like MailArchive. |Thank You! I am sorry, i did not get any response from Unicode officials, it seems the industry backing Unicode is no longer interested, and they left for web forums etc., leaving behind rag rugs of the past. Sorry for the noise, and thanks again for the MailArchive!! Ciao from Germany, --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From doug at ewellic.org Wed Aug 5 10:25:44 2020 From: doug at ewellic.org (Doug Ewell) Date: Wed, 5 Aug 2020 09:25:44 -0600 Subject: [Gossip] unicode@unicode.org no longer archived .. 
In-Reply-To: <20200805144025.S48Uc%steffen@sdaoden.eu> References: <20200720152252.KyZcL%steffen@sdaoden.eu> <20200725174622.QD5fB%steffen@sdaoden.eu> <20200805144025.S48Uc%steffen@sdaoden.eu> Message-ID: <000001d66b3c$b0862740$119275c0$@ewellic.org> Steffen Nurpmeso wrote: > I am sorry, i did not get any response from Unicode officials, it > seems the industry backing Unicode is no longer interested, and they > left for web forums etc., leaving behind rag rugs of the past. Sometimes the magic happens without a lot of fanfare. Between these three links: 1) https://corp.unicode.org/pipermail/unicode/ 2) https://www.unicode.org/mail-arch/unicode-ml/ 3) https://www.unicode.org/mail-arch/unicode-ml/Archives-Old/ you should now have login-free access to web archives for the entire history of the Unicode public mailing list. -- Doug Ewell | Thornton, CO, US | ewellic.org From naa.ganesan at gmail.com Sun Aug 9 13:30:44 2020 From: naa.ganesan at gmail.com (N. Ganesan) Date: Sun, 9 Aug 2020 13:30:44 -0500 Subject: Tamil Brahmi Virama at U+11070 Message-ID: I read in this month's UTC meeting minutes, *>Consensus:* Accept U+11070 BRAHMI SIGN OLD TAMIL VIRAMA for encoding >in a future version of the standard. We thank UTC for this decision. It will be far easier to know which is Tamil Brahmi Virama (from other diacritics) when a plain-text message is posted in social media etc., in the future. Just last week, a Shiva Linga with the words, "ekan aatan kOTTam" in Tamil Brahmi inscription has been found in a Shiva temple at Kinnimangalam, near Madurai, Tamil Nadu. Because of the paleography with a fully developed PuLLi system (5 puLLis!), this Lingam ( https://en.wikipedia.org/wiki/Lingam ) can be dated to being ~1800 years old. 
There are 3 Lingas with Tamil Brahmi inscribed on them at (1) Netrambakkam near Madras (2) Kinnimangalam near Madurai and (3) Inuvil in Sri Lanka island and they form the source for all later PaLLippadai memorial temples in the Pallava period and in South East Asia, called Devaraja cult temples such as in Cambodia. http://nganesan.blogspot.com/2020/07/ekamukha-linga-with-tamil-brahmi.html http://nganesan.blogspot.com/2020/07/kinnimangalam-linga-brahmi-pulli.html N. Ganesan -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskasskrv at gmail.com Fri Aug 14 18:23:30 2020 From: jameskasskrv at gmail.com (James Kass) Date: Fri, 14 Aug 2020 23:23:30 +0000 Subject: [off topic] Code2003 is a rip-off Message-ID: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> There's a font called Code2003 which is available for download on various web sites. Most of its glyphs were stolen from my fonts Code2000 and Code2001. Several ranges included in the font which were not covered by my fonts were likely stolen from elsewhere. For example, for the range "Miscellaneous Symbols and Pictographs", its "developer" simply stole the glyphs from the Unicode chart for that range as found on the Unicode web site. (Although some of those glyphs were modified by mirroring or slight rotation. Please see attached graphic.) The extended font information contained within Code2003 lists me as its developer and contains broken links to my old web site and e-mail address. I am not affiliated with "St. Gigafont", I do not steal glyphs from the Unicode web site charts, and Code2003 is being distributed without my permission or authorization. Some download web sites request donations. Any donations are going to "St. Gigafont", not to me. This e-mail is a "heads-up" both to other font developers whose work may have been stolen and to The Unicode Consortium itself because the PDF charts are copyrighted and may use copyrighted fonts.
Please forward this e-mail to interested parties. Best regards, James Kass -------------- next part -------------- A non-text attachment was scrubbed... Name: 20200814_5_Capture.jpg Type: image/jpeg Size: 58025 bytes Desc: not available URL: From richard.wordingham at ntlworld.com Sat Aug 15 06:33:29 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 15 Aug 2020 12:33:29 +0100 Subject: Code2003 is a rip-off In-Reply-To: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> Message-ID: <20200815123329.2b1c6f38@JRWUBU2> On Fri, 14 Aug 2020 23:23:30 +0000 James Kass via Unicode wrote: > There's a font called Code2003 which is available for download on > various web sites. Most of its glyphs were stolen from my fonts > Code2000 and Code2001. Several ranges included in the font which > were not covered by my fonts were likely stolen from elsewhere. For > example, for the range "Miscellaneous Symbols and Pictographs", its > "developer" simply stole the glyphs from the Unicode chart for that > range as found on the Unicode web site. (Although some of those > glyphs were modified by mirroring or slight rotation. Please see > attached graphic.) Have you read the legal defence at https://digiex.net/threads/hello-to-all-i-wish-to-introduce-myself.15144/ ? I think the editor's arguments are wrong, but I think 'rip-off' is too strong a word. He may have made what would be legal use of other fonts if they were themselves legal. > The extended font information contained within Code2003 lists me as > its developer and contains broken links to my old web site and e-mail > address. I am not affiliated with "St. Gigafont", I do not steal > glyphs from the Unicode web site charts, and Code2003 is being > distributed without my permission or authorization. The allocation of plaudits and brickbats is a tricky task. Where do we stand on getting Code2000 licenses? Have you made representations to St.
Gigafont about his implicit accusation of copying? Artistically, you could be aggrieved by the removal of shaping - Devanagari shaping is completely gone. > Some download web sites request donations. Any donations are going > to "St. Gigafont", not to me. I hope that at least they are being channelled to St. Gigafont. I found a copy of the font that said, in its name table, both that it was licensed under the SIL Open Font Licence, and that it was shareware, to be licensed from you for US$5. Does my licence from you cover me for Code2003 so far as your rights are concerned? Have you managed to contact "St. Gigafont"? It's conceivable that some of the donations have been set aside for you. It would seem that "St. Gigafont" has been hosting Code2000 and Code2001. You might even recover an income trickle. > This e-mail is a "heads-up" both to other font developers whose work > may have been stolen and to The Unicode Consortium itself because the > PDF charts are copyrighted and may use copyrighted fonts. I once found that one of my fonts released under the SIL Open Font Licence was being redistributed under the same name but with modifications and no hint of them in the name table. I have wondered whether that constituted a donation of the changes to me. Bringing the matter more clearly into the scope of this list, is the original goal of Code2000 still achievable? Is it achievable without horrendous artistic compromises? I was recently horrified by how many ligatures are needed just to write Pali in the Sinhala script, let alone Sanskrit. Richard.
From jk at koremail.com Sat Aug 15 08:39:10 2020 From: jk at koremail.com (jk at koremail.com) Date: Sat, 15 Aug 2020 21:39:10 +0800 Subject: Code2003 is a rip-off In-Reply-To: <20200815123329.2b1c6f38@JRWUBU2> References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> <20200815123329.2b1c6f38@JRWUBU2> Message-ID: <20195fd5a2c9e90ebeac2f92e7a0e2f3@koremail.com> On 2020-08-15 19:33, Richard Wordingham via Unicode wrote: > On Fri, 14 Aug 2020 23:23:30 +0000 > James Kass via Unicode wrote: > >> There's a font called Code2003 which is available for download on >> various web sites. Most of its glyphs were stolen from my fonts >> Code2000 and Code2001. Several ranges included in the font which >> were not covered by my fonts were likely stolen from elsewhere. For >> example, for the range "Miscellaneous Symbols and Pictographs", its >> "developer" simply stole the glyphs from the Unicode chart for that >> range as found on the Unicode web site. (Although some of those >> glyphs were modified by mirroring or slight rotation. Please see >> attached graphic.) > > Have you read the legal defence at > https://digiex.net/threads/hello-to-all-i-wish-to-introduce-myself.15144/ > ? > I think the editor's arguments are wrong, but I think 'rip-off' is too > strong a word. > There are many stronger words that one could use; James has been very restrained here considering the thousands of hours invested. It is far too common that people ignore the licenses of software. Many of us can recount similar events. Such behaviour is very disheartening for independent developers.
John Knightley From richard.wordingham at ntlworld.com Sat Aug 15 10:13:09 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 15 Aug 2020 16:13:09 +0100 Subject: Code2003 is a rip-off In-Reply-To: <20195fd5a2c9e90ebeac2f92e7a0e2f3@koremail.com> References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> <20200815123329.2b1c6f38@JRWUBU2> <20195fd5a2c9e90ebeac2f92e7a0e2f3@koremail.com> Message-ID: <20200815161309.6dc2b9ed@JRWUBU2> On Sat, 15 Aug 2020 21:39:10 +0800 John Knightley via Unicode wrote: > On 2020-08-15 19:33, Richard Wordingham via Unicode wrote: > > On Fri, 14 Aug 2020 23:23:30 +0000 > > James Kass via Unicode wrote: > > > >> There's a font called Code2003 which is available for download on > >> various web sites. Most of its glyphs were stolen from my fonts > >> Code2000 and Code2001. Several ranges included in the font which > >> were not covered by my fonts were likely stolen from elsewhere. > >> For example, for the range "Miscellaneous Symbols and > >> Pictographs", its "developer" simply stole the glyphs from the > >> Unicode chart for that range as found on the Unicode web site. > >> (Although some of those glyphs were modified by mirroring or > >> slight rotation. Please see attached graphic.) > > > > Have you read the legal defence at > > https://digiex.net/threads/hello-to-all-i-wish-to-introduce-myself.15144/ > > ? > > I think the editor's arguments are wrong, but I think 'rip-off' is > > too strong a word. > > > > There are many stronger words that one could use; James has been very > restrained here considering the thousands of hours invested. It is > far too common that people ignore the licenses of software. Many of us > can recount similar events. Such behaviour is very disheartening for > independent developers. The first point here is that James is not being robbed of income. There does not seem to be any way for the general public to licence the font(s) from him.
Now, James is still being given credit for the glyphs. That doesn't seem to be true for other glyph creators, e.g. Alif Silpachai for the Tai Tham glyphs. Alif's font is free as in free beer, not as in free speech. There is a potential loss of reputation to James as a font can be a lot more than just glyphs. The shaping has mostly gone missing, and he may be criticised for the various consequent shortcomings, whereas there was a time when Code2000 was the best Devanagari font I had available on my machine. The final issue is that he has been robbed of artistic and technical control. That is indeed an issue if it was not James Kass who put the fonts on SourceForge. Now, if the font had been released under the SIL Open Font Licence, he would also have lost control. One may see the name 'Code2003' as impertinent, but at least the font is not being paraded as Code2000, Code2001 or Code2002. Richard. From jameskasskrv at gmail.com Sat Aug 15 17:57:29 2020 From: jameskasskrv at gmail.com (James Kass) Date: Sat, 15 Aug 2020 22:57:29 +0000 Subject: Code2003 is a rip-off In-Reply-To: <20200815161309.6dc2b9ed@JRWUBU2> References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> <20200815123329.2b1c6f38@JRWUBU2> <20195fd5a2c9e90ebeac2f92e7a0e2f3@koremail.com> <20200815161309.6dc2b9ed@JRWUBU2> Message-ID: <9c1603f4-0de1-b18c-cee9-4480bc53507c@gmail.com> On 2020-08-15 3:13 PM, Richard Wordingham via Unicode wrote: > The first point here is that James is not being robbed of income. > There does not seem to be any way for the general public to licence the > font(s) from him. I'm putting my old web site back on line. Code2001 will remain freeware. > Now, James is still being given credit for the glyphs. That doesn't > seem to be true for other glyph creators, e.g. Alif Silpachai for the > Tai Tham glyphs. Alif's font is free as in free beer, not as in free > speech.
I'm being given credit for stealing Alif Silpachai's (among others') work along with the credit for designing those glyphs which I actually designed. It's not flattering to be given credit for being a thief if I'm not one. > > There is a potential loss of reputation to James as a font can be a lot > more than just glyphs. The shaping has mostly gone missing, and he may > be criticised for the various consequent shortcomings, whereas there was > a time when Code2000 was the best Devanagari font I had available on > my machine. Thank you! I hope you will also like the Grantha in the new version of Code2001. > The final issue is that he has been robbed of artistic and technical > control. That is indeed an issue if it was not James Kass who put the > fonts on SourceForge. Now, if the font had been released under the SIL > Open Font Licence, he would also have lost control. One may see the > name 'Code2003' as impertinent, but at least the font is not being > paraded as Code2000, Code 2001 or Code2002. > It's the last digit in the font name which signifies: Code2000 is for plane 0 Code2001 is for plane 1 Code2002 is for plane 2 Code2003 should have been for plane 3 Best regards, James Kass From jameskasskrv at gmail.com Sat Aug 15 18:24:44 2020 From: jameskasskrv at gmail.com (James Kass) Date: Sat, 15 Aug 2020 23:24:44 +0000 Subject: Code2003 is a rip-off In-Reply-To: <20200815123329.2b1c6f38@JRWUBU2> References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> <20200815123329.2b1c6f38@JRWUBU2> Message-ID: <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com> On 2020-08-15 11:33 AM, Richard Wordingham via Unicode wrote: > Have you read the legal defence at > https://digiex.net/threads/hello-to-all-i-wish-to-introduce-myself.15144/ ? > I think the editor's arguments are wrong, but I think 'rip-off' is too > strong a word. No, I hadn't seen this and need to read it carefully. A quick Google search for St. Gigafont and so forth failed to get me any contact information.
Thank you for the link. (Heh, heh, since I was "royally miffed" at the time, I may have searched for "St. Gigabyte", or something.) > > Artistically, you could be aggrieved by the removal of shaping - > Devanagari shaping is completely gone. Probably because of the limitation on the number of glyphs possible in a font. Once you start adding stuff beyond Plane Zero, you run out of room. >> Some download web sites request donations. Any donations are going >> to "St. Gigafont", not to me. > I hope that at least they are being channelled to St. Gigafont. I > found a copy of the font that said, in its name table, both that it > was licensed under the SIL Open Font Licence, and that it > was shareware, to be licensed from you for US$5. Does my licence from you > cover me for Code2003 so far as your rights are concerned? That question should probably be directed to the developer of Code2003. > > Have you managed to contact "St. Gigafont"? It's conceivable that some > of the donations have been set aside for you. It would seem that "St. > Gigafont" has been hosting Code2000 and Code2001. You might even > recover an income trickle. It's not about the money. Based on the above link and additional information, I will certainly attempt to make the acquaintance of the Code2003 "developer". > > Bringing the matter more clearly into the scope of this list, is the > original goal of Code2000 still achievable? Is it achievable without > horrendous artistic compromises? I was recently horrified by how many > ligatures are needed just to write Pali in the Sinhala script, let > alone Sanskrit. > I believe it is possible. Not in a single font, of course, unless the specs change. But with the font family / collection I think it can be done. Wish I'd started on this fifty years ago instead of 22 years ago, though. But I didn't even have a computer in 1970.
Best regards, James Kass From richard.wordingham at ntlworld.com Sat Aug 15 20:11:01 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 16 Aug 2020 02:11:01 +0100 Subject: Code2003 is a rip-off In-Reply-To: <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com> References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> <20200815123329.2b1c6f38@JRWUBU2> <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com> Message-ID: <20200816021101.5d79eb3f@JRWUBU2> On Sat, 15 Aug 2020 23:24:44 +0000 James Kass via Unicode wrote: > On 2020-08-15 11:33 AM, Richard Wordingham via Unicode wrote: >> Does my licence from >> you cover me for Code2003 so far as your rights are concerned? > That question should probably be directed to the developer of > Code2003. As to his IPR, the SIL Open Font Licence applies. > Based on the above link and additional > information, I will certainly attempt to make the acquaintance of the > Code2003 "developer". I have confirmed that Code2003's glyphs are unlicensed. St. Gigafont clearly doesn't understand the different types of 'free'. > > Bringing the matter more clearly into the scope of this list, is the > > original goal of Code2000 still achievable? Is it achievable > > without horrendous artistic compromises? > I believe it is possible. Not in a single font, of course, unless > the specs change. I meant coverage of the assigned BMP in a single font. Richard. From jameskasskrv at gmail.com Sat Aug 15 21:32:28 2020 From: jameskasskrv at gmail.com (James Kass) Date: Sun, 16 Aug 2020 02:32:28 +0000 Subject: Code2003 is a rip-off In-Reply-To: <76f4f1dc-0194-3850-bb6b-881661fb2579@gmail.com> References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> <20200815123329.2b1c6f38@JRWUBU2> <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com> <20200816021101.5d79eb3f@JRWUBU2> <76f4f1dc-0194-3850-bb6b-881661fb2579@gmail.com> Message-ID: (This was sent off-list to Richard Wordingham but I'd intended to reply to the list.)
On 2020-08-16 1:21 AM, James Kass wrote: > > On 2020-08-16 1:11 AM, Richard Wordingham via Unicode wrote: >>>> Bringing the matter more clearly into the scope of this list, is the >>>> original goal of Code2000 still achievable? Is it achievable >>>> without horrendous artistic compromises? >>> I believe it is possible. Not in a single font, of course, unless >>> the specs change. >> I meant coverage of the assigned BMP in a single font. > Ahh. Yes, I think so. Haven't done the math, though. Such a font > would be well suited for populating charts but complex shaping > wouldn't be happening. So running text in complex scripts would > render poorly. But not supporting any BMP PUA characters in the font > might leave enough room for unmapped glyphs such as ligatures to make > complex shaping possible for at least some of the BMP scripts. From hsivonen at hsivonen.fi Mon Aug 17 01:38:54 2020 From: hsivonen at hsivonen.fi (Henri Sivonen) Date: Mon, 17 Aug 2020 09:38:54 +0300 Subject: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences In-Reply-To: References: Message-ID: Sorry about the delay. There is now https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ?? wrote: > > I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended. > > Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/? > > Mark > > > On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode wrote: >> >> We're about to remove the U+FFFD generation for the case where there >> is no content between two ISO-2022-JP escape sequences from the WHATWG >> Encoding Standard.
>> >> Is there anything wrong with my analysis that U+FFFD generation in >> that case is not a useful security measure when unnecessary >> transitions between the ASCII and Roman states do not generate U+FFFD? >> >> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen wrote: >> > >> > Context: https://github.com/whatwg/encoding/issues/115 >> > >> > Unicode Security Considerations say: >> > "3.6.2 Some Output For All Input >> > >> > Character encoding conversion must also not simply skip an illegal >> > input byte sequence. Instead, it must stop with an error or substitute >> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER) >> > or an escape sequence in the output. (See also Section 3.5 Deletion of >> > Code Points.) It is important to do this not only for byte sequences >> > that encode characters, but also for unrecognized or "empty" >> > state-change sequences. For example: >> > [...] >> > ISO-2022 shift sequences without text characters before the next shift >> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants >> > require at least one character in a text segment between shift >> > sequences. Security software written to the formal specification may >> > not detect malicious text (for example, "delete" with a >> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)." >> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input) >> > >> > The WHATWG Encoding Standard bakes this requirement by the means of >> > "ISO-2022-JP output flag" >> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its >> > ISO-2022-JP decoder algorithm >> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder). >> > >> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the >> > WHATWG spec.
>> > After Gecko switched to encoding_rs from an implementation that didn't >> > implement this U+FFFD generation behavior (uconv), a bug has been >> > logged in the context of decoding Japanese email in Thunderbird: >> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136 >> > >> > Ken Lunde also recalls seeing such email: >> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403 >> > >> > The root problem seems to be that the requirement gives ISO-2022-JP >> > the unusual and surprising property that concatenating two ISO-2022-JP >> > outputs from a conforming encoder can result in a byte sequence that >> > is non-conforming as input to an ISO-2022-JP decoder. >> > >> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape >> > sequence is immediately followed by another ISO-2022-JP escape >> > sequence. Chrome and Safari do, but their implementations of >> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's >> > decoder implementations generally are informed by the Encoding >> > Standard (though the ISO-2022-JP decoder specifically might not be >> > yet), and I suspect that Safari's implementation (ICU) is either >> > informed by Unicode Security Considerations or vice versa. >> > >> > The example given as rationale in Unicode Security Considerations, >> > obfuscating the ASCII string "delete", could be accomplished by >> > alternating between the ASCII and Roman states so that every other >> > character is in the ASCII state and the rest in the Roman state. >> > >> > Is the requirement to generate U+FFFD when there is no content between >> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII >> > transitions or useless transitions between ASCII and Roman are not >> > also required to generate U+FFFD? Would it even be feasible (in terms >> > of interop with legacy encoders) to make useless transitions between >> > ASCII and Roman generate U+FFFD?
>> > >> > -- >> > Henri Sivonen >> > hsivonen at hsivonen.fi >> > https://hsivonen.fi/ >> >> >> >> -- >> Henri Sivonen >> hsivonen at hsivonen.fi >> https://hsivonen.fi/ >> -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From harjitmoe at outlook.com Mon Aug 17 01:59:30 2020 From: harjitmoe at outlook.com (Harriet Riddle) Date: Mon, 17 Aug 2020 06:59:30 +0000 Subject: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences In-Reply-To: References: , Message-ID: In terms of deployed ISO-2022-JP encoders which don't follow WHATWG behaviour, here's Python's (apparently contributed to Python by one Hye-Shik Chang): >>> "a?bc~?d".encode("iso-2022-jp") b'a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd' This is so far as I can tell valid per the RFC (and of course ECMA-35 itself), but not per the WHATWG, whose output would be (to use another bytestring literal) b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'. The difference being that Python's encoder appears to be using a preference order of codesets, with ASCII being before JIS-Roman, while the WHATWG logic is to encode the next character in the current codeset if possible, and switch to another if it is not. -- Har ________________________________ From: Unicode on behalf of Henri Sivonen via Unicode Sent: 17 August 2020 08:38 To: Mark Davis ?? Cc: Unicode Public Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences Sorry about the delay. There is now https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ?? wrote: > > I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended. > > Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/? 
> > Mark > > > On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode wrote: >> >> We're about to remove the U+FFFD generation for the case where there >> is no content between two ISO-2022-JP escape sequences from the WHATWG >> Encoding Standard. >> >> Is there anything wrong with my analysis that U+FFFD generation in >> that case is not a useful security measure when unnecessary >> transitions between the ASCII and Roman states do not generate U+FFFD? >> >> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen wrote: >> > >> > Context: https://github.com/whatwg/encoding/issues/115 >> > >> > Unicode Security Considerations say: >> > "3.6.2 Some Output For All Input >> > >> > Character encoding conversion must also not simply skip an illegal >> > input byte sequence. Instead, it must stop with an error or substitute >> > a replacement character (such as U+FFFD ( ? ) REPLACEMENT CHARACTER) >> > or an escape sequence in the output. (See also Section 3.5 Deletion of >> > Code Points.) It is important to do this not only for byte sequences >> > that encode characters, but also for unrecognized or "empty" >> > state-change sequences. For example: >> > [...] >> > ISO-2022 shift sequences without text characters before the next shift >> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants >> > require at least one character in a text segment between shift >> > sequences. Security software written to the formal specification may >> > not detect malicious text (for example, "delete" with a >> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)." >> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input) >> > >> > The WHATWG Encoding Standard bakes this requirement by the means of >> > "ISO-2022-JP output flag" >> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its >> > ISO-2022-JP decoder algorithm >> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder). 
>> > >> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the >> > WHATWG spec. >> > >> > After Gecko switched to encoding_rs from an implementation that didn't >> > implement this U+FFFD generation behavior (uconv), a bug has been >> > logged in the context of decoding Japanese email in Thunderbird: >> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136 >> > >> > Ken Lunde also recalls seeing such email: >> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403 >> > >> > The root problem seems to be that the requirement gives ISO-2022-JP >> > the unusual and surprising property that concatenating two ISO-2022-JP >> > outputs from a conforming encoder can result in a byte sequence that >> > is non-conforming as input to a ISO-2022-JP decoder. >> > >> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape >> > sequence is immediately followed by another ISO-2022-JP escape >> > sequence. Chrome and Safari do, but their implementations of >> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's >> > decoder implementations generally are informed by the Encoding >> > Standard (though the ISO-2022-JP decoder specifically might not be >> > yet), and I suspect that Safari's implementation (ICU) is either >> > informed by Unicode Security Considerations or vice versa. >> > >> > The example given as rationale in Unicode Security Considerations, >> > obfuscating the ASCII string "delete", could be accomplished by >> > alternating between the ASCII and Roman states to that every other >> > character is in the ASCII state and the rest of the Roman state. >> > >> > Is the requirement to generate U+FFFD when there is no content between >> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII >> > transitions or useless transitions between ASCII and Roman are not >> > also required to generate U+FFFD? 
Would it even be feasible (in terms >> > of interop with legacy encoders) to make useless transitions between >> > ASCII and Roman generate U+FFFD? >> > >> > -- >> > Henri Sivonen >> > hsivonen at hsivonen.fi >> > https://hsivonen.fi/ >> >> >> >> -- >> Henri Sivonen >> hsivonen at hsivonen.fi >> https://hsivonen.fi/ >> -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Mon Aug 17 02:17:38 2020 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Mon, 17 Aug 2020 07:17:38 +0000 Subject: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences In-Reply-To: References: Message-ID: IMO, encodings, particularly ones depending on state such as this, may have multiple ways to output the same, or similar, sequences. Which means that pretty much any time an encoding transforms data, any previous security or other validation-style checks are no longer valid and any security/validation must be checked for again. I've seen numerous mistakes due to people expecting encodings to play nicely, particularly if there are different endpoints that may use different implementations with slightly different behaviors. -Shawn -----Original Message----- From: Unicode On Behalf Of Henri Sivonen via Unicode Sent: Sunday, August 16, 2020 11:39 PM To: Mark Davis ?? Cc: Unicode Public Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences Sorry about the delay. There is now https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ?? wrote: > > I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.
> > Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/? > > Mark > > > On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode wrote: >> >> We're about to remove the U+FFFD generation for the case where there >> is no content between two ISO-2022-JP escape sequences from the >> WHATWG Encoding Standard. >> >> Is there anything wrong with my analysis that U+FFFD generation in >> that case is not a useful security measure when unnecessary >> transitions between the ASCII and Roman states do not generate U+FFFD? >> >> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen wrote: >> > >> > Context: https://github.com/whatwg/encoding/issues/115 >> > >> > Unicode Security Considerations say: >> > "3.6.2 Some Output For All Input >> > >> > Character encoding conversion must also not simply skip an illegal >> > input byte sequence. Instead, it must stop with an error or >> > substitute a replacement character (such as U+FFFD ( � ) >> > REPLACEMENT CHARACTER) or an escape sequence in the output. (See >> > also Section 3.5 Deletion of Code Points.) It is important to do >> > this not only for byte sequences that encode characters, but also for unrecognized or "empty" >> > state-change sequences. For example: >> > [...] >> > ISO-2022 shift sequences without text characters before the next >> > shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 >> > variants require at least one character in a text segment between >> > shift sequences. Security software written to the formal >> > specification may not detect malicious text (for example, "delete" >> > with a shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input) >> > >> > The WHATWG Encoding Standard bakes this requirement by means of >> > "ISO-2022-JP output flag" >> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into >> > its ISO-2022-JP decoder algorithm >> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder). >> > >> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements >> > the WHATWG spec. >> > >> > After Gecko switched to encoding_rs from an implementation that >> > didn't implement this U+FFFD generation behavior (uconv), a bug has >> > been logged in the context of decoding Japanese email in Thunderbird: >> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136 >> > >> > Ken Lunde also recalls seeing such email: >> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403 >> > >> > The root problem seems to be that the requirement gives ISO-2022-JP >> > the unusual and surprising property that concatenating two >> > ISO-2022-JP outputs from a conforming encoder can result in a byte >> > sequence that is non-conforming as input to an ISO-2022-JP decoder. >> > >> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP >> > escape sequence is immediately followed by another ISO-2022-JP >> > escape sequence. Chrome and Safari do, but their implementations of >> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's >> > decoder implementations generally are informed by the Encoding >> > Standard (though the ISO-2022-JP decoder specifically might not be >> > yet), and I suspect that Safari's implementation (ICU) is either >> > informed by Unicode Security Considerations or vice versa. >> > >> > The example given as rationale in Unicode Security Considerations, >> > obfuscating the ASCII string "delete", could be accomplished by >> > alternating between the ASCII and Roman states so that every other >> > character is in the ASCII state and the rest in the Roman state.
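The concatenation hazard described in the quoted message can be reproduced with Python's bundled iso2022_jp codec (a small sketch for illustration; note that Python's own decoder happens to be lenient about adjacent escape sequences, unlike decoders that follow the TR36 recommendation):

```python
# Each conforming encoder output ends by shifting back to ASCII (ESC ( B),
# and the next output begins with its own shift to JIS X 0208 (ESC $ B).
a = "あ".encode("iso2022_jp")
assert a == b'\x1b$B$"\x1b(B'   # ESC $ B, the kana bytes, ESC ( B

combined = a + a
# The concatenation contains two escape sequences with no text between
# them -- exactly the byte pattern TR36 asks decoders to flag.
assert b"\x1b(B\x1b$B" in combined

# Python's decoder accepts it silently; a decoder following the TR36 /
# older WHATWG behavior would emit U+FFFD at the empty segment instead.
assert combined.decode("iso2022_jp") == "ああ"
```

So two individually conforming encoder outputs, byte-concatenated, form exactly the input that the U+FFFD requirement declares malformed.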
>> > >> > Is the requirement to generate U+FFFD when there is no content >> > between ISO-2022-JP escape sequences useful if useless >> > ASCII-to-ASCII transitions or useless transitions between ASCII and >> > Roman are not also required to generate U+FFFD? Would it even be >> > feasible (in terms of interop with legacy encoders) to make useless >> > transitions between ASCII and Roman generate U+FFFD? >> > >> > -- >> > Henri Sivonen >> > hsivonen at hsivonen.fi >> > https://hsivonen.fi/ >> >> >> >> -- >> Henri Sivonen >> hsivonen at hsivonen.fi >> https://hsivonen.fi/ >> -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From richard.wordingham at ntlworld.com Mon Aug 17 03:12:54 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 17 Aug 2020 09:12:54 +0100 Subject: Code2003 is a rip-off In-Reply-To: References: <3e0efba0-9f63-feb2-becb-cb88ebfe4b87@gmail.com> <20200815123329.2b1c6f38@JRWUBU2> <7d8750eb-6c4c-5c7e-b99f-73a55037ceed@gmail.com> <20200816021101.5d79eb3f@JRWUBU2> <76f4f1dc-0194-3850-bb6b-881661fb2579@gmail.com> Message-ID: <20200817091254.7d5201c1@JRWUBU2> On Sun, 16 Aug 2020 02:32:28 +0000 James Kass via Unicode wrote: > (This was sent off-list to Richard Wordingham but I'd intended to > reply to the list.) > On 2020-08-16 1:21 AM, James Kass wrote: > > On 2020-08-16 1:11 AM, Richard Wordingham via Unicode wrote: >>>> Bringing the matter more clearly into the scope of this list, is >>>> the original goal of Code2000 still achievable? Is it achievable >>>> without horrendous artistic compromises? >>> I believe it is possible. Not in a single font, of course, unless >>> the specs change. >> I meant coverage of the assigned BMP in a single font. > Ahh. Yes, I think so. Haven't done the math, though. Such a font > would be well suited for populating charts but complex shaping > wouldn't be happening. So running text in complex scripts would > render poorly.
But not supporting any BMP PUA characters in the > font might leave enough room for unmapped glyphs such as ligatures > to make complex shaping possible for at least some of the BMP > scripts. Chopping out most complex script support was one instance of unethical behaviour in Code2003! I'm not sure how far one can cut Devanagari support back, but I think one has to support at least repha for an honest claim to support Devanagari. Other Indic scripts are less forgiving - an 'invisible stacker' (of which there are five in the BMP) generally compels a change of shape, though a ghastly font might be able to do tricks for some characters by positioning base glyphs below instead of having to have a subscript glyph. The visible glyphs of invisible stackers are meant to be reminders that character input is still in progress. Of course, shaping in 'simple' scripts can need extra glyphs as well - the 5 IPA tone characters in the Spacing Modifier Letters need at least an extra 20 glyph IDs. Richard. From richard.wordingham at ntlworld.com Mon Aug 17 03:37:47 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 17 Aug 2020 09:37:47 +0100 Subject: Dedotted I and dotlessi Message-ID: <20200817093747.0a89417a@JRWUBU2> There is a recommendation around that fonts should generate different glyph ID sequences for canonically inequivalent character sequences. Is this still a reasonable requirement? The most obvious reason for this is that in simple scripts, the glyphs in the glyph stream follow the order of characters in the character stream, and therefore processes might hope to convert the glyph stream back to the character stream. Now, <i, combining acute accent> and <dotless i, combining acute accent> should render the same, and one shaping trick is to convert both base characters to the same glyph, commonly called dotlessi. Glyph stream to character stream conversions were used in the generation of PDFs and the logic for extracting text from them. Is the recommendation still valid, or have things moved on?
Is the recommendation applicable to Indic scripts, where glyph stream to character stream conversion may be as complicated as the reverse direction and there is a natural tendency for distinctions to be lost. (In Devanagari, the distinction between mandated and fallback half-forms is one example.) Richard. From dr.khaled.hosny at gmail.com Mon Aug 17 09:58:40 2020 From: dr.khaled.hosny at gmail.com (Khaled Hosny) Date: Mon, 17 Aug 2020 16:58:40 +0200 Subject: Dedotted I and dotlessi In-Reply-To: <20200817093747.0a89417a@JRWUBU2> References: <20200817093747.0a89417a@JRWUBU2> Message-ID: > On Aug 17, 2020, at 10:37 AM, Richard Wordingham via Unicode wrote: > > There is a recommendation around that fonts should generate different > glyph ID sequences for canonically inequivalent character sequences. > Is this still a reasonable requirement? > > The most obvious reason for this is that in simple scripts, the glyphs > in the glyph stream follow the order of characters in the character > stream, and therefore processes might hope to convert the glyph stream > back to the character stream. Now, <i, combining acute accent> and <dotless i, combining acute accent> should render > the same, and one shaping trick is to convert both base characters to > the same glyph, commonly called dotlessi. Glyph stream to character > stream conversions were used in the generation of PDFs and the logic > for extracting text from them. > > Is the recommendation still valid, or have things moved on? For some PDF work flows, yes. > Is the > recommendation applicable to Indic scripts, where glyph stream to > character stream conversion may be as complicated as the > reverse direction and there is a natural tendency for distinctions to > be lost. (In Devanagari, the distinction between mandated and > fallback half-forms is one example.) Same workflows can't handle one to many substitution, or reordering, so when I'm doing fonts that need these I usually just give up on the "unique glyph per code point" requirement.
I also forget about it when making Arabic fonts, because extracting Arabic text reliably from PDFs generated with such workflows is a lost cause already. Regards, Khaled From bobby_devos at sil.org Mon Aug 17 12:59:14 2020 From: bobby_devos at sil.org (Bobby de Vos) Date: Mon, 17 Aug 2020 11:59:14 -0600 Subject: Dedotted I and dotlessi In-Reply-To: References: <20200817093747.0a89417a@JRWUBU2> Message-ID: <8d950082-380a-8a9c-553b-0d1cf96d4b19@sil.org> On 2020-08-17 8:58 a.m., Khaled Hosny via Unicode wrote: >> On Aug 17, 2020, at 10:37 AM, Richard Wordingham via Unicode wrote: >> >> Is the >> recommendation applicable to Indic scripts, where glyph stream to >> character stream conversion may be as complicated as the >> reverse direction and there is a natural tendency for distinctions to >> be lost. (In Devanagari, the distinction between mandated and >> fallback half-forms is one example.) > Same workflows can't handle one to many substitution, or reordering, so when I'm doing fonts that need these I usually just give up on the "unique glyph per code point" requirement. I also forget about it when making Arabic fonts, because extracting Arabic text reliably from PDFs generated with such workflows is a lost cause already. A particular workflow might be enhanced to handle, for example, U+093F DEVANAGARI VOWEL SIGN I where the glyph for this character is re-ordered compared to the codepoints. I don't see how a workflow would be able to handle [1] where in Kannada script, codepoints are re-ordered to handle changing conventions in encoding. That is, the codepoints are re-ordered before mapping to glyphs, so two different sequences of codepoints will produce the same glyph stream, IIUC. [1] https://github.com/harfbuzz/harfbuzz/issues/435#issuecomment-335560167 Regards, Bobby -- Bobby de Vos /bobby_devos at sil.org/ -------------- next part -------------- An HTML attachment was scrubbed...
URL: From markus.icu at gmail.com Mon Aug 17 13:00:31 2020 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 17 Aug 2020 11:00:31 -0700 Subject: Dedotted I and dotlessi In-Reply-To: References: <20200817093747.0a89417a@JRWUBU2> Message-ID: PDFs *should* be generated with Unicode strings, so that copy-and-paste etc. need not try to map back from glyphs. Of course, that's optional, and some tools don't bother. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From dr.khaled.hosny at gmail.com Mon Aug 17 13:53:01 2020 From: dr.khaled.hosny at gmail.com (Khaled Hosny) Date: Mon, 17 Aug 2020 20:53:01 +0200 Subject: Dedotted I and dotlessi In-Reply-To: References: <20200817093747.0a89417a@JRWUBU2> Message-ID: Easier said than done. Even for tools that want to do this, the only reliable way is tagging with /ActualText, but this has to be done per grapheme cluster as PDF viewers can't select or highlight parts of text tagged with /ActualText, so Arabic is excluded since PDF stores glyphs in visual order and you don't want to tag full paragraphs. In case of reordering, you will also need to tag the whole reordered sequence as one unit since you can't tell which glyphs belong to which character any more. People will also complain about increased file size, so you will have to do tagging selectively for cases that can't be handled in a different way. In short, text extraction from PDF is a mess. > On Aug 17, 2020, at 8:00 PM, Markus Scherer wrote: > > PDFs *should* be generated with Unicode strings, so that copy-and-paste etc. need not try to map back from glyphs. > Of course, that's optional, and some tools don't bother. > markus -------------- next part -------------- An HTML attachment was scrubbed...
URL: From richard.wordingham at ntlworld.com Mon Aug 17 17:15:50 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 17 Aug 2020 23:15:50 +0100 Subject: Dedotted I and dotlessi In-Reply-To: References: <20200817093747.0a89417a@JRWUBU2> Message-ID: <20200817231550.21c17200@JRWUBU2> On Mon, 17 Aug 2020 20:53:01 +0200 Khaled Hosny via Unicode wrote: > Easier said than done. Even for tools that want to do this, the only > reliable way is tagging with /ActualText, but this has to be done per > grapheme cluster as PDF viewers can't select or highlight parts of > text tagged with /ActualText, so Arabic is excluded since PDF stores > glyphs in visual order and you don't want to tag full paragraphs. That's a nasty bug. Has it been established that negative (advance) widths are "inconsistent" with TrueType and CFF fonts? I would have said that a PDF width of -573 was entirely consistent with a TrueType width of 573. > In > case of reordering, you will also need to tag the whole reordered > sequence as one unit since you can't tell which glyphs belong to > which character any more. People will also complain about increased > file size, so you will have to do tagging selectively for cases that > can't be handled in a different way. I don't know if it's due to another feature (or even merely a bug), but I did notice that LibreOffice-exported PDFs swell enormously if one uses PDF/A to make Indic text extractable. This was with a series of documents that were at least 90% English (in the Latin script). Zipping was ineffective. Richard. From jameskasskrv at gmail.com Mon Aug 17 17:31:53 2020 From: jameskasskrv at gmail.com (James Kass) Date: Mon, 17 Aug 2020 22:31:53 +0000 Subject: Dedotted I and dotlessi In-Reply-To: References: <20200817093747.0a89417a@JRWUBU2> Message-ID: <0cd79dc8-3ba5-5351-2044-eb6ce2325766@gmail.com> On 2020-08-17 6:53 PM, Khaled Hosny via Unicode wrote: > In short, text extraction from PDF is a mess.
Search engines such as Google index text from PDFs and offer PDF links in the search results. I wonder how Google handles Arabic (and other complex scripts) PDFs. Have they worked out some kind of method, or are such PDFs considered non-indexable? Maybe OCR from the display? From richard.wordingham at ntlworld.com Mon Aug 17 17:59:46 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 17 Aug 2020 23:59:46 +0100 Subject: Dedotted I and dotlessi In-Reply-To: <8d950082-380a-8a9c-553b-0d1cf96d4b19@sil.org> References: <20200817093747.0a89417a@JRWUBU2> <8d950082-380a-8a9c-553b-0d1cf96d4b19@sil.org> Message-ID: <20200817235946.039c64bd@JRWUBU2> On Mon, 17 Aug 2020 11:59:14 -0600 Bobby de Vos via Unicode wrote: > A particular workflow might be enhanced to handle, for example, U+093F > DEVANAGARI VOWEL SIGN I where the glyph for this character is > re-ordered compared to the codepoints. I don't see how a workflow > would be able to handle [1] where in Kannada script, codepoints are > re-ordered to handle changing conventions in encoding. That is, the > codepoints are re-ordered before mapping to glyphs, so two different > sequences of codepoints will produce the same glyph stream, IIUC. > [1] > https://github.com/harfbuzz/harfbuzz/issues/435#issuecomment-335560167 Well, if the reordering is done by the shaping engine, it would be difficult. (There are moves afoot to move Indian Indic rendering to the USE, in which case they might reach the font.) However, in this case I would view the rearrangement as akin to canonical equivalence, where there is no guarantee that copying a string won't change its encoding. However, I suspect a Graphite font could leave no trace of virama and ZWJ in Sinhala script touching conjuncts. In Graphite, glyphs can have state, so touching conjunction could be implemented as a type of kerning. Richard.
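The pair of sequences at the start of this thread can be checked directly with Python's unicodedata (a sketch; the character choices are the obvious reading of the scrubbed angle-bracket text): i + combining acute is canonically equivalent to precomposed í, while dotless i + combining acute has no precomposed form, even though a font typically renders both with the same dotless-i base glyph.

```python
import unicodedata

seq_i = "i\u0301"            # LATIN SMALL LETTER I + COMBINING ACUTE ACCENT
seq_dotless = "\u0131\u0301"  # LATIN SMALL LETTER DOTLESS I + COMBINING ACUTE ACCENT

# i + combining acute composes to precomposed U+00ED, and U+00ED
# decomposes back to it...
assert unicodedata.normalize("NFC", seq_i) == "\u00ed"
assert unicodedata.normalize("NFD", "\u00ed") == seq_i

# ...but dotless i + combining acute is left alone by normalization:
# the two sequences are canonically inequivalent despite rendering alike.
assert unicodedata.normalize("NFC", seq_dotless) == seq_dotless
```

So a shaper that maps both base characters to a single dotlessi glyph merges canonically inequivalent inputs, which is what makes naive glyph-to-character recovery lossy here.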
From dr.khaled.hosny at gmail.com Mon Aug 17 18:33:52 2020 From: dr.khaled.hosny at gmail.com (Khaled Hosny) Date: Tue, 18 Aug 2020 01:33:52 +0200 Subject: Dedotted I and dotlessi In-Reply-To: <20200817231550.21c17200@JRWUBU2> References: <20200817093747.0a89417a@JRWUBU2> <20200817231550.21c17200@JRWUBU2> Message-ID: > On Aug 18, 2020, at 12:15 AM, Richard Wordingham via Unicode wrote: > > On Mon, 17 Aug 2020 20:53:01 +0200 > Khaled Hosny via Unicode wrote: > >> Easier said than done. Even for tools that want to do this, the only >> reliable way is tagging with /ActualText, but this has to be done per >> grapheme cluster as PDF viewers can't select or highlight parts of >> text tagged with /ActualText, so Arabic is excluded since PDF stores >> glyphs in visual order and you don't want to tag full paragraphs. > > That's a nasty bug. Has it been established that negative > (advance) widths are "inconsistent" with TrueType and CFF fonts? I would have > said that a PDF width of -573 was entirely consistent with a TrueType > width of 573. It is possible to store glyphs in logical order and adjust their positions so they appear in visual order, but this all breaks in PDF readers that expect Arabic to be in visual order (since this is what almost all PDF creators do) and try to reverse the Arabic string again to get the logical string (which is not always reliable since there is no standard reverse BiDi algorithm). >> In >> case of reordering, you will also need to tag the whole reordered >> sequence as one unit since you can't tell which glyphs belong to >> which character any more. People will also complain about increased >> file size, so you will have to do tagging selectively for cases that >> can't be handled in a different way. > > I don't know if it's due to another feature (or even merely a bug), but > I did notice that LibreOffice-exported PDFs swell enormously if one uses > PDF/A to make Indic text extractable.
This was with a series of > documents that were at least 90% English (in the Latin script). Zipping > was ineffective. LibreOffice does exactly the selective handling I described: unique one to one and many to one mappings use the font's /ToUnicode, everything else uses /ActualText tagging per cluster (HarfBuzz cluster which is not always the same as grapheme clusters). As it happens, I wrote that code in LibreOffice. From dr.khaled.hosny at gmail.com Mon Aug 17 18:36:34 2020 From: dr.khaled.hosny at gmail.com (Khaled Hosny) Date: Tue, 18 Aug 2020 01:36:34 +0200 Subject: Dedotted I and dotlessi In-Reply-To: <0cd79dc8-3ba5-5351-2044-eb6ce2325766@gmail.com> References: <20200817093747.0a89417a@JRWUBU2> <0cd79dc8-3ba5-5351-2044-eb6ce2325766@gmail.com> Message-ID: <52940223-4680-4058-B4DB-043CC15C9FFD@gmail.com> > On Aug 18, 2020, at 12:31 AM, James Kass via Unicode wrote: > > > > On 2020-08-17 6:53 PM, Khaled Hosny via Unicode wrote: >> In short, text extraction from PDF is a mess. > > Search engines such as Google index text from PDFs and offer PDF links in the search results. I wonder how Google handles Arabic (and other complex scripts) PDFs. Have they worked out some kind of method, or are such PDFs considered non-indexable? Maybe OCR from the display? I don't know what Google does, but the result is often just garbage: meaningless characters. What the tools whose code I have seen do is try to recognize runs of Arabic text and reverse the strings to get an approximation of the original logical text, which is completely lossy.
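The reversal heuristic described above can be sketched with a toy example (the strings and the run-matching regex are illustrative, not any particular tool's code):

```python
import re

# Hypothetical line, logical order: the word salaam, a space, the digits 123.
SEEN, LAM, ALEF, MEEM = "\u0633", "\u0644", "\u0627", "\u0645"
logical = SEEN + LAM + ALEF + MEEM + " 123"

# Displayed in an RTL paragraph, the Arabic letters run right to left while
# the digit run stays left to right; read left to right, the visual line is
# the mirrored word preceded by the intact digits.
visual = "123 " + MEEM + ALEF + LAM + SEEN

# Heuristic 1: reverse the whole visual line -> the digit run comes out garbled.
assert visual[::-1] == SEEN + LAM + ALEF + MEEM + " 321"

# Heuristic 2: reverse only the Arabic runs in place -> letters recovered,
# but the runs themselves are still in visual, not logical, order.
approx = re.sub(r"[\u0600-\u06FF]+", lambda m: m.group(0)[::-1], visual)
assert approx == "123 " + SEEN + LAM + ALEF + MEEM

# Neither heuristic reproduces the logical string.
assert visual[::-1] != logical and approx != logical
```

Mixed digits, Latin runs, brackets, and combining marks all defeat simple reversal in different ways, which is why there is no standard "reverse BiDi".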
From dr.khaled.hosny at gmail.com Mon Aug 17 18:39:10 2020 From: dr.khaled.hosny at gmail.com (Khaled Hosny) Date: Tue, 18 Aug 2020 01:39:10 +0200 Subject: Dedotted I and dotlessi In-Reply-To: References: <20200817093747.0a89417a@JRWUBU2> <20200817231550.21c17200@JRWUBU2> Message-ID: > On Aug 18, 2020, at 1:33 AM, Khaled Hosny wrote: > > > >> On Aug 18, 2020, at 12:15 AM, Richard Wordingham via Unicode wrote: >> >> On Mon, 17 Aug 2020 20:53:01 +0200 Khaled Hosny via Unicode wrote: >> >>> In >>> case of reordering, you will also need to tag the whole reordered >>> sequence as one unit since you can't tell which glyphs belong to >>> which character any more. People will also complain about increased >>> file size, so you will have to do tagging selectively for cases that >>> can't be handled in a different way. >> >> I don't know if it's due to another feature (or even merely a bug), but >> I did notice that LibreOffice-exported PDFs swell enormously if one uses >> PDF/A to make Indic text extractable. This was with a series of >> documents that were at least 90% English (in the Latin script). Zipping >> was ineffective. > > LibreOffice does exactly the selective handling I described: unique one to one and many to one mappings use the font's /ToUnicode, everything else uses /ActualText tagging per cluster (HarfBuzz cluster which is not always the same as grapheme clusters). As it happens, I wrote that code in LibreOffice. The PDF/A issue is probably unrelated, since what I'm describing above happens with any PDF export profile.
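The per-cluster tagging described above can be pictured as a small helper that wraps a cluster's glyph-drawing operators in a marked-content sequence (a schematic sketch of the PDF operators involved, not LibreOffice's actual code; the helper name and the glyph-operator string are made up):

```python
def actual_text_span(unicode_text: str, glyph_ops: str) -> str:
    """Wrap content-stream glyph operators in /Span << /ActualText ... >> BDC ... EMC.

    PDF text strings carrying non-ASCII data are written as UTF-16BE with a
    leading byte-order mark; here the string is emitted in hex form.
    """
    payload = ("\ufeff" + unicode_text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{payload}> >> BDC\n{glyph_ops}\nEMC"

# A reordered cluster is tagged once as a whole, because after reordering
# individual glyphs no longer map back to single characters.
span = actual_text_span("\u0915\u093f", "<placeholder glyph-show operators>")
assert span.startswith("/Span << /ActualText <FEFF")
assert span.endswith("EMC")
```

Tagging every cluster this way is what inflates file size, hence the selective strategy: plain one-to-one (and many-to-one) mappings go through the font's /ToUnicode CMap, and /ActualText is reserved for the clusters that need it.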
From richard.wordingham at ntlworld.com Tue Aug 18 04:24:28 2020 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 18 Aug 2020 10:24:28 +0100 Subject: Dedotted I and dotlessi In-Reply-To: References: <20200817093747.0a89417a@JRWUBU2> <20200817231550.21c17200@JRWUBU2> Message-ID: <20200818102428.668d012e@JRWUBU2> On Tue, 18 Aug 2020 01:39:10 +0200 Khaled Hosny via Unicode wrote: > >> On Aug 18, 2020, at 12:15 AM, Richard Wordingham via Unicode > >> wrote: > >> I don't know if it's due to another feature (or even merely a > >> bug), but I did notice that LibreOffice-exported PDFs swell > >> enormously if one uses PDF/A to make Indic text extractable. This > >> was with a series of documents that were at least 90% English (in > >> the Latin script). Zipping was ineffective. > The PDF/A issue is probably unrelated, since what I'm describing > above happens with any PDF export profile. Indeed, it turns out that Indic text extraction had improved dramatically since I had last tried it out, and using PDF/A made no difference to lurking bugs. (In case it be relevant, I'm using HarfBuzz 1.2.7 as the system HarfBuzz library, which is the latest I can get on my Ubuntu 16.04.3 machine using the Debian build system on the Ubuntu distribution system. It's probably time to risk an upgrade.) Richard. From naa.ganesan at gmail.com Tue Aug 25 23:19:57 2020 From: naa.ganesan at gmail.com (N. Ganesan) Date: Tue, 25 Aug 2020 23:19:57 -0500 Subject: 8th century Nagari script in gold coin In-Reply-To: References: Message-ID: Namaste. Can you please read the script on this gold coin? It is from Bappa Rawal, the founder of Rajput kingdoms (Mewar) in Rajasthan, India. About 1250 years old & rare. In the market for $ 10-20 K. "Sri Moghara" ? or, "Sri Voppa" ? Thanks, N. Ganesan [image: WhatsApp Image 2020-08-24 at 11.21.22 PM.jpeg] -------------- next part -------------- An HTML attachment was scrubbed...
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: WhatsApp Image 2020-08-24 at 11.21.22 PM.jpeg Type: image/jpeg Size: 62179 bytes Desc: not available URL: