Joined "ti" coded as "O" in PDF
steve at swales.us
Fri May 6 10:49:09 CDT 2016
This discussion seems to have fizzled out, but I’m concerned that there’s a real-world problem here which is at least partially the concern of the consortium, so let me stir the pot and see if there’s still any meat left.
On the current release of MacOS (including the developer beta, for your reference, Peter), if you use the Calibri font, for example, in any app (e.g. Notes) to write words containing “ti” (like “internationalization”), then press “Print” and “Open PDF in Preview”, you get a PDF document with the joined “ti”. Subsequently cutting and pasting produces mojibake, and searching the document for words with “ti” doesn’t work, as previously noted.
I suppose we can look on this as purely a font handling/MacOS bug, but I’m wondering if we should be providing accommodations or conveniences in Unicode for it to work as desired.
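To make the failure mode concrete, here is a minimal sketch of why both search and copy/paste break. It assumes the cause is a wrong glyph-to-text mapping for the ligature glyph, which this thread has not confirmed. The glyph IDs, the two tables, and `extract_text` are all made up for illustration and are not taken from Calibri or from any real PDF library.

```python
# Hypothetical glyph IDs for a made-up font; a real font assigns its own IDs.
N, A, TI_LIGATURE, O = 1, 2, 3, 4

# Glyph-to-text table in which the ligature glyph correctly maps to two characters.
GOOD_MAP = {N: "n", A: "a", TI_LIGATURE: "ti", O: "o"}

# Same table, but with the ligature glyph mapped to a single wrong character
# ("O", as in the subject line). This is the assumed failure, not a confirmed one.
BAD_MAP = {**GOOD_MAP, TI_LIGATURE: "O"}


def extract_text(glyph_ids, glyph_to_text):
    """Rebuild text from glyph IDs; unmapped glyphs become U+FFFD."""
    return "".join(glyph_to_text.get(g, "\ufffd") for g in glyph_ids)


# "nation" as shaped by the font: n, a, ti-ligature, o, n
glyph_run = [N, A, TI_LIGATURE, O, N]

good_text = extract_text(glyph_run, GOOD_MAP)
bad_text = extract_text(glyph_run, BAD_MAP)

print(good_text, "ti" in good_text)  # nation True
print(bad_text, "ti" in bad_text)    # naOon False
```

The page looks identical in both cases, because the same glyphs are drawn; only the text recovered from them differs. PDF already has a mechanism for the first case: a ToUnicode CMap can map a single glyph to a sequence of several characters, so the ligature does not need a code point of its own.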
> On Mar 21, 2016, at 1:40 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> Are those PDFs supposed to be searchable inside? For archival purposes, the PDFs are stored in their final form, and search is performed by creating a database of descriptive metadata. Each time one wants formal details, they have to read the original the way it was presented (many PDFs are just scanned facsimiles of old documents which originally were not even in digital plain text: they were printed or typewritten, and frequently they include graphics, handwritten signatures, stamped seals...)
> Being able to search plain text inside a PDF is not the main objective (and not the priority). Archival, however, is a top priority (and there's no money to finance digitization and no human resources available to redo this old work; if needed, other contributors will recreate a plain-text version, possibly with rich-text features, e.g. in Wikisource for old documents that fall in the public domain).
> PDF/A-1a is meant only for creating new documents from an original plain-text or rich-text document created with modern word-processing applications. But this specification will frequently have to be broken if there's a need to include handwritten or supplementary elements (signatures, seals...) whose source is not the original electronic document but the printed paper on which the annotations were made: it is this paper document, not the electronic document, which is the official final source (we've got some important legal papers whose originals have other marks, including traces of beer or coffee, or are partly burnt; the paper itself has several alterations, but it is the original "as is", and for legal purposes the only acceptable archival form as a PDF must ignore all the PDF/A-1a constraints, which are not meant to represent originals accurately).
> 2016-03-20 20:52 GMT+01:00 Tom Gewecke <tom at bluesky.org>:
> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) <asmus-inc at ix.netcom.com> wrote:
> > Usually, the archive feature pertains only to the fact that you can reproduce the final form, not to being able to get at the correct source (plain text backbone) for the document.
> My understanding is that PDF/A-1a is supposed to be searchable.