Joined "ti" coded as "O" in PDF

Philippe Verdy verdy_p at wanadoo.fr
Mon Mar 21 03:40:15 CDT 2016


Are those PDFs supposed to be searchable internally? For archival
purposes, PDFs are stored in their final form, and search is performed by
building a database of descriptive metadata. Whenever someone needs the
formal details, they have to read the original as it was presented (many
PDFs are just scanned facsimiles of old documents that were never in
digital plain text to begin with: they were printed or typewritten, and
they frequently include graphics, handwritten signatures, stamped
seals...).

Being able to search plain text inside a PDF is not the main objective (and
not the priority). Archival, however, is a top priority. There is no money
to finance digitization and no staff available to redo this old work; if
needed, other contributors will recreate a plain-text version, possibly
with rich-text features, e.g. in Wikisource for old documents that have
fallen into the public domain.

PDF/A-1a is meant only for creating new documents from an original
plain-text or rich-text document produced with a modern word-processing
application. But this specification will frequently have to be broken when
handwritten or supplementary elements (signatures, seals...) must be
included, because their source is not the original electronic document but
the printed paper on which the annotations were made. It is this paper
document, not the electronic one, that is the official final source. We
have some important legal papers whose originals carry other marks,
including traces of beer or coffee, or are partly burnt; the paper itself
has several alterations, but it is the original "as is". For legal
purposes, the only acceptable archival form as a PDF must ignore all the
PDF/A-1a constraints, which are not meant to represent originals
accurately.

2016-03-20 20:52 GMT+01:00 Tom Gewecke <tom at bluesky.org>:

>
> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) <asmus-inc at ix.netcom.com>
> wrote:
> >
> > Usually, the archive feature pertains only to the fact that you can
> reproduce the final form, not to being able to get at the correct source
> (plain text backbone) for the document.
>
> My understanding is that PDF/A-1a is supposed to be searchable.