Joined "ti" coded as "O" in PDF

Sun May 8 08:35:15 CDT 2016

2016-05-08 14:42 GMT+02:00 Don Osborn <dzo at bisharat.net>:

> Some earlier posts in this thread made the observation that PDF is for
> presentation not archiving.
>
I tend to disagree. PDFs are widely used for archiving, and for that purpose
it does not matter how they were generated: a PDF is only meant to be a
facsimile, possibly with the same value as the original (printed) paper. The
initial digital file is just a working draft with no legal value in most
cases.

That's why PDF files can contain a digital signature, to give them the
same value as the original paper. The initial digital draft has no value,
even if it is easier to search.

Many (many!) laws and treaties in the world are kept only as PDFs, not all
of them searchable as plain text unless there has been some OCR (often
followed by manual correction). The original papers (which have legal
value) are kept in museums or official national libraries and are no longer
freely accessible to the public; that's why facsimile PDFs are created, to
make them accessible (possibly digitally signed by the official library or
some national authority).

Lots of organisations archive their legal papers only as PDFs and recycle
the original paper. This is authorized by national laws, provided they
insert a verifiable signature certifying the date. No alteration of the
content is then authorized, as these PDFs become the new originals. The
only exceptions are adding new digital signatures, or possibly dropping
some of them other than the initial dated one. The security of that initial
signature may have weakened over time, in which case the legitimate right
holder needs to add new, stronger signatures; the history of signatures is
kept.

Being able to search in a PDF is a distinct goal, not directly related to
archiving; it matters when PDFs are used in isolation as *working*
documents. For archives, however, the ability to search may be provided by
separate data (with no legal standing) stored in the archive index,
alongside the unaltered (and legally valid) PDF.
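
A minimal sketch of that arrangement, in Python, under my own assumptions:
the function names and the dictionary layout below are hypothetical, the
searchable text is assumed to come from some earlier OCR or extraction
step, and the SHA-256 digest only ties an index entry to the exact bytes of
the PDF. It is not a digital signature and carries no legal value.

```python
import hashlib


def make_index_entry(pdf_bytes: bytes, extracted_text: str) -> dict:
    """Build a search record stored beside the PDF, never inside it.

    Hypothetical layout: the digest identifies the exact archived bytes,
    and the text is auxiliary search data with no legal standing.
    """
    return {
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "text": extracted_text,
    }


def matches(entry: dict, query: str) -> bool:
    """Case-insensitive substring search over the auxiliary text only."""
    return query.lower() in entry["text"].lower()


def is_unaltered(entry: dict, pdf_bytes: bytes) -> bool:
    """Check that the archived PDF still has the bytes that were indexed."""
    return hashlib.sha256(pdf_bytes).hexdigest() == entry["sha256"]
```

The point of the split is that the index can be rebuilt or corrected at any
time (better OCR, human proofreading) without ever touching the archived
file.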

PDFs are not meant to be used for presentation either (there are much better
ways to present the content and *adapt* it to the audience or presentation
medium). But presentation is also a different goal from being able to
search in the document. A PDF is just a collection of rendered pages
(possibly at a limited resolution, where rendered characters may be a bit
fuzzy or some non-meaningful color distinctions may be deliberately
dropped), to be used "as is" and read by human eyes (even allowing an
accurate OCR is not a goal of this format).

When producing the PDF, the human editor may choose to reduce the
resolution, the color space and so on, if this helps reduce the storage
size and eases archiving, or helps protect the author's rights.

E.g. there are different PDF versions for free online editions of
newspapers, where text may be too fuzzy to be read. There are also versions
for subscribers with much better quality (and possibly fewer ads), kept in
archives if needed, but still not really meant to be searchable as plain
text. In fact the producer may want to limit searchability so that readers
have to look at the pages directly and see the embedded advertising boxes,
even if these are not directly related to what is being searched for. The
producer may provide only a limited plain-text index for some headings, but
not for the content itself: readers have to scan it visually, so that they
cannot completely ignore the surrounding context.

The producer of the PDF then chooses among these options, depending on its
goals for the document. For legal use, there are some requirements to
follow, but these do not (most often) include the need to perform plain-text
search. This may mean that some OCR or human work will be needed later in
order to index the document, but this operation may be limited by authors'
rights, and users bear their own responsibility if they make a false
interpretation when using only automated tools. PDFs are meant to be read
and interpreted by humans, not machines.