Joined "ti" coded as "O" in PDF

Andrew Cunningham lang.support at gmail.com
Fri May 6 16:22:16 CDT 2016


My understanding is that searchability comes down to two factors:

1) the ToUnicode mapping, which maps glyphs in the font or subsetted
font to Unicode codepoints. Mappings take the form of one glyph to one
codepoint or one glyph to two or more codepoints.

Obviously any glyph that doesn't resolve by default to a codepoint isn't in
the mapping, nor does the mapping handle glyphs that have been visually
reordered during rendering.

2) the next step is to tag the PDF then use the ActualText label of each
tag.

So for some languages, with the right fonts, step one is all that is needed,
and this is fairly standard in PDF generation tools. The font itself can
affect this, of course.

But for other languages you need to go to the second step.

With the languages I work with, I might have some PDFs that just require the
visible text layer. Others will have a visible text layer plus ActualText.
For the PDF to be searchable, the search tools not only need to be able to
handle the text layer but also ActualText attributes when necessary.

And that all comes down to decisions the tool developer has taken on how to
handle searching when both visible text layers and ActualText labels are
present.
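
To make that developer decision concrete, here is a minimal Python sketch of
one possible policy: prefer ActualText where a span carries it, and fall back
to the visible text layer otherwise. The Span class and the
prefer_actual_text flag are hypothetical names made up for this illustration;
they do not correspond to any particular PDF library or viewer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One run of page content (hypothetical structure for illustration)."""
    visible: str                       # text recovered from the visible text layer
    actual_text: Optional[str] = None  # ActualText replacement, if the span is tagged


def searchable_text(spans, prefer_actual_text=True):
    """Build the string a search tool would index.

    With prefer_actual_text=True, a span's ActualText replaces its visible
    text; with False, ActualText is ignored entirely.
    """
    parts = []
    for span in spans:
        if prefer_actual_text and span.actual_text is not None:
            parts.append(span.actual_text)
        else:
            parts.append(span.visible)
    return "".join(parts)


# "nation" where the ligature glyph extracts as "O" but is tagged with "ti".
spans = [Span("na"), Span("O", actual_text="ti"), Span("on")]

with_actual = searchable_text(spans)
without = searchable_text(spans, prefer_actual_text=False)

print(with_actual)
print(without)
```

Two tools applying the two policies to the same file would give different
search results, which is why the search tool itself needs to be evaluated.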

I have been told on accessibility lists that the PDF specs leave that
implementation detail to the developer, based on their requirements.

So in some cases you need to go the extra step and add ActualText. But you
also need to evaluate your search tools to ensure they do what you expect.

Andrew



On Saturday, 7 May 2016, Steve Swales <steve at swales.us> wrote:
> This discussion seems to have fizzled out, but I’m concerned that there’s
> a real world problem here which is at least partially the concern of the
> consortium, so let me stir the pot and see if there’s still any meat left.
> On the current release of MacOS (including the developer beta, for your
> reference, Peter), if you use Calibri font, for example, in any app (e.g.
> notes), to write words with “ti” (like internationalization), then press
> “Print” and “Open PDF in Preview”, you get a PDF document with the joined
> “ti”.  Subsequently cutting and pasting produces mojibake, and searching
> the document for words with “ti” doesn’t work, as previously noted.
> I suppose we can look on this as purely a font handling/MacOS bug, but
> I’m wondering if we should be providing accommodations or conveniences in
> Unicode for it to work as desired.
> -steve
>
>
> On Mar 21, 2016, at 1:40 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> Are those PDFs supposed to be searchable inside of them? For archival
> purpose, the PDFs are stored in their final form, and search is performed by
> creating a database of descriptive metadata. Each time one wants formal
> details, they have to read the original the way it was presented (many PDFs
> are just scanned facsimiles of old documents which originally were not
> even in numeric plain-text, they were printed or typewritten, frequently
> they include graphics, handwritten signatures, stamped seals...)
> Being able to search plain-text inside a PDF is not the main objective
> (and not the priority). The archival however is a top priority (and there's
> no money to finance a numerisation and no human resource available to redo
> this old work, if needed other contributors will recreate a plain-text
> version, possibly with rich-text features, e.g. in Wikisource for old
> documents that fall in the public domain).
> PDF/A-1a is meant only for creating new documents from an original
> plain-text or rich-text document created with modern word-processing
> applications. But this specification will frequently have to be broken, if
> there's the need to include handwritten or supplementary elements
> (signatures, seals...) whose source is not the original electronic document
> but the printed paper over which the annotations were made: it is this
> paper document, not the electronic document which is the official final
> source (we've got some important legal paper whose original has other marks
> including traces of beer or coffee, or partly burnt, the paper itself has
> several alterations, but it is the original "as is", and for legal purpose
> the only acceptable archival form as a PDF must ignore all the PDF/A-1a
> constraints, not meant to represent originals accurately).
> 2016-03-20 20:52 GMT+01:00 Tom Gewecke <tom at bluesky.org>:
>>
>> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) <asmus-inc at ix.netcom.com> wrote:
>> >
>> > Usually, the archive feature pertains only to the fact that you can
>> > reproduce the final form, not to being able to get at the correct source
>> > (plain text backbone) for the document.
>>
>> My understanding is that PDF/A-1a is supposed to be searchable.
>>
>>
>>
>
>
>

-- 
Andrew Cunningham
lang.support at gmail.com