Re: Joined "ti" coded as "Ɵ" in PDF

Thu Mar 17 18:34:04 CDT 2016

There are a few things going on.

In the first instance, it may be the font itself that is the source of the
problem.

My understanding is that PDF files contain a sequence of glyphs rather than
characters. A PDF file may also contain a ToUnicode mapping between glyphs
and codepoints. This is either a 1-1 mapping or a 1-many mapping. The
1-many mapping provides support for ligatures and variation sequences.
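
As a rough illustration of the 1-1 versus 1-many idea, here is a small
Python sketch. The glyph IDs and the table are made up for the example;
this is not the actual ToUnicode CMap syntax, only a model of what the
mapping does when text is extracted.

```python
# Hypothetical ToUnicode-style table: glyph ID -> Unicode string.
# The glyph IDs are invented for illustration only.
TO_UNICODE = {
    10: "n",
    11: "a",
    12: "o",
    200: "ti",  # ligature glyph: one glyph, two code points (1-many)
}


def extract_text(glyph_ids, to_unicode):
    """Map each glyph ID to text; unmapped glyphs become U+FFFD."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)


glyphs = [10, 11, 200, 12, 10]  # n, a, ti-ligature, o, n

# With a proper 1-many entry, the ligature extracts as "ti".
good = extract_text(glyphs, TO_UNICODE)

# Same glyph run, but with the ligature mapped to a private-use
# code point instead (the situation described in this thread).
broken_map = dict(TO_UNICODE)
broken_map[200] = "\U0010019F"
bad = extract_text(glyphs, broken_map)

print(good)
print(bad == "nation")
```

With the first table a search for "nation" would match the extracted
text; with the second it would not, even though both render identically.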

I assume it uses the data in the font's cmap table. If the ligature isn't
mapped then you will have problems. I guess the problem could be either the
font or the font subsetting and embedding performed when the PDF is
generated.

Although, it is worth noting that in OpenType fonts not all glyphs will
have mappings in the cmap table.

The remedy is to extensively tag the PDF and add ActualText attributes to
the tags.

But the PDF specs leave it up to the developer to decide what happens if
there is both a visible text layer and ActualText. So even in an ideal PDF,
results will vary from software to software when copying text or searching
a PDF.

At least that's my current understanding.
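
As a side note, the 16-bit truncation Doug describes in the quoted message
below is easy to check. This is only a sketch of the arithmetic on the code
point value; I am assuming nothing about how the clipboard or the editor
actually does it.

```python
import unicodedata

pua = 0x10019F            # code point Doug found for the "ti" ligature
truncated = pua & 0xFFFF  # keep only the low 16 bits

# U+10019F lies in the Plane 16 private-use range U+100000..U+10FFFD.
is_plane16_pua = 0x100000 <= pua <= 0x10FFFD

# Name of the character that the truncated value lands on.
name = unicodedata.name(chr(truncated))

print(f"U+{truncated:04X} {name}")
```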

Andrew
On 18 Mar 2016 7:47 am, "Don Osborn" <dzo at bisharat.net> wrote:

> Thanks all for the feedback.
>
> Doug, It may well be my clipboard (running Windows 7 on this particular
> laptop). Get same results pasting into Word and EmEditor.
>
> So, when I did a web search on "internaƟonal," as previously mentioned,
> and came up with a lot of results (mostly PDFs), were those also a
> consequence of many not fully Unicode-compliant conversions by others?
>
> A web search on what you came up with - "Interna��onal" - yielded many
> more (82k+) results, again mostly PDFs, with terms like "interna onal"
> (such as what Steve noted) and "interna<onal" and perhaps others (given the
> nature of, or how Google interprets, the private use character?).
>
> Searching within the PDF document already mentioned, "international" comes
> up with nothing (which is a major fail as far as usability). Searching the
> PDF in a Firefox browser window, only "internaƟonal" finds the occurrences
> of what displays as "international." However after downloading the document
> and searching it in Acrobat, only a search for "interna��onal" will find
> what displays as "international."
>
> A separate web search on "Eīects" came up with 300+ results, including
> some GoogleBooks which in the texts display "effects" (as far as I
> checked). So this is not limited to Adobe?
>
> Jörg, With regard to "Identity H," a quick search gives the impression
> that this encoding has had a fairly wide and not so happy impact, even if
> on the surface level it may have facilitated display in a particular style
> of font in ways that no one complains about.
>
> Altogether a mess, from my limited encounter with it. There must have been
> a good reason for or saving grace of this solution?
>
> Don
>
> On 3/17/2016 2:17 PM, Steve Swales wrote:
>
>> Yes, it seems like your mileage varies with the PDF
>> viewer/interpreter/converter.  Text copied from Preview on the Mac replaces
>> the ti ligature with a space.  Certainly not a Unicode problem, per se, but
>> an interesting problem nevertheless.
>>
>> -steve
>>
>> On Mar 17, 2016, at 11:11 AM, Doug Ewell <doug at ewellic.org> wrote:
>>>
>>> Don Osborn wrote:
>>>
>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in
>>>> the (English) text of the document at
>>>>
>>>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>>>> is coded as "Ɵ". Looking more closely at the original text, it does
>>>> appear that the glyph is a "ti" ligature (which afaik is not coded as
>>>> such in Unicode).
>>>>
>>> When I copy and paste the PDF text in question into BabelPad, I get:
>>>
>>> Interna��onal Order and the Distribu��on of Iden��ty in 1950 (By
>>>> invita��on only)
>>>>
>>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use
>>> character.
>>>
>>> Truncating this character to 16 bits, which is a Bad Thing™, yields
>>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either
>>> Don's clipboard or the editor he pasted it into is not fully
>>> Unicode-compliant.
>>>
>>> Don's point about using alternative characters to implement ligatures,
>>> thereby messing up web searches, remains valid.
>>>
>>> --
>>> Doug Ewell | http://ewellic.org | Thornton, CO
>>>
>>>
>>>
>>
>