Re: Joined "ti" coded as "Ɵ" in PDF

Don Osborn dzo at bisharat.net
Thu Mar 17 12:45:34 CDT 2016


Thanks Leonardo, that is my initial observation. And it has implications 
for web searches.

And there's more. Apparently this is one of a number of such 
substitutions, which taken together begin to look like the old 
pre-Unicode hacks of 8-bit fonts. I found some of them via web 
search in a number of Google Books results and pages on issuu.com. 
This is evidently some kind of font issue, not random assignments. From 
the same document:

ff ligature = ī
fl ligature = Ň
ft ligature = Ō
tt ligature = Ʃ

And perhaps others. Seems to defeat the intent of Unicode, as these 
documents and pages will not come up in typical web search on the normal 
spellings (unless maybe Google is incorporating an algorithm to include 
results for say "internaƟonal" in a search on the term "international"?).
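
As a rough illustration of what such an algorithm might do, here is a 
minimal Python sketch that maps the substitutions listed above back to 
their normal spellings. The table and the function name 
repair_ligatures are hypothetical, and the table covers only the five 
characters observed in this one document. It is not a claim about how 
Google or any PDF tool actually handles this.

```python
# Hypothetical repair table, built only from the substitutions
# reported in this thread; other affected fonts may use other mappings.
LIGATURE_REPAIRS = {
    "\u019F": "ti",  # Ɵ  LATIN CAPITAL LETTER O WITH MIDDLE TILDE
    "\u012B": "ff",  # ī  LATIN SMALL LETTER I WITH MACRON
    "\u0147": "fl",  # Ň  LATIN CAPITAL LETTER N WITH CARON
    "\u014C": "ft",  # Ō  LATIN CAPITAL LETTER O WITH MACRON
    "\u01A9": "tt",  # Ʃ  LATIN CAPITAL LETTER ESH
}

# str.maketrans accepts a dict whose keys are single characters and
# whose values are replacement strings of any length.
_REPAIR_TABLE = str.maketrans(LIGATURE_REPAIRS)


def repair_ligatures(text: str) -> str:
    """Replace each mis-coded character with the letters it displays as."""
    return text.translate(_REPAIR_TABLE)


if __name__ == "__main__":
    print(repair_ligatures("interna\u019Fonal"))
```

Note that all five characters are legitimate letters in their own 
right, so a blanket replacement like this would damage text that uses 
them on purpose. It would only be safe on text already known to come 
from an affected PDF.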

Don


On 3/17/2016 1:37 PM, Leonardo Boiko wrote:
> The PDF *displays* correctly.  But try copying the string 'ti' from
> the text to another application outside of your PDF viewer, and you'll
> see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don
> Osborn said.
>
>
> 2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi <olopierpa at gmail.com>:
>> That document displays correctly for me using both the pdf viewer
>> built into chrome and the standalone Acrobat reader v.11.  The problem
>> could be in your PDF viewer?  What are you viewing the document with?
>>
>> On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn <dzo at bisharat.net> wrote:
>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in the
>>> (English) text of the document at
>>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>>> is coded as "Ɵ". Looking more closely at the original text, it does appear
>>> that the glyph is a "ti" ligature (which afaik is not coded as such in
>>> Unicode).
>>>
>>> Out of curiosity, did a web search on "internaƟonal" and got over 11k hits,
>>> apparently all PDFs.
>>>
>>> Anyone have any idea what's going on? Am assuming this is not a deliberate
>>> choice by diverse people creating PDFs and wanting "ti" ligatures for
>>> stylistic reasons. Note the document linked above is current, so this is not
>>> (just) an issue with older documents.
>>>
>>> Don Osborn


