Dedotted I and dotlessi

James Kass jameskasskrv at
Mon Aug 17 17:31:53 CDT 2020

On 2020-08-17 6:53 PM, Khaled Hosny via Unicode wrote:
> In short, text extraction from PDF is a mess.

Search engines such as Google index text from PDFs and offer PDF links 
in the search results.  I wonder how Google handles PDFs in Arabic and 
other complex scripts.  Have they worked out some kind of method, or are 
such PDFs considered non-indexable?  Maybe they OCR the rendered display?
