Dedotted I and dotlessi

Mon Aug 17 18:36:34 CDT 2020

> On Aug 18, 2020, at 12:31 AM, James Kass via Unicode <unicode at unicode.org> wrote:
> 
> 
> 
> On 2020-08-17 6:53 PM, Khaled Hosny via Unicode wrote:
>> In short, text extraction from PDF is a mess.
> 
> Search engines such as Google index text from PDFs and offer PDF links in the search results.  I wonder how Google handles Arabic (and other complex scripts) PDFs.  Have they worked out some kind of method, or are such PDFs considered non-indexable?  Maybe OCR from the display?

I don’t know what Google does, but the result is often just a mess of meaningless characters. The tools whose code I have seen try to recognize runs of Arabic text and reverse them, to get an approximation of the original logical-order text, which is completely lost.
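
A minimal sketch of the kind of run-reversal heuristic described above, not the code of any particular tool. It assumes the extracted text arrives in visual (left-to-right glyph) order, and that an "Arabic run" is any stretch of characters from the Arabic, Arabic Supplement, and Arabic Presentation Forms blocks, with interior spaces included. The names `ARABIC_RUN` and `reverse_arabic_runs` are hypothetical.

```python
import re

# Assumption: these ranges are what counts as "Arabic" for this sketch
# (Arabic, Arabic Supplement, Presentation Forms-A and Forms-B).
_ARABIC = "\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFC"

# A run starts and ends on an Arabic character and may contain spaces,
# so that word order within the run is reversed along with the letters.
ARABIC_RUN = re.compile(f"[{_ARABIC}](?:[{_ARABIC} ]*[{_ARABIC}])?")


def reverse_arabic_runs(visual_text: str) -> str:
    """Reverse each Arabic run in place; leave everything else untouched."""
    return ARABIC_RUN.sub(lambda m: m.group(0)[::-1], visual_text)


if __name__ == "__main__":
    # Visual order as it might come out of a PDF: the letters of the
    # word U+0633 U+0644 U+0627 U+0645 stored back to front.
    extracted = "abc \u0645\u0627\u0644\u0633 def"
    print(reverse_arabic_runs(extracted))
```

This only approximates logical order: it does not restore ligatures or presentation forms to their base letters, and digits or Latin text embedded in an Arabic line split the run and end up in the wrong relative position.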