Dedotted I and dotlessi
dr.khaled.hosny at gmail.com
Mon Aug 17 18:36:34 CDT 2020
> On Aug 18, 2020, at 12:31 AM, James Kass via Unicode <unicode at unicode.org> wrote:
> On 2020-08-17 6:53 PM, Khaled Hosny via Unicode wrote:
>> In short, text extraction from PDF is a mess.
> Search engines such as Google index text from PDFs and offer PDF links in the search results. I wonder how Google handles Arabic (and other complex scripts) PDFs. Have they worked out some kind of method, or are such PDFs considered non-indexable? Maybe OCR from the display?
I don’t know what Google does, but the result is often just a jumble of meaningless characters. The tools whose code I have seen try to recognize runs of Arabic text and reverse the strings to get an approximation of the original logical text, which is completely lossy.
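
To make the run-reversal heuristic concrete, here is a minimal sketch in Python. It is not taken from any particular tool: the function name, the character ranges, and the choice to apply NFKC normalization afterwards (to fold presentation forms and ligatures back to base letters) are all assumptions for illustration. It assumes the extracted text arrives in visual (left-to-right) order.

```python
import re
import unicodedata

# Assumed character class: the main Arabic blocks plus the Arabic
# presentation forms. This is illustrative, not an exhaustive script test.
_ARABIC = r"\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF"

# A run is one or more Arabic words, including the whitespace between them,
# so that word order is flipped together with letter order.
_ARABIC_RUN = re.compile(rf"[{_ARABIC}]+(?:\s+[{_ARABIC}]+)*")


def reverse_arabic_runs(visual_text: str) -> str:
    """Approximate logical order by reversing each run of Arabic text.

    Hypothetical helper: non-Arabic text is left untouched.
    """
    def fix(match: re.Match) -> str:
        # Reverse first, then normalize: a ligature such as U+FEFB is a
        # single code point in visual order, and NFKC expands it to
        # lam + alef, which is already the logical order.
        logical = match.group(0)[::-1]
        return unicodedata.normalize("NFKC", logical)

    return _ARABIC_RUN.sub(fix, visual_text)
```

Even in this toy form the limits are visible: digits and Latin text embedded in an Arabic run split it into separate runs, combining marks end up before their base letters after reversal, and nothing recovers the original choice between a ligature and its component letters.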