Breaking barriers

James Kass jameskass at code2001.com
Fri Oct 22 23:31:02 CDT 2021


On 2021-10-22 9:04 PM, David Starner via Unicode wrote:
> "as long as the source display is correctly enabled and the
> translation software handles the source language(s)." So in no
> interesting cases. Project Gutenberg had a Swedish bible translation
> in an unknown encoding (a variant of the DOS encoding that doesn't
> seem to have corresponded to anything documented); getting it to
> display correctly was basically the same challenge as translating it
> to Unicode, which was eventually done by figuring out what the unknown
> codepoints (obviously quotes) must have been. The set of languages in
> PUA and that have reliable transcription and translation is going to
> be virtually empty, and if you care about correctness and you have the
> font, directly convert the encoding.
Yes, it's best to directly convert old source data when it's feasible.

When the source data is in pre-Unicode Indic languages/scripts (or even 
in pre-shaping-support Unicode), this often cannot be done simply.  If 
you know the font and can find a cross-reference table, then you're off 
to a good start.  If you can't find an existing cross-reference and have 
to "roll your own", it's not as fun as it sounds.  Some legacy fonts 
combine a standard encoding with PUA code points for presentation forms, 
while others use ISO-8859 hacks.  A presentation form might be covered 
by a dedicated glyph in one font, yet the same presentation form might 
be constructed from two or three component glyphs in another.  And, 
crucially, even after you've set up the basic cross-reference table, 
there's still the reordering to deal with.  (Pre-Unicode Indic text was 
of necessity entered in visual order, as was pre-shaping Unicode Indic.)
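
Below is a minimal sketch of that conversion step, in Python.  Everything 
in it is invented for illustration: the byte-to-Unicode table stands in 
for a real cross-reference table, and the only reordering rule shown is 
moving a pre-base i-matra from visual order back to its logical position 
after the consonant.  A real legacy font needs a much larger table (PUA 
slots, presentation forms split across component glyphs) and many more 
reordering rules (conjuncts, repha, and so on).

# Hypothetical single-byte legacy font: a few Devanagari letters plus a
# short i-matra stored *before* the consonant it follows logically.
LEGACY_TO_UNICODE = {
    0xA1: "\u0915",   # KA
    0xA2: "\u0930",   # RA
    0xA3: "\u0924",   # TA
    0xB1: "\u093F",   # vowel sign I (pre-base, visually first)
    0x20: " ",
}

MATRA_I = "\u093F"

def decode_legacy(data: bytes) -> str:
    """Map legacy byte values to Unicode code points (no reordering yet)."""
    return "".join(LEGACY_TO_UNICODE.get(b, "\uFFFD") for b in data)

def reorder_visual_to_logical(text: str) -> str:
    """Move each pre-base i-matra after the consonant it precedes visually.

    Legacy/visual order stores the matra before the consonant; Unicode
    logical order wants the consonant first.  Only the simplest case is
    handled here.
    """
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] == MATRA_I:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

raw = bytes([0xB1, 0xA1, 0x20, 0xA2, 0xA3])   # visual order: i-matra first
print(reorder_visual_to_logical(decode_legacy(raw)))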

Instead of going through all that rigmarole, most users would probably 
prefer to just take a picture of the text with their phone and be done 
with it.  And if the source data is a PDF, in a perfect world the PDF 
file could be dragged and dropped directly into the app, which would 
then prompt the user to choose whether the source should be processed 
as text or as a graphic.
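
A rough sketch of that "text or graphic" decision, assuming PyMuPDF (the 
fitz module) purely as an example library; nothing here depends on any 
particular toolkit, and the emptiness check is just a naive heuristic.

# Decide whether a dropped-in PDF has an extractable text layer or
# needs to be rasterized for OCR.
import fitz  # pip install PyMuPDF

def extract_or_rasterize(pdf_path: str):
    """Return ('text', str) if the PDF has an extractable text layer,
    otherwise ('images', list of page bitmaps) for downstream OCR."""
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    if text.strip():
        # Text layer present -- though for legacy-font PDFs it may still
        # be in a hacked encoding, so prompting the user makes sense.
        return "text", text
    # No usable text layer: render each page to a bitmap for OCR.
    zoom = fitz.Matrix(300 / 72, 300 / 72)  # roughly 300 dpi
    return "images", [page.get_pixmap(matrix=zoom) for page in doc]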

I don't know enough about the current state of OCR to evaluate the 
challenge of training software to recognize unsupported scripts.  An 
open source OCR system like Tesseract may already be set up for the 
common Indic scripts, and since it's crowd-sourced, the training process 
might eventually be eased or simplified, if it hasn't been already.
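
If Tesseract and one of its Indic language packs are already installed, 
checking what it can do takes only a few lines.  The pytesseract wrapper 
below and the 'hin' (Devanagari/Hindi) pack are assumptions on my part, 
and the filename is a placeholder.

import pytesseract
from PIL import Image

print(pytesseract.get_languages())           # list installed language packs
page = Image.open("scanned_page.png")        # placeholder filename
print(pytesseract.image_to_string(page, lang="hin"))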
