<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <div class="moz-cite-prefix">On 10/22/2021 9:31 PM, James Kass via

      Unicode wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:eb0a1cda-6039-ec95-9e83-b425b368eb41@code2001.com">

      <br>

      On 2021-10-22 9:04 PM, David Starner via Unicode wrote:

      <br>

      <blockquote type="cite">"as long as the source display is

        correctly enabled and the

        <br>

        translation software handles the source language(s)." So in no

        <br>

        interesting cases. Project Gutenberg had a Swedish bible

        translation

        <br>

        in an unknown encoding (a variant of the DOS encoding that

        doesn't

        <br>

        seem to have corresponded to anything documented); getting it to

        <br>

        display correctly was basically the same challenge as

        translating it

        <br>

        to Unicode, which was eventually done by figuring out what the

        unknown

        <br>

        codepoints (obviously quotes) must have been. The set of

        languages in

        <br>

        PUA and that have reliable transcription and translation is

        going to

        <br>

        be virtually empty, and if you care about correctness and you

        have the

        <br>

        font, directly convert the encoding.

        <br>

      </blockquote>

      Yes, it's best to directly convert old source data when it's

      feasible.

      <br>

      <br>

      When the source data is in pre-Unicode Indic languages/scripts (or

      even in pre-shaping support Unicode), this can often not be

      accomplished simply.  If you know the font and can find a

      cross-reference table, then you're off to a good start.  If you

      can't find an existing cross-reference and have to "roll your

      own", it's not as fun as it sounds.  Some legacy fonts combine

      standard encoding with PUA for presentation forms, others use

      ISO-8859-hacks.  Any presentation form might be covered with a

      dedicated glyph in one font, yet the same presentation form might

      be constructed from two or three component glyphs in other fonts. 

      And, crucially, even after you've set up the basic cross-reference

      table, there's still reordering which must be accomplished. 

      (Pre-Unicode Indic was of necessity entered in visual order.  Same

      for pre-shaping Unicode Indic.)

      <br>

      <br>

      Instead of going through all that rigamarole, most users would

      probably prefer to just take a picture of the text with their

      phone and be done with it.  And if the source data is PDF, in a

      perfect world the PDF file could be dragged and dropped directly

      into the app, which would then prompt the user to choose whether

      the source should be processed as text or graphic.

      <br>

      <br>

      I don't know enough about the current state of OCR to evaluate the

      challenge of training software to recognize unsupported scripts. 

      An open source OCR system like Tesseract may already be set-up for

      the common Indic scripts, and since it's crowd-sourced might

      eventually ease or simplify the training process, if it hasn't

      already.

      <br>

      <br>

      <br>

    </blockquote>

    <p><font face="Candara">If you have encoded data, and the encoded
        elements are the same but sit at different code positions, the
        task is reasonably straightforward for letters. Punctuation can
        be more challenging, because conventions vary even within the
        same language; however, there are fewer marks, so you can try
        them after you've cracked the text and see what looks good.<br>

      </font></p>

    <p><font face="Candara">If you know the language, simple frequency

        analysis (perhaps extended to pairs and triplets) should give

        you the transcoding table.</font></p>
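    <p><font face="Candara">A minimal sketch of that frequency-rank
        idea, assuming a single-byte encoding and a reference text in
        the same language. The names rank_table and transcode and the
        toy "shift by 0x60" encoding are made up for illustration; they
        are not from any real conversion tool.</font></p>
<pre>
```python
from collections import Counter

def rank_table(unknown, reference):
    """Pair each unknown code value with a reference character by frequency rank.

    Hypothetical helper: `unknown` is bytes in the undocumented encoding,
    `reference` is Unicode text in the same language.
    """
    codes = [code for code, _ in Counter(unknown).most_common()]
    chars = [char for char, _ in Counter(reference).most_common()]
    # zip() stops at the shorter list, so surplus symbols stay unmapped.
    return dict(zip(codes, chars))

def transcode(unknown, table):
    """Apply the guessed table; unmapped codes become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in unknown)

# Toy data: a made-up single-byte encoding that shifts each letter by 0x60.
reference = "aaaaabbbbcccdde"
unknown = bytes(ord(ch) + 0x60 for ch in "abcdeabcdabcaba")
table = rank_table(unknown, reference)
print(transcode(unknown, table))
```
</pre>
    <p><font face="Candara">On real text, letters of similar frequency
        will trade ranks, so a table built this way is only a first
        guess; pair and triplet counts are what would let you correct
        the swaps.</font></p>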

    <p><font face="Candara">If you don't know the language, you should
        be able to use the same data, but would need a way to represent
        it that abstracts from the code value, so you can compare this
        to data from some corpora using a different encoding.</font></p>
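    <p><font face="Candara">Two possible code-value-free
        representations, sketched under the assumption that one code
        corresponds to one character: the repetition pattern of a word,
        and a sorted frequency profile. Both function names are
        hypothetical, and these are only two of many representations
        one could choose.</font></p>
<pre>
```python
from collections import Counter

def pattern(seq):
    """Replace each symbol by the index of its first appearance.

    Two sequences that differ only by a one-to-one recoding share a pattern.
    """
    first_seen = {}
    return tuple(first_seen.setdefault(sym, len(first_seen)) for sym in seq)

def profile(seq, top=10):
    """Relative frequencies of the `top` most common symbols, highest first.

    The symbols themselves are discarded, so profiles taken from
    differently encoded corpora can be compared directly.
    """
    counts = Counter(seq)
    total = sum(counts.values())
    return [n / total for _, n in counts.most_common(top)]

# The same word in Unicode and in a made-up byte encoding.
word_unicode = "banana"
word_unknown = bytes([0xE2, 0x86, 0x8D, 0x86, 0x8D, 0x86])
print(pattern(word_unicode) == pattern(word_unknown))
```
</pre>
    <p><font face="Candara">Matching word patterns against a word list
        for each candidate language, or comparing profiles by some
        distance measure, would then suggest both the language and the
        start of a table.</font></p>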

    <p><font face="Candara">When the breakdown of the writing into
        encoded elements is unknown, or known to be different from
        Unicode in some way, it becomes much more challenging.</font></p>

    <p><font face="Candara">The "presentation forms" would map some
        pairs or triplets to single codes; if you know the type of
        writing system, you may be able to guess which features might
        get encoded as singletons.</font></p>

    <p><font face="Candara">In principle, if, say, "fi" ligatures are
        coded using a single code, your statistics would give the "fi"
        frequency not for a pair but for that singleton value. It would
        be a cool challenge to see how not knowing whether fi is or
        isn't single-coded would affect recognition of languages in the
        Latin script and/or generating transcoding tables.</font></p>
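    <p><font face="Candara">A small illustration of that shift in the
        statistics, using U+FB01 as the single code for "fi". The
        sample sentence and the helper name singles_and_pairs are
        invented for the example, and the sentence is deliberately
        loaded with "fi" so the effect is easy to see.</font></p>
<pre>
```python
from collections import Counter

LIGATURE = "\ufb01"  # U+FB01 LATIN SMALL LIGATURE FI

def singles_and_pairs(text):
    """Count single symbols and adjacent symbol pairs."""
    return Counter(text), Counter(zip(text, text[1:]))

plain = "the first five fish find fine figs"
ligated = plain.replace("fi", LIGATURE)  # "fi" now occupies one code

plain_singles, plain_pairs = singles_and_pairs(plain)
lig_singles, lig_pairs = singles_and_pairs(ligated)

# The mass that sat on the pair (f, i) moves to one singleton,
# and the separate counts for "f" and "i" fall accordingly.
print(plain_pairs[("f", "i")], lig_pairs[("f", "i")])
print(lig_singles[LIGATURE], plain_singles["f"], lig_singles["f"])
```
</pre>
    <p><font face="Candara">An analyst who does not know about the
        ligature sees one extra symbol and depressed "f" and "i"
        counts, which is the kind of distortion a rank-based table
        would have to tolerate.</font></p>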

    <p><font face="Candara">I agree, the situation would get
        progressively more complex for Indic scripts that have unknown
        encoding models.<br>
        <br>
        It might need some more sophisticated mathematical techniques,
        but it should not be harder to break than many encryption
        schemes.</font></p>

    <p><font face="Candara">A./<br>

      </font></p>

  </body>

</html>