<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <div class="moz-cite-prefix">On 10/23/2021 4:02 PM, James Kass via

      Unicode wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:f74f6b6a-a77a-2034-c1b0-50e90274fb2e@code2001.com">

      <br>

      On 2021-10-23 10:59 PM, Asmus Freytag via Unicode wrote:

      <br>

      <blockquote type="cite">If you know the language, you can play

        with frequency data and try to use guess

        <br>

        mapping tables. You'll probably get most of the singleton to

        singleton mappings

        <br>

        correct, and then you could use various forms of trial and

        error, such as

        <br>

        genetic algorithms to locate and assign n:m mappings.

        <br>

        <br>

        If the language is not known, but among a set of known languages

        for which there

        <br>

        is existing data, I wouldn't be surprised to learn that you

        could adopt simple

        <br>

        language recognition algorithms to be independent of encoding

        details, and

        <br>

        either identify the actual language, or sharply limit the

        candidates.

        <br>

        <br>

        After that, you'd re-run the recognition algorithm with each

        candidate

        <br>

        transcoding table.

        <br>

        <br>

        I'm not an expert on this, but I did cobble together my own toy

        language

        <br>

        recognition code at one time, including using some genetic

        algorithm to improve

        <br>

        its sensitivity. Fun stuff and  I was surprised how well that

        worked with only a

        <br>

        few hours of effort.

        <br>

      </blockquote>

      <br>

      That's a sophisticated approach.  For anyone lacking that level of

      expertise or not having quick access to language

      frequency/identification data, it might be more practical to

      locate the modified font, open it in one of those font editors

      which displays all the glyphs in the font on a grid, open up the

      Unicode charts, and start cross-mapping away.

      <br>

    </blockquote>

    <p>If you have the font and a system to display it on, by all means;

      constructing a cross map manually should not be that difficult. I

      was assuming the more difficult problem of dealing with data that

      you can't even display.<br>

    </p>

    <p>However, you might overestimate the level of skill required to

      crack a problem like that.</p>

    <p>I've never taken a computer science class in my life. The

      "genetic algorithm" I played with, I grabbed from a toy sample in

      a book or blog. The frequency data to train my recognizer I got

      from scanning a few Wikipedia pages. In other words, as crude as

      you can get. Every programmer should be able to duplicate (and

      improve on) what I did. I mentioned the details mainly to

      illustrate that if you can get text samples in a known encoding

      it's amazing what you can do with that information, and how easy

      it is.</p>
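    <p>A crude recognizer of that kind might look like the following
      Python sketch. It is an illustration only, not the code described
      above: the function names (char_profile, cosine, guess_language)
      and the choice of cosine similarity over single-letter frequencies
      are assumptions made here, and the training text is whatever
      known-encoding samples you can collect for each language.</p>

```python
from collections import Counter
import math


def char_profile(text):
    """Relative frequency of each letter in text (case-folded, letters only)."""
    letters = [ch for ch in text.casefold() if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    return {ch: n / total for ch, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two frequency profiles (dicts of floats)."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def guess_language(sample, profiles):
    """Return the key of the stored profile closest to the sample's profile."""
    target = char_profile(sample)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))
```

    <p>Build one profile per language from a few pages of text, store
      them in a dict keyed by language name, and pass that dict to
      guess_language together with the sample to classify. Letter pairs
      or triples would likely discriminate better than single letters;
      single letters just keep the sketch short.</p>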

    <p>Now, if there simply aren't any data for that language (or a very

      similar language) in a known encoding, but the script is

      supported, then looking up the glyphs is all you can do, of

      course.</p>

    <p>If you have a way of displaying the original data as running

      text, you can compare it with a proposed transcoding to Unicode to verify

      your guesses, whether those guesses are manual or automated.<br>

      <br>

      I'm not sure, to get back to your earlier post, whether having

      something like an OCR engine is as helpful. One of the issues is

      that forcing the display into a DOS font for display on a terminal

      grid may distort the appearance for some writing systems enough to

      throw off the OCR.<br>

    </p>

    <p>If you can work "blind", that is, based purely on statistics, you

      can sidestep that issue, or, in fact, address directly the question

      of differences not just in code mapping, but in encoding model

      (selection of the elements to be encoded).</p>
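    <p>As a rough illustration of working from statistics alone, the
      Python sketch below pairs the n-th most frequent code in the
      undisplayable data with the n-th most frequent character in
      reference text of the same language. The names rank_by_frequency
      and guess_mapping are made up for this example, and rank matching
      is only one simple way to get a first guess: it covers singleton
      to singleton mappings and says nothing about n:m mappings or
      differences in encoding model.</p>

```python
from collections import Counter


def rank_by_frequency(symbols):
    """Distinct symbols, most frequent first; ties broken by symbol value."""
    counts = Counter(symbols)
    return [s for s, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def guess_mapping(unknown_codes, reference_text):
    """First-guess table: n-th most frequent code -> n-th most frequent character."""
    src = rank_by_frequency(unknown_codes)   # e.g. byte values from the legacy data
    dst = rank_by_frequency(reference_text)  # Unicode text in the same language
    return dict(zip(src, dst))               # extra symbols on either side are dropped
```

    <p>Characters with similar frequencies will often come out swapped,
      so a table produced this way is a starting point for trial and
      error rather than a finished mapping.</p>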

    <p>The neat thing about trying to construct a cross mapping table is

      that it should be the same no matter which text fragment you apply

      it to. If your source texts are long enough, you may be able to

      divide them, and try the process over each fragment separately to

      see where it converges. Also, if the table has a localized defect,

      you may be able to display enough of the text on a Unicode system

      to be able to reason about the missing pieces, so you can add

      manual guesses to your automated ones. <br>

    </p>
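    <p>The fragment idea can be sketched in Python as follows. The
      source is cut into a few pieces, a table is guessed for each piece
      separately, and only the entries on which every piece agrees are
      kept; the rest are flagged for manual attention. All names here
      (rank_mapping, split_fragments, stable_entries) are hypothetical,
      and the per-fragment guesser is plain frequency-rank matching,
      chosen only to keep the example self-contained.</p>

```python
from collections import Counter


def rank_mapping(codes, reference_text):
    """Guess a table by pairing symbols of equal frequency rank."""
    def ranked(seq):
        counts = Counter(seq)
        return [s for s, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    return dict(zip(ranked(codes), ranked(reference_text)))


def split_fragments(codes, parts):
    """Cut codes into at most `parts` consecutive pieces of roughly equal size."""
    size = max(1, -(-len(codes) // parts))  # ceiling division, at least 1
    return [codes[i:i + size] for i in range(0, len(codes), size)]


def stable_entries(codes, reference_text, parts=3):
    """Return (agreed, disputed): entries all fragments agree on, and the rest."""
    tables = [rank_mapping(f, reference_text) for f in split_fragments(codes, parts)]
    agreed, disputed = {}, set()
    for code in set().union(*tables):
        guesses = {t[code] for t in tables if code in t}
        seen_everywhere = all(code in t for t in tables)
        if seen_everywhere and len(guesses) == 1:
            agreed[code] = guesses.pop()
        else:
            disputed.add(code)  # fragments disagree, or the code is missing from some
    return agreed, disputed
```

    <p>The disputed set is where displaying part of the converted text
      and reasoning about the gaps would come in.</p>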

    <p>All of this reduces the level of perfection required of any tool you

      construct.</p>

    <p>A./</p>

    <br>

  </body>

</html>