<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body>


    <div class="moz-forward-container">


      <div class="moz-cite-prefix">On 10/23/2021 2:44 PM, James Kass via

        Unicode wrote:<br>

      </div>

      <blockquote type="cite"

        cite="mid:deef7f6e-860b-428f-0e47-8ed50b8a4109@code2001.com"> <br>

        On 2021-10-22 9:04 PM, David Starner via Unicode wrote: <br>

        <blockquote type="cite">Project Gutenberg had a Swedish bible

          translation <br>

          in an unknown encoding (a variant of the DOS encoding that

          doesn't <br>

          seem to have corresponded to anything documented); getting it

          to <br>

          display correctly was basically the same challenge as

          translating it <br>

          to Unicode, which was eventually done by figuring out what the

          unknown <br>

          codepoints (obviously quotes) must have been. <br>

        </blockquote>

        <br>

        Editors for DOS fonts enabled users to create all manner of

        alternate "encodings" for anything which could fit into the

        grid. Newly created/modified fonts could be saved under

        different file names.  A DOS command then enabled users to swap

        the font-in-use. <br>

        <br>

        Here's an example of such an editor written by Adam Twardoch in

        1994: <br>

        <a class="moz-txt-link-freetext"

          href="https://dos-font-utils-wiki.readthedocs.io/en/latest/POLFED/"

          moz-do-not-send="true">https://dos-font-utils-wiki.readthedocs.io/en/latest/POLFED/</a>

        <br>

        <br>

        The Swedish text data which didn't match up with any known code

        page that David Starner encountered must have originally been

        displayed with such a modified font.  There's probably similar

        legacy data still out there which will be challenging to anyone

        trying to preserve it by converting it to Unicode. <br>

        <br>

      </blockquote>

      <p><font face="Candara">If we assume a DOS font, that is, a
          collection of shapes that each occupy one fixed cell, then
          the ways in which the display of a text stream can be broken
          up are limited: some digraphs (including "accented"
          characters) may occupy a single cell, while some multigraphs
          might be displayed across two (or three) adjacent cells.</font></p>

      <p><font face="Candara">There is much less flexibility in such a
          system than in one that uses a custom outline font and a
          custom shaping engine.</font></p>

      <p><font face="Candara">That means the "encryption" of a Unicode
          text stream into such an encoding is constrained.</font></p>

      <p><font face="Candara">If you know the language, you can play
          with frequency data and try to guess mapping tables. You'll
          probably get most of the singleton-to-singleton mappings
          correct, and then you could use various forms of trial and
          error, such as genetic algorithms, to locate and assign n:m
          mappings.</font></p>
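      <p><font face="Candara">A rough Python sketch of the
          singleton-to-singleton step, just to make the idea concrete.
          It rests on assumptions that are mine and not established
          anywhere in this thread: bytes below 0x80 are taken to be
          plain ASCII, only the high bytes are unknown, and a reference
          text in the known language is available. The function names
          are made up for illustration.</font></p>
      <pre>
```python
from collections import Counter


def guess_high_byte_table(encoded: bytes, reference: str) -> dict[int, str]:
    """Pair unknown high bytes with non-ASCII reference characters by frequency rank.

    Assumption: bytes below 0x80 are ordinary ASCII and need no guessing.
    """
    high_bytes = Counter(b for b in encoded if b >= 0x80)
    high_chars = Counter(c for c in reference if ord(c) >= 0x80)
    byte_ranks = [b for b, _ in high_bytes.most_common()]
    char_ranks = [c for c, _ in high_chars.most_common()]
    # The most frequent byte is paired with the most frequent character, and so on.
    return dict(zip(byte_ranks, char_ranks))


def decode(encoded: bytes, table: dict[int, str]) -> str:
    """Apply a guessed table; unmapped high bytes become U+FFFD."""
    out = []
    for b in encoded:
        if b >= 0x80:
            out.append(table.get(b, "\ufffd"))
        else:
            out.append(chr(b))
    return "".join(out)
```
      </pre>
      <p><font face="Candara">Rank matching like this only gives a
          first guess. Symbols with similar frequencies can easily be
          swapped, and it does nothing for the n:m cases, which is
          where the trial and error comes in.</font></p>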

      <p><font face="Candara">If the language is not known, but is
          among a set of known languages for which there is existing
          data, I wouldn't be surprised to learn that you could adapt
          simple language recognition algorithms to be independent of
          encoding details, and either identify the actual language or
          sharply limit the candidates.</font></p>
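      <p><font face="Candara">One way this might work, sketched in
          Python: compare only features that survive a 1:1 re-encoding.
          Here that feature is the sorted list of relative symbol
          frequencies, which does not depend on which code is assigned
          to which character. This is an illustrative guess at a
          technique, not a tested recognizer; the function names, the
          profile size of 40 and the L1 distance are all arbitrary
          choices of mine.</font></p>
      <pre>
```python
from collections import Counter


def rank_profile(symbols, size=40):
    """Relative symbol frequencies in descending order, zero-padded to `size`.

    Works on bytes (unknown encoding) or str (reference text) alike,
    because only the counts matter, not the symbol identities.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    freqs = sorted((n / total for n in counts.values()), reverse=True)
    return (freqs + [0.0] * size)[:size]


def profile_distance(p, q):
    """L1 distance between two rank profiles of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rank_languages(encoded: bytes, samples: dict[str, str], size=40):
    """Return language names ordered from closest to farthest profile."""
    target = rank_profile(encoded, size)
    scored = [
        (profile_distance(target, rank_profile(text, size)), lang)
        for lang, text in samples.items()
    ]
    return [lang for _, lang in sorted(scored)]
```
      </pre>
      <p><font face="Candara">A profile this crude will not separate
          closely related languages, and n:m cell mappings distort it
          somewhat, so at best it narrows the list of candidates.</font></p>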

      <p><font face="Candara">After that, you'd re-run the recognition

          algorithm with each candidate transcoding table.</font></p>
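      <p><font face="Candara">A small Python sketch of that last step,
          under the assumption that each candidate table is a plain
          byte-to-character dictionary and that "recognition" is
          approximated by counting how many decoded character pairs
          also occur in a reference text. Both the scoring rule and the
          names are hypothetical stand-ins for whatever recognizer is
          actually used.</font></p>
      <pre>
```python
def bigrams(text: str) -> set[str]:
    """Set of adjacent character pairs occurring in `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def score(decoded: str, reference_bigrams: set[str]) -> float:
    """Fraction of decoded character pairs that also occur in the reference."""
    pairs = [decoded[i:i + 2] for i in range(len(decoded) - 1)]
    if not pairs:
        return 0.0
    return sum(p in reference_bigrams for p in pairs) / len(pairs)


def best_table(encoded: bytes, tables: dict[str, dict[int, str]], reference: str) -> str:
    """Decode with every candidate table and return the best-scoring table's name."""
    known = bigrams(reference)

    def decode(table: dict[int, str]) -> str:
        # Bytes missing from a table become U+FFFD and so never match a known pair.
        return "".join(table.get(b, "\ufffd") for b in encoded)

    return max(tables, key=lambda name: score(decode(tables[name]), known))
```
      </pre>
      <p><font face="Candara">The same score could serve as the
          fitness function if the tables are being evolved by trial and
          error rather than picked from a fixed list.</font></p>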

      <p><font face="Candara">I'm not an expert on this, but I did
          cobble together my own toy language recognition code at one
          time, including using a genetic algorithm to improve its
          sensitivity. Fun stuff, and I was surprised how well it
          worked with only a few hours of effort.</font></p>

      <p><font face="Candara">A./<br>

        </font></p>


    </div>

  </body>

</html>