<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-forward-container">
<div class="moz-cite-prefix">On 10/23/2021 2:44 PM, James Kass via
Unicode wrote:<br>
</div>
<blockquote type="cite"
cite="mid:deef7f6e-860b-428f-0e47-8ed50b8a4109@code2001.com"> <br>
On 2021-10-22 9:04 PM, David Starner via Unicode wrote: <br>
<blockquote type="cite">Project Gutenberg had a Swedish bible
translation <br>
in an unknown encoding (a variant of the DOS encoding that
doesn't <br>
seem to have corresponded to anything documented); getting it
to <br>
display correctly was basically the same challenge as
translating it <br>
to Unicode, which was eventually done by figuring out what the
unknown <br>
codepoints (obviously quotes) must have been. <br>
</blockquote>
<br>
Editors for DOS fonts enabled users to create all manner of
alternate "encodings" for anything which could fit into the
grid. Newly created/modified fonts could be saved under
different file names. A DOS command then enabled users to swap
the font-in-use. <br>
<br>
Here's an example of such an editor written by Adam Twardoch in
1994: <br>
<a class="moz-txt-link-freetext"
href="https://dos-font-utils-wiki.readthedocs.io/en/latest/POLFED/"
moz-do-not-send="true">https://dos-font-utils-wiki.readthedocs.io/en/latest/POLFED/</a>
<br>
<br>
The Swedish text data which didn't match up with any known code
page that David Starner encountered must have originally been
displayed with such a modified font. There's probably similar
legacy data still out there which will be challenging to anyone
trying to preserve it by converting it to Unicode. <br>
<br>
</blockquote>
<p><font face="Candara">If we assume a DOS font, that is, a
collection of shapes each occupying a fixed cell, then that
limits the ways in which the display of a text stream can be
broken up: some digraphs (including "accented" characters)
may occupy one cell, and some multigraphs might be displayed
across two (or three) adjacent cells.</font></p>
<p><font face="Candara">There is much less flexibility in such a
system than in one that uses a custom outline font and a custom
shaping engine.</font></p>
<p><font face="Candara">That means the "encryption" of a Unicode
text stream into such an encoding is constrained.</font></p>
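<p><font face="Candara">To make the fixed-cell constraint
concrete, here is a minimal Python sketch of decoding with a
table whose keys are runs of one to three cells. The table
contents, the byte values and the name
<code>decode_cells</code> are all made up for illustration; a
real table would have to be recovered from the font in
question.</font></p>
<pre>
```python
def decode_cells(data, table, max_cells=3):
    """Greedy longest-match decode of fixed-cell text.

    `table` maps runs of 1..max_cells bytes (cells) to Unicode strings.
    Cells with no entry are replaced by U+FFFD.
    """
    out = []
    i = 0
    while i != len(data):
        # Try the widest run first, then narrower ones.
        for width in range(min(max_cells, len(data) - i), 0, -1):
            key = bytes(data[i:i + width])
            if key in table:
                out.append(table[key])
                i += width
                break
        else:
            # No run starting at this cell is known.
            out.append("\ufffd")
            i += 1
    return "".join(out)


# Hypothetical table: one cell drawn as a digraph, and one wide
# letter drawn across two adjacent cells.
toy_table = {
    b"\x80": "ij",      # 1 cell, 2 characters
    b"\x81\x82": "w",   # 2 cells, 1 character
    b"s": "s",
    b" ": " ",
}

decoded_cells = decode_cells(b"\x80s \x81\x82", toy_table)
```
</pre>
<p><font face="Candara">Because a run can only be as long as a
few adjacent cells, the search space for such entries stays
small, which is the sense in which the mapping is
constrained.</font></p>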
<p><font face="Candara">If you know the language, you can play
with frequency data and try to guess mapping tables.
You'll probably get most of the singleton-to-singleton
mappings correct, and then you could use various forms of
trial and error, such as genetic algorithms, to locate and
assign n:m mappings.</font></p>
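<p><font face="Candara">A first guess at the singleton mappings
can be sketched in a few lines of Python: rank the unknown
bytes by frequency, rank the characters of a reference text in
the known language the same way, and pair them off by rank.
The names <code>rank_by_frequency</code> and
<code>guess_mapping</code> and the toy data are assumptions of
this sketch; on real text the result is only a starting point
for the trial-and-error step.</font></p>
<pre>
```python
from collections import Counter


def rank_by_frequency(symbols):
    """Return the distinct symbols, most frequent first."""
    counts = Counter(symbols)
    # Sort by descending count; ties are broken by symbol value
    # so that the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [symbol for symbol, _ in ordered]


def guess_mapping(unknown_bytes, reference_text):
    """Pair the n-th most frequent byte with the n-th most frequent character."""
    unknown_ranked = rank_by_frequency(unknown_bytes)
    reference_ranked = rank_by_frequency(reference_text)
    return dict(zip(unknown_ranked, reference_ranked))


# Toy data, purely illustrative.
reference = "aaaabbbc"
unknown = bytes([0x91, 0x91, 0x91, 0x84, 0x84, 0x8F])

mapping = guess_mapping(unknown, reference)
decoded = "".join(mapping.get(b, "?") for b in unknown)
```
</pre>
<p><font face="Candara">Characters with similar frequencies
will often be swapped by this kind of guess, so the table needs
checking against actual words before it is trusted.</font></p>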
<p><font face="Candara">If the language is not known, but is among a
set of known languages for which there is existing data, I
wouldn't be surprised to learn that you could adapt simple
language recognition algorithms to be independent of encoding
details, and either identify the actual language, or sharply
limit the candidates.</font></p>
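<p><font face="Candara">One way this might look, as a rough
Python sketch rather than a tested method: compare texts by the
shape of their frequency distribution only, discarding which
symbol has which frequency. Any 1:1 re-encoding leaves that
shape unchanged. The function names and the sample labels
<code>lang_a</code> and <code>lang_b</code> are hypothetical,
and a fingerprint this crude would need a lot of text, and
probably richer features, to separate real languages.</font></p>
<pre>
```python
from collections import Counter


def frequency_profile(symbols, size=8):
    """Relative frequencies of the `size` most common symbols, highest first.

    Symbol identities are discarded, so a 1:1 re-encoding of the
    input yields exactly the same profile.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:size]
    profile = [count / total for count in top]
    # Pad with zeros so that small alphabets stay comparable.
    return profile + [0.0] * (size - len(profile))


def profile_distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rank_languages(unknown, samples, size=8):
    """Return the sample names ordered from closest to farthest profile."""
    target = frequency_profile(unknown, size)
    scored = [
        (profile_distance(target, frequency_profile(text, size)), name)
        for name, text in samples.items()
    ]
    return [name for _, name in sorted(scored)]


# Toy data: the "unknown" bytes are a shifted copy of a known string.
plain = "abracadabra"
encoded = bytes(ord(c) + 100 for c in plain)
samples = {"lang_a": plain * 2, "lang_b": "zzzzzzzzzy"}

ranking = rank_languages(encoded, samples)
```
</pre>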
<p><font face="Candara">After that, you'd re-run the recognition
algorithm with each candidate transcoding table.</font></p>
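<p><font face="Candara">A sketch of that last step in Python,
with a plain word-list lookup standing in for the recognition
algorithm: decode the bytes with each candidate table and keep
the table that yields the most known words. The two tables,
their byte assignments and the two-word vocabulary are invented
for the example and do not describe any documented code
page.</font></p>
<pre>
```python
def decode_with_table(data, table):
    """Map each byte through `table`; unmapped bytes become U+FFFD."""
    return "".join(table.get(b, "\ufffd") for b in data)


def word_score(text, vocabulary):
    """Fraction of whitespace-separated words found in `vocabulary`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


def best_table(data, tables, vocabulary):
    """Return the best-scoring table name together with all scores."""
    scores = {
        name: word_score(decode_with_table(data, table), vocabulary)
        for name, table in tables.items()
    }
    return max(scores, key=scores.get), scores


# Printable ASCII is shared by both hypothetical tables.
ascii_part = {b: chr(b) for b in range(0x20, 0x7F)}
tables = {
    # a-diaeresis at 0x84, o-diaeresis at 0x94
    "table_x": {**ascii_part, 0x84: "\u00e4", 0x94: "\u00f6"},
    # the same two letters, swapped
    "table_y": {**ascii_part, 0x84: "\u00f6", 0x94: "\u00e4"},
}
vocabulary = {"f\u00f6r", "\u00e4ven"}  # two Swedish words

winner, scores = best_table(b"f\x94r \x84ven", tables, vocabulary)
```
</pre>
<p><font face="Candara">With real data the scores will rarely
be this clear-cut, so it is worth looking at the runners-up as
well as the winner.</font></p>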
<p><font face="Candara">I'm not an expert on this, but I did
cobble together my own toy language recognition code at one
time, including using a genetic algorithm to improve its
sensitivity. Fun stuff, and I was surprised how well it
worked with only a few hours of effort.</font></p>
<p><font face="Candara">A./<br>
</font></p>
</div>
</body>
</html>