<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 10/23/2021 4:02 PM, James Kass via
Unicode wrote:<br>
</div>
<blockquote type="cite"
cite="mid:f74f6b6a-a77a-2034-c1b0-50e90274fb2e@code2001.com">
<br>
On 2021-10-23 10:59 PM, Asmus Freytag via Unicode wrote:
<br>
<blockquote type="cite">If you know the language, you can play
with frequency data and try to guess
<br>
mapping tables. You'll probably get most of the singleton to
singleton mappings
<br>
correct, and then you could use various forms of trial and
error, such as
<br>
genetic algorithms to locate and assign n:m mappings.
<br>
<br>
If the language is not known, but among a set of known languages
for which there
<br>
is existing data, I wouldn't be surprised to learn that you
could adopt simple
<br>
language recognition algorithms to be independent of encoding
details, and
<br>
either identify the actual language, or sharply limit the
candidates.
<br>
<br>
After that, you'd re-run the recognition algorithm with each
candidate
<br>
transcoding table.
<br>
<br>
I'm not an expert on this, but I did cobble together my own toy
language
<br>
recognition code at one time, including using some genetic
algorithm to improve
<br>
its sensitivity. Fun stuff and I was surprised how well that
worked with only a
<br>
few hours of effort.
<br>
</blockquote>
<br>
That's a sophisticated approach. For anyone lacking that level of
expertise or not having quick access to language
frequency/identification data, it might be more practical to
locate the modified font, open it in one of those font editors
which displays all the glyphs in the font on a grid, open up the
Unicode charts, and start cross-mapping away.
<br>
</blockquote>
<p>If you have the font and a system to display it on, by all means;
constructing a cross map manually should not be that difficult. I
was assuming the more difficult problem of dealing with data that
you can't even display.<br>
</p>
<p>However, you might overestimate the level of skill required to
crack a problem like that.</p>
<p>I've never taken a computer science class in my life. The
"genetic algorithm" I played with, I grabbed from a toy sample in
a book or blog. The frequency data to train my recognizer I got
from scanning a few Wikipedia pages. In other words, as crude as
you can get. Every programmer should be able to duplicate (and
improve on) what I did. I mentioned the details mainly to
illustrate that if you can get text samples in a known encoding
it's amazing what you can do with that information, and how easy
it is.</p>
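<p>To illustrate how little is needed, here is a minimal sketch of the
  frequency idea. It is not my original toy code. The function names
  (<code>rank_by_frequency</code>, <code>guess_mapping</code>,
  <code>transcode</code>) are made up for this example, and it assumes
  you have a reference sample of the same language in a known encoding.
  It only attempts singleton to singleton mappings, by pairing symbols
  of equal frequency rank.</p>
<pre>
```python
from collections import Counter


def rank_by_frequency(text):
    """Return the distinct symbols of text, most frequent first."""
    counts = Counter(text)
    # Sort by descending count; break ties by symbol so the result is deterministic.
    return [sym for sym, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def guess_mapping(unknown_text, reference_text):
    """Pair symbols of equal frequency rank (unknown code to reference character)."""
    return dict(zip(rank_by_frequency(unknown_text), rank_by_frequency(reference_text)))


def transcode(text, mapping, missing="\ufffd"):
    """Apply the guessed table; unmapped symbols become U+FFFD."""
    return "".join(mapping.get(sym, missing) for sym in text)


if __name__ == "__main__":
    reference = "aaaabbbccd"   # stand-in for text in a known encoding
    unknown = "xxxxyyyzzw"     # the same distribution under an unknown 1:1 recoding
    table = guess_mapping(unknown, reference)
    print(transcode("xyzw", table))
```
</pre>
<p>On real data the rare symbols will often be paired incorrectly, and
  n:m mappings are not handled at all; this only produces a first guess
  to refine by trial and error.</p>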
<p>Now, if there simply aren't any data for that language (or a very
similar language) in a known encoding, but the script is
supported, then looking up the glyphs is all you can do, of
course.</p>
<p>If you have a way of displaying the original data as running
text, you can compare a proposed transcoding to Unicode to verify
your guesses, whether those guesses are manual or automated.<br>
<br>
    To get back to your earlier post, I'm not sure that having
    something like an OCR engine would be as helpful. One of the issues
    is that forcing the text into a DOS font for display on a terminal
    grid may distort the appearance of some writing systems enough to
    throw off the OCR.<br>
</p>
<p>If you can work "blind", that is, based on statistics alone, you
  can sidestep that issue, or, in fact, directly address the question
  of differences not just in code mapping, but in encoding model
  (selection of the elements to be encoded).</p>
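<p>As a rough sketch of what working "blind" could look like: the
  sorted list of relative symbol frequencies does not change under a
  1:1 recoding, so it can be compared against samples of candidate
  languages without knowing the code mapping. The names below
  (<code>sorted_profile</code>, <code>profile_distance</code>,
  <code>closest_language</code>) and the simple absolute-difference
  distance are assumptions chosen for illustration, not a description
  of any particular recognizer.</p>
<pre>
```python
from collections import Counter


def sorted_profile(text, size=30):
    """Relative symbol frequencies in descending order, padded or cut to size."""
    total = len(text)
    freqs = sorted((n / total for n in Counter(text).values()), reverse=True)
    return (freqs + [0.0] * size)[:size]


def profile_distance(p, q):
    """Sum of absolute differences between two profiles of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_language(unknown_text, samples):
    """Return the key of the sample whose profile is nearest to the unknown text."""
    target = sorted_profile(unknown_text)
    return min(
        samples,
        key=lambda name: profile_distance(target, sorted_profile(samples[name])),
    )


if __name__ == "__main__":
    # Tiny stand-in samples; real use needs far longer texts per language.
    samples = {"A": "aaaaaabbbc", "B": "abcdefghij"}
    print(closest_language("xxxxxxyyyz", samples))
```
</pre>
<p>A difference in encoding model (say, precomposed versus decomposed
  elements) does change the profile, so in practice you would compare
  against samples prepared under each candidate model.</p>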
<p>The neat thing about trying to construct a cross mapping table is
that it should be the same no matter which text fragment you apply
it to. If your source texts are long enough, you may be able to
divide them, and try the process over each fragment separately to
see where it converges. Also, if the table has a localized defect,
you may be able to display enough of the text on a Unicode system
to be able to reason about the missing pieces, so you can add
manual guesses to your automated ones. <br>
</p>
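<p>The fragment check can be sketched along these lines. This is a
  hypothetical helper, not a tested tool: it assumes a plain
  frequency-rank guess per fragment and a reference ranking
  (<code>reference_ranked</code>) obtained from text in a known
  encoding. Entries that every fragment agrees on are kept; the rest
  are flagged for manual review.</p>
<pre>
```python
from collections import Counter


def rank_mapping(fragment, reference_ranked):
    """Guess a table for one fragment by pairing symbols of equal frequency rank."""
    ranked = [sym for sym, _ in Counter(fragment).most_common()]
    return dict(zip(ranked, reference_ranked))


def split_fragments(text, parts):
    """Cut text into `parts` equal pieces (any leftover tail is ignored)."""
    size = max(1, len(text) // parts)
    return [text[i:i + size] for i in range(0, len(text), size)][:parts]


def stable_entries(text, reference_ranked, parts=3):
    """Return (agreed, disputed): entries all fragments agree on, and the rest."""
    fragments = split_fragments(text, parts)
    votes = {}
    for fragment in fragments:
        for src, dst in rank_mapping(fragment, reference_ranked).items():
            votes.setdefault(src, Counter())[dst] += 1
    agreed, disputed = {}, set()
    for src, tally in votes.items():
        dst, count = tally.most_common(1)[0]
        # A symbol missing from any fragment also counts as disputed.
        if count == len(fragments):
            agreed[src] = dst
        else:
            disputed.add(src)
    return agreed, disputed


if __name__ == "__main__":
    agreed, disputed = stable_entries("xxxyyzxxxyyzxyyzzz", list("abc"))
    print(agreed, sorted(disputed))
```
</pre>
<p>The disputed entries are the ones to reason about by hand once
  enough of the text displays on a Unicode system.</p>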
<p>All of this reduces the level of perfection required of any tool you
  construct.</p>
<p>A./</p>
<br>
</body>
</html>