<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 10/22/2021 9:31 PM, James Kass via
Unicode wrote:<br>
</div>
<blockquote type="cite"
cite="mid:eb0a1cda-6039-ec95-9e83-b425b368eb41@code2001.com">
<br>
On 2021-10-22 9:04 PM, David Starner via Unicode wrote:
<br>
<blockquote type="cite">"as long as the source display is
correctly enabled and the
<br>
translation software handles the source language(s)." So in no
<br>
interesting cases. Project Gutenberg had a Swedish bible
translation
<br>
in an unknown encoding (a variant of the DOS encoding that
doesn't
<br>
seem to have corresponded to anything documented); getting it to
<br>
display correctly was basically the same challenge as
translating it
<br>
to Unicode, which was eventually done by figuring out what the
unknown
<br>
codepoints (obviously quotes) must have been. The set of
languages in
<br>
PUA and that have reliable transcription and translation is
going to
<br>
be virtually empty, and if you care about correctness and you
have the
<br>
font, directly convert the encoding.
<br>
</blockquote>
Yes, it's best to directly convert old source data when it's
feasible.
<br>
<br>
When the source data is in pre-Unicode Indic languages/scripts (or
even in pre-shaping support Unicode), this can often not be
accomplished simply. If you know the font and can find a
cross-reference table, then you're off to a good start. If you
can't find an existing cross-reference and have to "roll your
own", it's not as fun as it sounds. Some legacy fonts combine
standard encoding with PUA for presentation forms, others use
ISO-8859-hacks. Any presentation form might be covered with a
dedicated glyph in one font, yet the same presentation form might
be constructed from two or three component glyphs in other fonts.
And, crucially, even after you've set up the basic cross-reference
table, there's still reordering which must be accomplished.
(Pre-Unicode Indic was of necessity entered in visual order. Same
for pre-shaping Unicode Indic.)
<br>
<br>
Instead of going through all that rigamarole, most users would
probably prefer to just take a picture of the text with their
phone and be done with it. And if the source data is PDF, in a
perfect world the PDF file could be dragged and dropped directly
into the app, which would then prompt the user to choose whether
the source should be processed as text or graphic.
<br>
<br>
I don't know enough about the current state of OCR to evaluate the
challenge of training software to recognize unsupported scripts.
An open source OCR system like Tesseract may already be set up for
the common Indic scripts, and since it's crowd-sourced, it might
eventually ease or simplify the training process, if it hasn't
already.
<br>
<br>
<br>
</blockquote>
<p><font face="Candara">If you have encoded data, and the encoding
elements are the same, but at different location, the task is
reasonably straightforward for letters; punctuation can be more
challenging, because of variations in conventions even for the
same language; however, there are fewer marks, so you can try
them after you've cracked the text and see what looks good.<br>
</font></p>
<p><font face="Candara">If you know the language, simple frequency
analysis (perhaps extended to pairs and triplets) should give
you the transcoding table.</font></p>
<p><font face="Candara">If you don't know the language, you should
be able to use the same data, but would need a way to represent
it that abstracts from the code value, so you can compare this
to to data from some corpora using a different encoding.</font></p>
<p><font face="Candara">When the breakdown of the writing into
encoded elements is unknown, or known to be different from
Unicode in some way it would become way more challenging.</font></p>
<p><font face="Candara">The "presentation forms" would map some
pairs, triplets to single codes; if you know the type of writing
system, you may guess which features might get encoded as
singletons.</font></p>
<p><font face="Candara">In principle, say "fi" ligatures are coded
using a single code, your statistics would be giving the "fi"
frequency not in a pair but for that singleton value. It would
be a cool challenge to see how not knowing whether fi is or
isn't single-coded would affect recognition of languages in the
Latin script and or generating transcoding tables.</font></p>
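<p><font face="Candara">A rough way to probe that, sketched below in
    Python with made-up file names: pairs that the language predicts as
    common but that are nearly absent from the data are ligature
    suspects, and their frequency should reappear on some otherwise
    unexplained singleton code.</font></p>
<pre>from collections import Counter

def pair_freqs(text):
    """Relative frequency of each adjacent pair of symbols."""
    total = max(len(text) - 1, 1)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {p: c / total for p, c in pairs.items()}

# Hypothetical file names: a Latin-script reference corpus and a text
# tentatively decoded with a candidate transcoding table.
expected = pair_freqs(open("latin_reference.txt", encoding="utf-8").read())
observed = pair_freqs(open("tentatively_decoded.txt", encoding="utf-8").read())

# Pairs the language predicts as common (assumed cut-off: 0.1%) but that
# the data shows at under a tenth of that rate are ligature suspects;
# their frequency should reappear on some unexplained singleton code.
suspects = {p: f for p, f in expected.items()
            if f > 0.001 and f / 10 > observed.get(p, 0.0)}
print(suspects)
</pre>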
<p><font face="Candara">I agree, the situation would get
progressively complex for Indic scripts that have unknown
encoding models.<br>
<br>
Might need some more sophisticated mathematical techniques, but
should not be harder to break than many encryptions.</font></p>
<p><font face="Candara">A./<br>
</font></p>
</body>
</html>