Batch conversion from "normal" Unicode text to NCRs
Jim DeLaHunt
list+unicode at jdlh.com
Tue Nov 12 00:13:03 CST 2024
Hello, António:
Interesting question.
On 2024-11-11 19:59, António MARTINS-Tuválkin via Unicode wrote:
> ... this fellow has a
> weekly load of a few hundred HTML files created by several people of
> varying skill levels: ... the resulting pages online present some
> times wonky results, due to uneven inclusion of “high byte” characters.
>
> I have been saying in the past couple decades that problems will vanish
> if all files include only “ASCII characters”, by means of NCR escape
> sequences, but some of the aforementioned individual editors seem unable
> to ensure it, so a wholesale “conversion” is the intermediate step that
> needs to be added to the workflow, before uploading.
>
> And my question is: Which tool to use?...
This looks like one of those tasks that takes a sophisticated tool to
do right, but that a really simple solution can handle if you are
lucky.
The right way to do it, it seems to me, is to use a tool that can parse
the HTML syntax, then traverse all the elements and look at their text
content. For each character in the text content, if it is above U+007F,
express it as a numeric character reference (NCR); for example, ó
(U+00F3) becomes &#243;. You may need to convert attribute values as
well. The details might well depend on the exact contents of the HTML
files. One tool that might get you part of the way is the browser API
DOMParser
<https://developer.mozilla.org/en-US/docs/Web/API/DOMParser>.
If you are lucky, the only place where characters above U+007F occur
will be in the text content of the HTML file. Then you can blindly go
through every character of the file and convert all characters above
U+007F to NCRs. The iconv tool from GNU libiconv
<https://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.17/iconv.1.html>
can do this. You convert to ASCII, and use the --unicode-subst option
to give the format for NCRs. For example:
% echo 'António' | iconv -f UTF-8 --unicode-subst="&#%u;" -t ASCII
Ant&#243;nio
That leaves you with two problems: getting iconv, a Unix-origin tool,
onto your Windows machine, and detecting when you were not lucky and
converted something that was not text content.
Does this help? Best regards,
—Jim DeLaHunt
--
. --Jim DeLaHunt, jdlh at jdlh.com http://blog.jdlh.com/ (http://jdlh.com/)
multilingual websites consultant, Vancouver, B.C., Canada
More information about the Unicode
mailing list