Batch conversion from "normal" Unicode text to NCRs

Jim DeLaHunt list+unicode at jdlh.com
Tue Nov 12 00:13:03 CST 2024


Hello, António:

Interesting question.

On 2024-11-11 19:59, António MARTINS-Tuválkin via Unicode wrote:
> ... this fellow has a
> weekly load of a few hundred HTML files created by several people of
> varying skill levels: ... the resulting pages online present some
> times wonky results, due to uneven inclusion of “high byte” characters.
>
> I have been saying in the past couple decades that problems will vanish
> if all files include only “ASCII characters”, by means of NCR escape
> sequences, but some of the aforementioned individual editors seem unable
> to ensure it, so a wholesale “conversion” is the intermediate step that
> needs to be added to the workflow, before uploading.
>
> And my question is: Which tool to use?...

This looks to be one of those tasks which takes a sophisticated tool to 
do it right, but could be handled by a really simple solution if you are 
lucky.

The right way to do it, it seems to me, is to use a tool which can parse
the HTML syntax, then traverse all elements and look at their text
contents. For each character in the text content, if it is above U+007F,
express it as a numeric character reference (NCR). You may need to
convert attribute values as well, while leaving alone places where NCRs
are not interpreted, such as the contents of <script> and <style>
elements. The details might well depend on the exact contents of the
HTML files. One tool that might get you part of the way is DOMParser
<https://developer.mozilla.org/en-US/docs/Web/API/DOMParser>, a
JavaScript API available in web browsers.
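To make the parser-based idea concrete, here is a minimal sketch in
Python. It uses the standard library's html.parser module rather than
DOMParser, and the names to_ncr, NcrRewriter and convert_html are
hypothetical, made up for this illustration. It assumes reasonably
well-formed HTML and ignores some constructs (CDATA sections, for
example), so treat it as a starting point, not a finished tool.

```python
from html.parser import HTMLParser


def to_ncr(text: str) -> str:
    """Replace every character above U+007F with a decimal NCR."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


class NcrRewriter(HTMLParser):
    """Re-emit the document, converting text and tags but not raw-text elements."""

    RAW_TEXT = {"script", "style"}  # NCRs are not interpreted inside these

    def __init__(self):
        # convert_charrefs=False keeps existing references such as &amp; intact
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_raw = False

    def handle_starttag(self, tag, attrs):
        # Reuse the original tag text; this also converts attribute values
        self.out.append(to_ncr(self.get_starttag_text()))
        if tag in self.RAW_TEXT:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(to_ncr(self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in self.RAW_TEXT:
            self.in_raw = False

    def handle_data(self, data):
        self.out.append(data if self.in_raw else to_ncr(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{to_ncr(data)}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def convert_html(source: str) -> str:
    parser = NcrRewriter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


print(convert_html('<p title="São">António</p>'))
# prints: <p title="S&#227;o">Ant&#243;nio</p>
```

Any non-ASCII characters inside <script> or <style> are passed through
unchanged here, so the output can still contain them; those cases would
need an escape syntax appropriate to JavaScript or CSS instead.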

If you are lucky, then the only place where characters above U+007F
occur will be in the text content of the HTML file. Then you can blindly
go through every character of the file, and convert all characters above
U+007F to NCRs. The iconv tool
<https://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.17/iconv.1.html>
from GNU libiconv can do this. You convert to ASCII, and use the
--unicode-subst argument to give the format for NCRs. For example:

% echo 'António' | iconv -f UTF-8 --unicode-subst="&#%u;" -t ASCII
Ant&#243;nio

This leaves you with the problem of getting iconv, a Unix-origin tool, 
onto your Windows machine, and the problem of detecting when you did not 
get lucky and you converted something which was not text content.
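One possible way around both problems, sketched here in Python on the
assumption that Python is easier to install on Windows than iconv: the
built-in "xmlcharrefreplace" error handler does the same blind
conversion, and a small check with the standard html.parser module can
flag files where non-ASCII characters sit inside <script> or <style>,
where NCRs would not be interpreted. The names blind_to_ncr,
RawTextChecker, risky_spots and convert_folder are hypothetical, and the
sketch assumes the input files are UTF-8 and named *.html.

```python
from html.parser import HTMLParser
from pathlib import Path


def blind_to_ncr(text: str) -> bytes:
    """Encode to ASCII, replacing each non-ASCII character with a decimal NCR."""
    return text.encode("ascii", "xmlcharrefreplace")


class RawTextChecker(HTMLParser):
    """Note non-ASCII characters inside <script> or <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.current = None   # name of the raw-text element we are inside, if any
        self.findings = []    # (line number, element name) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and not data.isascii():
            # getpos() gives the line where this chunk of text starts
            line, _column = self.getpos()
            self.findings.append((line, self.current))


def risky_spots(source: str) -> list:
    checker = RawTextChecker()
    checker.feed(source)
    checker.close()
    return checker.findings


def convert_folder(folder: str) -> list:
    """Convert *.html files in place; return the paths that were skipped."""
    skipped = []
    for path in sorted(Path(folder).glob("*.html")):
        source = path.read_bytes().decode("utf-8")   # assumption: input is UTF-8
        if risky_spots(source):
            skipped.append(path)   # blind conversion is not safe for this file
        else:
            path.write_bytes(blind_to_ncr(source))
    return skipped


print(blind_to_ncr("António"))
# prints: b'Ant&#243;nio'
```

Files returned by convert_folder are left untouched, so they can be
fixed by hand or sent through a parser-based conversion instead.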

Does this help? Best regards,
      —Jim DeLaHunt


-- 
.   --Jim DeLaHunt, jdlh at jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
       multilingual websites consultant, Vancouver, B.C., Canada


