Batch conversion from "normal" Unicode text to NCRs
Giacomo Catenazzi
cate at cateee.net
Tue Nov 12 03:15:57 CST 2024
Hello,
On 2024-11-12 4:59, António MARTINS-Tuválkin via Unicode wrote:
>
> I have been saying in the past couple decades that problems will vanish
> if all files include only “ASCII characters”, by means of NCR escape
> sequences, but some of the aforementioned individual editors seem unable
> to ensure it, so a wholesale “conversion” is the intermediate step that
> needs to be added to the workflow, before uploading.
I'm not sure NCRs are the best way to go (nor were they decades ago): an
NCR is just a numeric representation, not a semantic one (unlike named
HTML entities such as &eacute;), so finding problems may become difficult.
In my opinion, you should remove all NCRs; otherwise it will be a
nightmare to check for wrong encodings (maybe some of the NCRs were in
Latin-1 and some in Unicode, and the problem we often have is double
encoding). NCRs also make spell checking difficult. Plain characters are
also simpler to handle for people with all levels of experience. (UTF-8
can now be used with all tools.)
So I would try to convert the text to UTF-8 without NCRs (the web now
defaults to UTF-8).
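As a rough illustration of that conversion, here is a minimal Python
sketch. The function name decode_ncrs is made up for this example; only
html.unescape and re come from the standard library. It replaces decimal
and hexadecimal NCRs with the characters they name, but leaves alone any
reference that would produce markup-significant characters (such as
&#60; for "<"), which is an assumption about what a site would want.

```python
import html
import re

# Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric character references.
NCR_PATTERN = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")

# Characters that must stay escaped so the HTML markup remains valid.
KEEP_ESCAPED = set("<>&\"'")


def decode_ncrs(text: str) -> str:
    """Replace NCRs with real characters; named entities are not touched."""

    def replace(match: re.Match) -> str:
        char = html.unescape(match.group(0))
        # Keep the original reference if decoding it would break the markup.
        return match.group(0) if char in KEEP_ESCAPED else char

    return NCR_PATTERN.sub(replace, text)


if __name__ == "__main__":
    print(decode_ncrs("Ant&#243;nio wrote caf&#xE9; &#38; more"))
```

Note that html.unescape follows the HTML5 parsing rules, so references
in the range 128-159 are read as Windows-1252 characters (for example,
&#150; becomes an en dash), which may or may not match what the original
editors intended.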
Then you can check whether (and where) the files contain badly encoded
text (a conversion to another UTF encoding should give warnings; just
discard the output and check the warnings).
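The same kind of check can be sketched in Python instead of a
command-line converter: decode the bytes strictly as UTF-8, throw the
result away, and keep only the error positions. The names
find_invalid_utf8 and check_tree, and the "*.html" pattern, are
hypothetical choices for this example.

```python
from pathlib import Path


def find_invalid_utf8(data: bytes) -> list:
    """Return (offset, offending bytes) for each invalid UTF-8 sequence."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding; the decoded text itself is discarded.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start, end = pos + err.start, pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning after the bad bytes
    return problems


def check_tree(root: str, pattern: str = "*.html") -> None:
    """Print file name, byte offset and bad bytes for every problem found."""
    for path in sorted(Path(root).rglob(pattern)):
        for offset, bad in find_invalid_utf8(path.read_bytes()):
            print(f"{path}: byte {offset}: {bad!r}")
```

This only finds bytes that are not valid UTF-8 (typically leftover
Latin-1). Double-encoded text is valid UTF-8, so it passes this check
and has to be found by its visible patterns instead.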
In my experience, a single site has common patterns, so the NCRs and
"bad characters" are limited to a few types, and you can use text
substitution (sed on Linux and macOS, and I think various console tools
on Windows support it) or other, more user-friendly tools (see below). I
find it easy and quick. It is not a general solution, but as I said, a
site often has common patterns, not many languages, etc. So I usually go
for quick and dirty, which in this case is better than a perfect
solution that can handle all characters.
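A quick-and-dirty substitution of this kind might look like the
following Python sketch (the equivalent of a handful of sed
expressions). Everything here is an assumption for illustration: the
list of expected characters would come from looking at the actual site,
and the table only covers the double-encoding case where UTF-8 bytes
were read as Latin-1 and saved again.

```python
# Hypothetical: the accented letters this particular site is expected to use.
EXPECTED_CHARS = "áéíóúâêôãõàç"

# Map each double-encoded form (UTF-8 bytes misread as Latin-1) to the
# intended character, e.g. the two-character sequence "Ã©" maps to "é".
REPLACEMENTS = {
    ch.encode("utf-8").decode("latin-1"): ch for ch in EXPECTED_CHARS
}


def fix_known_patterns(text: str, table: dict = REPLACEMENTS) -> str:
    """Apply each known bad-pattern substitution, like a chain of sed rules."""
    for bad, good in table.items():
        text = text.replace(bad, good)
    return text
```

Because the table is limited to patterns actually seen on the site, it
will not repair everything, but it also will not touch text it does not
recognise.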
You may want to consider programmers' or developers' tools: usually
search and replace can be done across a tree of directories, with visual
confirmation (e.g. by jumping to the right file). They may also offer
batch encoding conversion, so they are possibly the best tools for such
a task, even if we do not program.
giacomo