Why do webforms often refuse non-ASCII characters?
Alexander Lange
alexander.lange at catrinity-font.de
Wed Jan 29 11:26:13 CST 2025
Hi,
I can see three reasons for this:
1. As you say, modern databases can handle this. But not all databases
currently in use are modern. State agencies, banks and some other large
corporations in particular often still use pretty old systems, or need
to stay compatible with someone else's old system. A famous example is
airline tickets: They all run through one quirky old system that can't
deal with anything but ASCII letters, forcing many people to misspell
their names when booking a flight.
2. The second reason is what I would call "lazy validation". You have to
make sure your system isn't vulnerable to query injection, code
injection, and perhaps spoofing, i.e. you have to forbid some characters
that have a special function in whatever query language, programming
language(s) and markup language(s) you use, e.g. ' for SQL, < and "
for HTML and so on. If you forget any of these, you have a huge security
problem. So the easiest and safest way* to ensure this is to whitelist
just the characters that you know to be safe, and while you can do that
correctly based on Unicode's character properties, I've also seen people
using way too simple regular expressions like /[A-Za-z0-9]+/ which cause
the problem you described.
* Apart from using libraries that already solve the problem properly, of
course. But surprisingly many people keep re-implementing existing
things for some reason.
3. Inconvenience for their own staff: Even if a system can handle
"special" characters, they may still be a hassle to work with. I once
visited Japan with a friend whose name was Jürgen, and when the staff
typed our names into their system, it took four people ten minutes of
discussion to work out how to insert the ü. Also, checking whether
things like names are correct and match across different documents is
way harder if people can't easily read them.
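
To make point 2 concrete, here is a minimal Python sketch of the two approaches mentioned there: whitelisting by Unicode character properties instead of by an ASCII-only regular expression, and leaving query escaping to a library. It is an illustration under assumptions, not a recommended policy: the function names, the set of allowed punctuation and the `people` table are all hypothetical, and real naming rules are more varied than this check allows for.

```python
import re
import sqlite3
import unicodedata

# The overly simple pattern from point 2: ASCII letters and digits only.
ASCII_ONLY = re.compile(r"[A-Za-z0-9]+")

# Hypothetical policy: punctuation we choose to allow inside a name.
ALLOWED_PUNCTUATION = {" ", "-", "'", "\u2019", "."}


def is_acceptable_name(name: str) -> bool:
    """Accept letters and combining marks from any script, plus the
    explicitly allowed punctuation; reject everything else."""
    # NFC composes "u" + combining diaeresis into a single "ü" where possible.
    normalized = unicodedata.normalize("NFC", name)
    if not normalized.strip():
        return False
    for ch in normalized:
        # General categories starting with "L" are letters, "M" are marks.
        if unicodedata.category(ch)[0] in ("L", "M"):
            continue
        if ch in ALLOWED_PUNCTUATION:
            continue
        return False
    return True


def store_name(conn: sqlite3.Connection, name: str) -> None:
    """Insert the name with a parameterized query, so the database driver
    handles characters such as ' instead of our own validation code."""
    conn.execute("INSERT INTO people (name) VALUES (?)", (name,))
```

With this check, "Seán", "Jürgen" and "Bríd-Áine Parnell" pass, while input containing < or ; is rejected; the ASCII-only pattern would refuse all three names. The apostrophe is allowed here only because the insert is parameterized, which is the "use a library" route from the footnote to point 2.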
Kind regards,
Alexander
On 29.01.2025 16:39, Bríd-Áine Parnell via Unicode wrote:
> Hi everyone,
>
> I'm hoping someone can help me out with some information. I'm doing
> some research into the refusal of accents in names (and other
> multicultural naming conventions) in online webforms. For example, in
> Ireland, there was a campaign recently to get the government to
> mandate acceptance of the fada in Irish language names (Seán instead
> of Sean). The campaign was successful, and the law changed in 2022,
> but it's only a requirement for public bodies; companies do not have
> to comply.
>
> During the campaign, reports were made to the Data Protection
> Commissioner on the right to rectify about some of the companies,
> including Bank of Ireland and Aer Lingus. They defended themselves by
> saying that their systems couldn't accept fadas in names.
>
> I'm assuming that it's systems on the back end, such as database
> systems, that can't accept the so-called special characters. My
> question is, why would this be, given that Unicode would seem to solve
> this, and modern databases can use Unicode? Does anyone understand
> what the value is in continuing to retain legacy systems that only
> accept ASCII or some ISO variants? Or is there a different problem
> happening?
>
> Appreciate any information that might shed light on this.
>
> Thanks,
>
> *Bríd-Áine Parnell*
>
> Doctoral Researcher | Designing Responsible Natural Language Processing
>
> School of Informatics | Edinburgh Futures Institute
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336. Is e buidheann
> carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba,
> àireamh clàraidh SC005336.