Why do webforms often refuse non-ASCII characters?

Wed Jan 29 11:26:13 CST 2025

Hi,

I can see three reasons for this:

1. As you say, modern databases can handle this. But not all databases
currently in use are modern. State agencies, banks and some other large
corporations in particular often still run pretty old systems, or need
to stay compatible with someone else's old system. A famous example is
airline tickets: They all run through one quirky old system that
can't deal with anything but ASCII letters, forcing many people to
misspell their names when booking a flight.

2. The second reason is what I would call "lazy validation". You have to
make sure your system isn't vulnerable to query injection, code
injection, and perhaps spoofing, i.e. you have to forbid some characters
that have a special function in whatever query language, programming
language(s) and markup language(s) you use, for example ' for SQL, < and "
for HTML, and so on. If you forget any of these, you have a huge security
problem. So the easiest and safest way* to ensure this is to whitelist
just the characters that you know to be safe. While you can do that
correctly based on Unicode's character properties, I've also seen people
use overly simple regular expressions like /[A-Za-z0-9]+/, which cause
the problem you described.

* Apart from using libraries that already solve the problem properly, of 
course. But surprisingly many people keep re-implementing existing 
things for some reason.

3. Inconvenience for their own staff: Even if a system can handle "special"
characters, they may still be a hassle to work with. I once visited
Japan with a friend whose name was Jürgen, and when the staff typed our
names into their system, it took four people ten minutes of discussion
to work out how to enter the ü. Checking that things like names are
correct and match across different documents is also much harder if
people can't easily read them.
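
To make point 2 a bit more concrete, here is a minimal sketch in Python.
It contrasts the ASCII-only regular expression with a whitelist based on
Unicode general categories, and then shows the "use the existing
libraries" route from the footnote. The function names, the set of
allowed separators and the table layout are my own assumptions for
illustration, not a recommended policy; real names contain more than
this accepts.

```python
import html
import re
import sqlite3
import unicodedata

# The overly simple whitelist: rejects Seán, Jürgen, and so on.
ASCII_ONLY = re.compile(r"[A-Za-z0-9]+")

# Assumed policy for this sketch: letters (L*) and combining marks (M*)
# from any script, plus a few separators that commonly occur in names.
ALLOWED_CATEGORY_PREFIXES = ("L", "M")
ALLOWED_SEPARATORS = {" ", "-", "'", "\u2019"}


def is_plausible_name(text: str) -> bool:
    """Whitelist check based on Unicode character properties."""
    text = unicodedata.normalize("NFC", text)
    if not text.strip():
        return False
    return all(
        ch in ALLOWED_SEPARATORS
        or unicodedata.category(ch).startswith(ALLOWED_CATEGORY_PREFIXES)
        for ch in text
    )


def store_name(conn: sqlite3.Connection, name: str) -> None:
    """Insert via a parameterized query, so ' in a name is just data."""
    conn.execute("INSERT INTO people (name) VALUES (?)", (name,))


def render_name(name: str) -> str:
    """Escape at output time, so < and quotes are harmless in HTML."""
    return html.escape(name)
```

Note that the whitelist above deliberately lets the apostrophe through
(O'Brien), so it is not what protects the database. That job is done by
the parameterized query and by escaping on output, which is why the
input check can afford to be generous about letters.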

Kind regards,
Alexander

On 29.01.2025 16:39, Bríd-Áine Parnell via Unicode wrote:
> Hi everyone,
>
> I'm hoping someone can help me out with some information. I'm doing 
> some research into the refusal of accents in names (and other 
> multicultural naming conventions) in online webforms. For example, in 
> Ireland, there was a campaign recently to get the government to 
> mandate acceptance of the fada in Irish language names (Seán instead 
> of Sean). The campaign was successful, and the law changed in 2022, 
> but it's only a requirement for public bodies, companies do not have 
> to comply.
>
> During the campaign, reports were made to the Data Protection 
> Commissioner on the right to rectify about some of the companies, 
> including Bank of Ireland and Aer Lingus. They defended themselves by 
> saying that their systems couldn't accept fadas in names.
>
> I'm assuming that it's systems on the back end, such as database 
> systems, that can't accept the so-called special characters. My 
> question is, why would this be, given that Unicode would seem to solve 
> this, and modern databases can use Unicode? Does anyone understand 
> what the value is in continuing to retain legacy systems that only 
> accept ASCII or some ISO variants? Or is there a different problem 
> happening?
>
> Appreciate any information that might shed light on this.
>
> Thanks,
>
> *Bríd-Áine Parnell*
>
> Doctoral Researcher | Designing Responsible Natural Language Processing
>
> School of Informatics | Edinburgh Futures Institute
> The University of Edinburgh is a charitable body, registered in 
> Scotland, with registration number SC005336. Is e buidheann 
> carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, 
> àireamh clàraidh SC005336. 