Why do webforms often refuse non-ASCII characters?
Bríd-Áine Parnell
bridaine.parnell at ed.ac.uk
Thu Jan 30 04:21:36 CST 2025
Hi all,
Thanks for the replies! This is what I figured. With the airlines, maybe it also has to match the machine-readable zone of the passport that runs along the bottom, which never contains special characters, even if the name above it has fadas or umlauts or whatever.
So, a follow-on question: surely this makes it difficult when moving between systems and when trying to identify people? What I mean is, if Seán is spelled "Seán" on some systems and "Sean" on others, wouldn't some attempts to summon up his records fail? And for financial systems, especially with open banking attempting to make data sharing easier, would it be difficult to perform know-your-client checks or collect records for credit scoring if newer banks/digital banks are using Unicode but older systems aren't?
I also wonder if people start using natural language systems, e.g. natural language to SQL, to query databases they might run into issues with how names are recorded...
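To make the matching worry concrete, here is a minimal Python sketch (made-up strings, not anything from a real banking or airline system) showing why an exact lookup misses and what a crude accent-folding workaround looks like:

    import unicodedata

    def fold_accents(name: str) -> str:
        # Crude workaround some systems apply: decompose, then drop combining marks.
        decomposed = unicodedata.normalize("NFD", name)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print("Seán" == "Sean")                # False: an exact match misses the record
    print(fold_accents("Seán") == "Sean")  # True: only matches after folding the fada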
Bríd-Áine Parnell
Doctoral Researcher | Designing Responsible Natural Language Processing
School of Informatics | Edinburgh Futures Institute
________________________________
From: Phil Smith III <lists at akphs.com>
Sent: 29 January 2025 6:24 PM
To: 'Alexander Lange' <alexander.lange at catrinity-font.de>; unicode at corp.unicode.org <unicode at corp.unicode.org>; Bríd-Áine Parnell <bridaine.parnell at ed.ac.uk>
Subject: RE: Why do webforms often refuse non-ASCII characters?
Airlines are a perfect example of #1. Many/most airlines—and at least some parts of the shared ecosystem—run their scheduling systems on IBM z/TPF, which is a high-performing transactional system on IBM mainframes; it was originally called ACP, for Airline Control Program. See https://en.wikipedia.org/wiki/Transaction_Processing_Facility
Key here is that IBM mainframes (and TPF) use EBCDIC for encoding. Now, EBCDIC has a rich set of code pages—but modulo the much-hated and rarely used DBCS (Double-Byte Character Set, which uses shift-in/shift-out characters to go into and out of double-byte mode), an EBCDIC code page is 256 characters. So you have a byte; suppose it’s x'43' (that is, 0x43). If you’re “in” code page 1047 (a common U.S. code page), that’s a lowercase a-with-umlaut (ä). If you’re in code page 829 (math symbols), it’s a capital A-with-ring (Å). So it’s context-dependent.
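To see the context-dependence concretely, here is a minimal Python sketch; cp037 (US/Canada EBCDIC) is just a stand-in code page that ships with Python, not necessarily what any TPF installation actually uses:

    # The same byte means different things depending on which code page you assume.
    byte = b"\x43"

    print(byte.decode("cp037"))  # 'ä' -- lowercase a-with-umlaut in this EBCDIC code page
    print(byte.decode("ascii"))  # 'C' -- the very same byte read as ASCII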
This is a bit of a mess, needless to say. Back in 1964 when the IBM 360 was being developed, the plan was to support both EBCDIC (because of older systems) and ASCII. There was even a hardware bit in the program status word (the location counter, among other things) that said “Hey, we’re in ASCII mode!” However, due to the rushed nature of the project, the ASCII parts got left behind and were never resurrected. I have on my list for when I finish my time machine to fix that (along with “no null-terminated strings”, “no case sensitivity in UNIX filenames”, “forward slashes in DOS and Windows paths”, and for God’s sake, “consistent line endings across operating systems”!!)
The point here is that, as others have noted, you just cannot assume Unicode support across the ecosystem. Worse, you can’t even assume that a given set of characters can coexist: if someone has a first name containing Cyrillic characters and a Greek last name, you simply *cannot* represent that in EBCDIC without metadata indicating that the two names use different code pages (which I’ve never seen anyone actually do, given the rarity of such use cases). Technically, if you were to require that support, you’d want code pages *per character*, since someone could have a made-up name that includes characters from two disparate EBCDIC code pages.
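A minimal sketch of that failure mode, again with cp037 standing in for a real EBCDIC code page and a made-up mixed-script name:

    # A name mixing Cyrillic and Greek cannot fit into one 256-character code page.
    name = "Дмитрий Παπαδόπουλος"  # invented example: Cyrillic first name + Greek surname

    try:
        name.encode("cp037")
    except UnicodeEncodeError as err:
        print(err)  # neither script is in this single-byte EBCDIC code page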
Thus the limitations on characters will remain for the foreseeable future. There’s clearly a Western-centric slant here, but that’s historical. I assume that part of being an Asiana Airlines gate agent or FA includes a requirement to be able to at least fumble your way through reading basic ASCII names. Consider the inverse: a Korean name written in Korean glyphs would completely stump the average American Airlines employee. That’s not a justification, just a description of how it is.
From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Alexander Lange via Unicode
Sent: Wednesday, January 29, 2025 12:26 PM
To: unicode at corp.unicode.org
Subject: Re: Why do webforms often refuse non-ASCII characters?
Hi,
I can see three reasons for this:
1. As you say, modern databases can handle this. But not all databases currently in use are modern. State agencies, banks and some other large corporations in particular are often still using pretty old systems, or need to stay compatible with someone else's old system. A famous example is airline tickets: they all run through one quirky old system that can't deal with anything but ASCII letters, forcing many people to misspell their names when booking a flight.
2. The second reason is what I would call "lazy validation". You have to make sure your system isn't vulnerable to query injection, code injection, and perhaps spoofing, i.e. you have to forbid some characters that have a special function in whatever query language, programming language(s) and markup language(s) you use: e.g. ' for SQL, < and " for HTML, and so on. If you forget any of these, you have a huge security problem. So the easiest and safest way* to ensure this is to whitelist just the characters that you know to be safe. You can do that correctly based on Unicode's character properties, but I've also seen people use far too simple regular expressions like /[A-Za-z0-9]+/, which cause the problem you described (see the sketch after this list).
* Apart from using libraries that already solve the problem properly, of course. But surprisingly many people keep re-implementing existing things for some reason.
3. Inconvenience for your own staff: even if a system can handle "special" characters, they may still be a hassle to work with. I once visited Japan with a friend whose name was Jürgen, and when staff typed our names into their system, it took four people ten minutes of discussion to work out how to enter the ü. Checking that things like names are correct and match across different documents is also much harder if people can't easily read them.
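Here is the sketch promised under point 2; the ASCII-only pattern is a hypothetical example of the over-simple whitelist described there, not any particular site's code:

    import re

    name = "Seán"

    # Over-simple ASCII-only whitelist: rejects perfectly ordinary names.
    print(bool(re.fullmatch(r"[A-Za-z0-9]+", name)))  # False

    # Unicode-aware letter check: [^\W\d_] matches any Unicode letter in Python's re.
    print(bool(re.fullmatch(r"[^\W\d_]+", name)))     # True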
Kind regards,
Alexander
On 29.01.2025 16:39, Bríd-Áine Parnell via Unicode wrote:
Hi everyone,
I'm hoping someone can help me out with some information. I'm doing some research into the refusal of accents in names (and other multicultural naming conventions) in online webforms. For example, in Ireland there was a recent campaign to get the government to mandate acceptance of the fada in Irish-language names (Seán instead of Sean). The campaign was successful and the law changed in 2022, but it only applies to public bodies; companies do not have to comply.
During the campaign, reports concerning the right to rectification were made to the Data Protection Commissioner about some of the companies, including Bank of Ireland and Aer Lingus. They defended themselves by saying that their systems couldn't accept fadas in names.
I'm assuming that it's the back-end systems, such as databases, that can't accept the so-called special characters. My question is: why would this be, given that Unicode would seem to solve the problem and modern databases can use Unicode? Does anyone understand what the value is in continuing to retain legacy systems that only accept ASCII or some ISO variants? Or is there a different problem happening?
Appreciate any information that might shed light on this.
Thanks,
Bríd-Áine Parnell
Doctoral Researcher | Designing Responsible Natural Language Processing
School of Informatics | Edinburgh Futures Institute