Is the Unicode Standard "The foundation for all modern software and communications around the world"?

Richard Wordingham via Unicode unicode at unicode.org
Wed Nov 20 16:48:52 CST 2019


On Tue, 19 Nov 2019 20:02:55 +0000
James Kass via Unicode <unicode at unicode.org> wrote:

> On 2019-11-19 6:59 PM, Costello, Roger L. via Unicode wrote:
> > Today I received an email from the Unicode organization. The email
> > said this: (italics and yellow highlighting are mine)
> >
> > The Unicode Standard is the foundation for all modern software and
> > communications around the world, including all modern operating
> > systems, browsers, laptops, and smart phones - plus the Internet and
> > Web (URLs, HTML, XML, CSS, JSON, etc.).
> >
> > That is a remarkable statement! But is it entirely true? Isn't it
> > assuming that everything is text? What about binary information
> > such as JPEG, GIF, MPEG, WAV; those are pretty core items to the
> > Web, right? The Unicode Standard is silent about them, right? Isn't
> > the above quote a bit misleading? 
> A bit, perhaps.  But think of it as a press release.
> 
> The statement smacks of hyperbole at first blush, but "foundation"
> can mean basis or starting point.  File names (and URLs) of *.WAV,
> *.MPG, etc. are stored and exchanged via Unicode.  Likewise, the tags 
> (metadata) for audio/video files are stored (and displayed) via 
> Unicode.  So fields such as Title, Artist, Comments/Notes, Release
> Date, Label, Composer, and so forth aren't limited to ASCII data.

But file names, URLs and syntax tags are still mostly in ASCII.  It's
only when you come to text data that you get to Unicode; the usual
unreliable assumption is that the recipient has the means to display
that text.  Now, a feature of a *modern* system is that file names and
(sometimes) syntax tags can be in Unicode.  But have the nightmares
of file names and canonical equivalence come to an end?  And remember
that canonical equivalence isn't just a matter of precomposed letters.
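
To make the file name point concrete, here is a minimal Python sketch.
The file names and the helper canonically_equivalent are made up for
illustration; only unicodedata.normalize is a real library call.  The
first pair differs by a precomposed letter, the second only by the
order of two combining marks on a base letter that has no precomposed
form at all.

```python
import unicodedata

def canonically_equivalent(a, b):
    # Hypothetical helper: two strings are canonically equivalent
    # when their NFD normalizations are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Precomposed e-acute versus e followed by COMBINING ACUTE ACCENT.
precomposed = "caf\u00e9.txt"
decomposed = "cafe\u0301.txt"

# No precomposed letter involved: q with dot above (U+0307) and
# dot below (U+0323), typed in the two possible orders.
dot_above_first = "q\u0307\u0323"
dot_below_first = "q\u0323\u0307"

for a, b in [(precomposed, decomposed), (dot_above_first, dot_below_first)]:
    # Each pair differs as a code point sequence but is canonically equivalent.
    print(a == b, canonically_equivalent(a, b))
```

A file system or tool that compares names code point by code point
sees two different names in each pair; one that normalizes first sees
one name.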

Moving away from communications, I still find that if I use 'sort -u' to
eliminate repeated lines from unsorted text, I have to ensure that the
comparison uses binary identity - too many collations still treat
unknown characters as identical.  And this is on a distribution that
has UTF-8 as its basic encoding.
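
The effect can be sketched in Python.  This is a toy model, not how any
particular locale works: toy_collation_key is a hypothetical collation
that simply gives no weight to characters outside its table, and the
sample lines (two Tai Tham strings plus a repeated ASCII one) are
invented for the example.

```python
def toy_collation_key(line, known):
    # Hypothetical collation: characters absent from the table
    # contribute nothing to the key, so they compare as identical.
    return tuple(ord(ch) for ch in line if ch in known)

def unique_by_key(lines, key):
    # Keep the first line seen for each collation key, like a
    # de-duplicating sort that trusts the collation.
    seen = {}
    for line in lines:
        seen.setdefault(key(line), line)
    return sorted(seen.values())

def unique_binary(lines):
    # De-duplicate by exact code point identity.
    return sorted(set(lines))

known = set("abcdefghijklmnopqrstuvwxyz")
lines = ["\u1a20\u1a63", "\u1a34\u1a63", "abc", "abc"]

collated = unique_by_key(lines, lambda s: toy_collation_key(s, known))
binary = unique_binary(lines)
print(len(collated), len(binary))
```

Under the toy collation the two distinct Tai Tham lines collapse into
one, while binary comparison keeps both.  With sort itself, one common
way to get binary comparison is to run it with LC_ALL=C.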

There's now a looming threat to passwords in truly complex scripts.
Keyboards are coming that will prevent certain sequences of characters
from being typed - Thais have long faced such constraints.  Some people
may discover that an upgrade of their keyboards renders them unable to
type their passwords!

Richard.


