a mug

Philippe Verdy verdy_p at wanadoo.fr
Mon Jul 13 05:53:25 CDT 2015

2015-07-13 11:15 GMT+02:00 Marcel Schneider <charupdate at orange.fr>:

> It's roughly the same problem with the CSS and UTF-8 malfunctioning that
> is laughed at with the other merchandising items brought in by Umesh:
> http://www.zazzle.com/cheap_css_is_awesome_mug-168565401817501350
> http://www.zazzle.com/css_is_awesome_with_java-script_mug-168685521846695550
> and Karl Williamson <public at khwilliamson.com> (On Sat, Jul 11, 2015,
> 19:42):
> http://i1.cpcache.com/product/27297813/utf8_value_tshirt.jpg
> Personally the only time CSS was awesome to me is when I'd written bad
> code. In truth, CSS is very smart and allows browsers to adapt the box
> width to the content, if not hindered in doing so by some fixed-width. We
> can write bad code in any language, but then we should rather laugh at our
> own incapacity.
> Idem with charsets. The only time I saw UTF-8 like on the T-shirt, was
> when opening UTF-8 files that didn't specify charset=UTF-8. The thing to do
> was to add the charset in the file header.
Or simply add a leading BOM. All browsers will autodetect it. This only
concerns HTML files (on a local filesystem).

BOMs are not recommended for UTF-8 encoded JavaScript files: if your local
HTML file references a local JavaScript file, it can specify the expected
file type in addition to the local URL of the script file itself: this is
an HTML attribute to add to the HTML "script" element. If your page needs to
perform JSON requests, the JSON is normally served by a webserver that will
deliver the MIME type and charset in metadata. Some JSON parsers can also
be set to autodetect the BOM and then discard it from the visible content.

That's just the first 3 bytes to check in the input stream before sending
the stream data to the parser which can then be instantiated and
initialized directly with the correct charset.
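
As a minimal sketch of that check (Python, standard library only): the
function name sniff_utf8_bom and the fallback charset are assumptions made
for illustration, not taken from any particular parser, and the stream is
assumed to be a seekable binary stream.

```python
import codecs
import io

def sniff_utf8_bom(stream, fallback="windows-1252"):
    """Return (charset, stream), with the stream positioned past any UTF-8 BOM.

    `fallback` is a hypothetical default used when no BOM is found.
    """
    start = stream.tell()
    head = stream.read(3)            # a UTF-8 BOM is exactly EF BB BF
    if head == codecs.BOM_UTF8:
        return "utf-8", stream       # BOM consumed, parser never sees it
    stream.seek(start)               # no BOM: rewind so no content is lost
    return fallback, stream

# Usage: decide the charset first, then hand the stream to the parser.
charset, body = sniff_utf8_bom(io.BytesIO(codecs.BOM_UTF8 + b"<html></html>"))
text = body.read().decode(charset)
```

When the whole content is already in memory, Python's "utf-8-sig" codec
does the same job on decoding: it drops a leading BOM if one is present.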

For pages served by webservers, you add it in the metadata of your shared
folder to associate some files with MIME types. This can even be a global
setting of the server if all your pages and scripts are UTF-8 encoded, or
this can be set on the main folder and changed for specific folders for
files that should not be sent with the UTF-8 MIME metadata but with another

Or you can add the autodetection feature in Apache which will autodetect
the BOM in the file, then serve the UTF-8 file without this leading BOM but
with the corrected filesize and the correct MIME type with its charset

It is more complicated for files hosted on FTP as there's no MIME metadata:
for that the BOM is still the easiest option (but it will be up to the FTP
client to perform the autodetection). Autodetecting a BOM is much more
efficient than autodetecting an HTML meta tag in the header (this requires
aborting the current parsing in the middle and restarting it, which uses
more memory that will need to be garbage collected, and requires some
milliseconds and more CPU resources, as HTML parsers are very costly in
terms of CPU processing).

If you place the charset in a meta tag of the HTML page, make sure that
this tag is near the beginning of the HTML header (it should be fully within
the first 4KB, and even before the mandatory <title> element). In my
opinion this meta tag should be the first child element of the <head>
element, which is itself the first element of the <html> element that
immediately follows the optional HTML doctype declaration. If your page is
XHTML, you should use the leading XML declaration line to put that charset
indication. Putting the indication in the first 4KB allows some charset
guessers to identify the charset faster without actually starting to
instantiate a parser and aborting it in the middle. 4KB is typically the
size of a single memory page, so that page will remain in CPU/bus caches
without using paging I/O. The CPU cost will be minimal if the charset can be
autodetected very early, in a few nanoseconds, by just scanning the content
of a single memory page. 4KB is more than large enough so that any placement
of the autodetected signatures will succeed without having to wait for long.
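
A rough sketch of such a guesser, limited to the 4KB window discussed
above (Python, standard library only): the names SNIFF_LIMIT and
sniff_declared_charset and the two regular expressions are assumptions
for illustration. This is not the prescan algorithm of any real browser,
and it does not skip comments or scripts.

```python
import re

SNIFF_LIMIT = 4096  # hypothetical window size: one 4KB memory page

# <meta charset="x">, or the "charset=x" part of the older http-equiv form.
_META_RE = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE)
# <?xml version="1.0" encoding="x"?> at the very start (XHTML case).
_XML_RE = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9_.:-]+)[\"']", re.IGNORECASE)

def sniff_declared_charset(data, limit=SNIFF_LIMIT):
    """Return the declared charset (lowercased) found in the first
    `limit` bytes of `data`, or None if nothing is declared there."""
    head = data[:limit]
    if head.startswith(b"\xef\xbb\xbf"):   # a UTF-8 BOM wins immediately
        return "utf-8"
    match = _XML_RE.search(head) or _META_RE.search(head)
    return match.group(1).decode("ascii").lower() if match else None

# Usage: a declaration placed early is found without running any parser.
page = b'<!DOCTYPE html><html><head><meta charset="UTF-8"><title>t</title>'
found = sniff_declared_charset(page)
```

A declaration placed after the window is simply not seen by this sketch,
which is the reason to keep the tag at the very top of the <head>.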

Actually I even think that the tag should be in the first 1400 bytes (to
match the maximum size of a single TCP packet with the smallest MTU): it
will minimize the networking I/O delays. Aborting a parser and restarting it
has a significant processing time that could delay even more the processing
of the next TCP packet, which could then be paged out by the OS if there
are concurrent networking streams used by concurrent processes, such as
large file downloads or an active streamed video.

I just wonder why HTML5 did not deprecate the old meta tag of HTML4 in
favor of an attribute directly in the <html> root element, or even in its
recommended DOCTYPE declaration. But if you use the abbreviated HTML5
doctype line, its default should be UTF-8 and no indication is necessary
(charset guessers should not be used with HTML5, except in case of parsing
failure, as a possible recovery solution, in which case the meta tag
may be processed. If there's no parsing error for the main document,
excluding all other referenced documents such as scripts or inner frames,
the meta tag should better be ignored even if it's present and specifies
something else).

Maybe in some future, there will be an HTML6 that enforces the use of a
single charset and possibly a more compact encoding. We've seen similar
radical changes, including for core protocols such as HTTP(S) itself. This
could become a single unified protocol mixing this new generation HTTP and
HTML capabilities, but with more capabilities such as dynamic parallel
streams, encryption, authentication, simplified and more efficient data
signature, real-time constraints and QoS management of streams for web
applications, and a more efficient support for encapsulated binary data
(notably audio/video/images, or even nearly native executable scripts,
precompiled by the server for the target client when its processing
capabilities are constrained, notably smartphones to save energy in their
battery). That future of HTML will focus much more on its API; the
effective encoding may be autoadapted or negotiated and cached (given that
we need security now everywhere on the web, negotiation protocols are
already used: this is for now just for authenticating and exchanging
encryption pairs, but it could negotiate in the same roundtrip some
presentation formats such as the MIME type and charset encoding,
compression levels, and binary compatibility of the clients for receiving
precompiled executable contents, or for sharing tasks and CPU/GPU resources
or local/remote storage, or synchronization of cached data).


We'll rapidly need in the future a true "network-centered OS" where
applications can run on one or more devices in parallel, owned by the
client or by the service provider, and allowing on-demand allocation and
sharing of processing resources available locally or remotely. On that OS,
there will no longer be the concept of a host (or it will just be a virtual
delocalized host), the concept of "local" may be replaced by the concept of
personal user environment which will autoadapt to the capabilities of
devices around the user and the available networking bandwidths.

At that time, this virtualized OS will certainly be 128-bit (and not 64-bit
as of today), and it will manage many terabytes of virtual memory,
including the environment of other users located anywhere. Clients and
servers will dynamically share resources with that network or request them
from it, and the
core element of this OS will be to manage caches, automatic
synchronization, and bandwidth allocations, and nobody will know "where"
the code is actually running physically. All devices will then exchange
indifferently code or data, or will perform computing tasks delegated to
them by other members in the network (including transformation codecs). The
network OS will provide the necessary isolation for security and the
architecture will be more peer-to-peer, working in a collaborative grid
computing architecture. It will also be failure-resistant, with implicit