Tag characters and in-line graphics (from Tag characters)

Philippe Verdy verdy_p at wanadoo.fr
Tue Jun 2 01:01:25 CDT 2015


2015-06-01 1:33 GMT+02:00 Chris <idou747 at gmail.com>:

>
> Of course, anyone can invent a character set. The difficult bit is having
> a standard way of combining custom character sets. That’s why a standard
> would be useful.
>
> And while stuff like this can, to some extent, be recognised by magic
> numbers, and unique strings in headers, such things are unreliable. Just
> because example.net/mycharset/ appears near the start of a document,
> doesn’t necessarily mean it was meant to define a character set. Maybe it
> was a document discussing character sets.
>

That's not what I described. I spoke about using a MIME-compatible private
charset identifier, and about how such a private identifier can be made
reasonably unique by binding it to a domain name or URI.

If you had read more carefully, you would have seen that I also said it is
absolutely not necessary to dereference that URL: many XML schemas bind
their namespaces to a URI which is not itself a webpage and does not point
to any downloadable DTD, XML schema or XML stylesheet. Google and Microsoft
do this in lots of schemas (which, if they are documented at all, are not
documented at that URL).

The URI by itself is just an identifier. It becomes a webpage only when you
use it in a web page with an href attribute to create a hyperlink, or when
you use it to query a service that returns some data. An identifier for a
private charset does not need any request to be performed in order to be
usable: the identifier is sufficient by itself. The URI can also be just a
base URI for a collection of resources, whose URLs start with this base URI
and have conventional extensions appended to get the character properties
or a font. But the best way is to embed this data in your document, in some
header or footer, if your document using the private charset is not part of
a collection of documents using the same private charset.
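
To illustrate the point about identifiers, here is a minimal Python sketch.
The identifier http://example.net/mycharset/ and the resource names
"properties.txt" and "font.woff" are assumptions made up for this example,
not an established convention. Nothing here performs a network request:
the URI is compared as a plain string, and the resource URLs are only
derived from it.

```python
# Hypothetical private charset identifier: an opaque string, never fetched.
CHARSET_ID = "http://example.net/mycharset/"

# Hypothetical conventional names appended to the base URI.
CONVENTIONAL_RESOURCES = {
    "properties": "properties.txt",
    "font": "font.woff",
}


def same_charset(id_a: str, id_b: str) -> bool:
    """Two documents use the same private charset if their identifiers match.

    This is a pure string comparison; the URI is not dereferenced.
    """
    return id_a == id_b


def resource_urls(base_uri: str) -> dict[str, str]:
    """Derive the URLs of the optional resources from the base URI.

    Every derived URL starts with the base URI, so the identifier also
    works as the root of a collection of resources.
    """
    return {kind: base_uri + name for kind, name in CONVENTIONAL_RESOURCES.items()}


if __name__ == "__main__":
    print(same_charset(CHARSET_ID, "http://example.net/mycharset/"))
    for kind, url in resource_urls(CHARSET_ID).items():
        print(kind, url)
```

A consumer that only needs to tell private charsets apart can stop at
same_charset(); the derived URLs matter only to a tool that wants the
properties or the font and does not find them embedded in the document.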

In that case, you don't need a new UTF: UTF-8 remains usable, and you can
map your private charset to standard PUAs (and/or to "hacked" characters)
according to the private charset's needs. The charset indicated in your
document (by some meta header) should be sufficient to avoid collisions
with other private conventions: it defines the scope of your private
charset as the document itself, which is then interchangeable. It can
possibly be mixed with other documents, with some renumbering, if there are
collisions of assignments between two distinct private charsets: in the
document header, add to the charset identifier the range of PUAs which is
used. Then, when two documents collide on this range, you can reencode one
of them automatically by creating a compound charset in which subranges of
PUAs are remapped to other ranges.
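
The renumbering described above could be sketched in Python as follows.
This is only an illustration under assumptions: the Doc record, its field
names and the merge() helper are hypothetical, each document is assumed to
declare a single contiguous PUA range in its header, and only the BMP
Private Use Area (U+E000..U+F8FF) is considered.

```python
from dataclasses import dataclass

# BMP Private Use Area, as defined by Unicode.
PUA_START, PUA_END = 0xE000, 0xF8FF


@dataclass
class Doc:
    charset: str  # private charset identifier (hypothetical header field)
    first: int    # first PUA code point declared in the header
    last: int     # last PUA code point declared in the header
    text: str


def collides(a: Doc, b: Doc) -> bool:
    """True if two distinct private charsets declare overlapping PUA ranges."""
    return a.charset != b.charset and a.first <= b.last and b.first <= a.last


def reencode(doc: Doc, new_first: int) -> Doc:
    """Shift the document's PUA range so that it starts at new_first."""
    shift = new_first - doc.first
    new_last = doc.last + shift
    if new_first < PUA_START or new_last > PUA_END:
        raise ValueError("remapped range falls outside the BMP PUA")
    # Map every code point of the declared range to its shifted position.
    table = {cp: cp + shift for cp in range(doc.first, doc.last + 1)}
    return Doc(doc.charset, new_first, new_last, doc.text.translate(table))


def merge(a: Doc, b: Doc) -> tuple[str, list[tuple[str, int, int]]]:
    """Concatenate two documents, renumbering b if its range collides with a.

    Returns the merged text and a compound header listing, for each
    private charset, the PUA subrange it occupies in the merged text.
    """
    if collides(a, b):
        b = reencode(b, a.last + 1)
    header = [(a.charset, a.first, a.last), (b.charset, b.first, b.last)]
    return a.text + b.text, header


if __name__ == "__main__":
    a = Doc("http://example.net/a/", 0xE000, 0xE00F, "x\uE001")
    b = Doc("http://example.net/b/", 0xE000, 0xE003, "\uE001y")
    text, header = merge(a, b)
    print([hex(ord(c)) for c in text])
    print(header)
```

The compound header plays the role of the charset identifier plus PUA
range mentioned above: a reader of the merged text can tell which private
charset owns each subrange without dereferencing anything.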