Why do binary files contain text but text files don't contain binary?
Ken Whistler via Unicode
unicode at unicode.org
Fri Feb 21 10:28:27 CST 2020
On 2/21/2020 7:53 AM, Costello, Roger L. via Unicode wrote:
> Text files may indeed contain binary (i.e., bytes that are not
> interpretable as characters). Namely, text files may contain newlines,
> tabs, and some other invisible things.
> Question: "characters" are defined as only the visible things, right?
No. You've gone astray right there. Please read Chapter 2 of the Unicode
Standard, and in particular, Section 2.4, Code Points and Characters:
All of those types of characters can occur in Unicode plain text. (With
the exception of surrogate code points.)
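To make that concrete, here is a minimal Python sketch (not part of the Standard; the helper name `encodable_as_utf8` is made up for illustration). It uses the standard `unicodedata` module to show that tabs, newlines and other invisible things are characters with a General Category of their own, and that a lone surrogate code point is the case that cannot be encoded as UTF-8 text.

```python
import unicodedata

# Invisible characters still have a General Category:
# Cc = control, Cf = format, Zs = space separator, Lu = uppercase letter.
for ch in ["\t", "\n", "\u200b", " ", "A"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))


def encodable_as_utf8(s: str) -> bool:
    """Hypothetical helper: True if every code point in s can be encoded as UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for lone surrogate code points such as U+D800.
        return False


print(encodable_as_utf8("a\tb\n"))   # tabs and newlines are fine
print(encodable_as_utf8("\ud800"))   # a lone surrogate is not
```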
> I conclude:
> Binary files may contain arbitrary text.
Binary files can contain *whatever*, including text.
> Text files may contain binary, but only a restricted set of binary.
The distinction is definitional. A text file contains *only* characters,
interpretable by a specific character encoding (usually Unicode, these days).
But a text file need not be "plain text". An HTML file is an example of
a text file (it contains only a sequence of characters, whose identity
and interpretation is all clearly specified by looking them up in the
Unicode Standard), but it is not *plain* text. It is *rich* text,
consisting of markup tags interspersed with runs of plain text.
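As a rough illustration of "contains only characters, interpretable by a specific character encoding", the following Python sketch tests whether a byte sequence decodes under a given encoding. The function name `decodes_as` and the sample byte strings are assumptions made up for this example. Note that a successful decode is only a necessary condition: an encoding like Latin-1 assigns a character to every byte value, so it will "decode" anything.

```python
def decodes_as(data: bytes, encoding: str = "utf-8") -> bool:
    """Hypothetical helper: True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# An HTML fragment: rich text, but still nothing except characters.
html = "<p>caf\u00e9</p>\n".encode("utf-8")

# Arbitrary bytes; 0x89 and 0xFF cannot appear this way in well-formed UTF-8.
blob = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0xFE, 0x00])

print(decodes_as(html))             # valid UTF-8
print(decodes_as(blob))             # not valid UTF-8
print(decodes_as(blob, "latin-1"))  # Latin-1 accepts every byte value
```

The last call is the reason the definition speaks of a *specific* encoding: whether a file is text depends on which encoding it is claimed to be in, not on the bytes alone.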
Another distinction that may be leading you astray is the distinction
between binary file transfer and text file transfer. If you are using
ftp, for example, you can specify use of binary file transfer, *even if*
the file you are transferring is actually a text file. That simply means
that the file transfer will agree to treat the entire file as a binary
blob and transfer it byte-for-byte intact. A text file transfer, on the
other hand, may look for "lines" in a text file and may adjust line
endings to suit the receiving platform conventions.
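The difference between the two transfer modes can be modeled in a few lines of Python. This is only a sketch of the idea, not of the FTP protocol itself; the function names `binary_transfer` and `text_transfer` are hypothetical.

```python
def binary_transfer(data: bytes) -> bytes:
    """Model of a binary transfer: every byte arrives unchanged."""
    return data


def text_transfer(data: bytes, receiver_newline: bytes) -> bytes:
    """Model of a text transfer: line endings are rewritten for the receiver."""
    # Normalize CRLF and lone CR to LF, then apply the receiver's convention.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.replace(b"\n", receiver_newline)


original = b"one\r\ntwo\r\n"

print(binary_transfer(original))         # identical bytes
print(text_transfer(original, b"\n"))    # CRLF rewritten as LF
print(text_transfer(b"a\nb\n", b"\r\n")) # LF rewritten as CRLF
```

Running a non-text blob through `text_transfer` would silently change any bytes that happen to look like line endings, which is why binary mode is the safe choice whenever the bytes must arrive intact.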
> Do you agree?