Why do binary files contain text but text files don't contain binary?

Ken Whistler via Unicode unicode at unicode.org
Fri Feb 21 10:28:27 CST 2020


On 2/21/2020 7:53 AM, Costello, Roger L. via Unicode wrote:
>
> Text files may indeed contain binary (i.e., bytes that are not 
> interpretable as characters). Namely, text files may contain newlines, 
> tabs, and some other invisible things.
>
> Question: "characters" are defined as only the visible things, right?
>
No. You've gone astray right there. Please read Chapter 2 of the Unicode 
Standard, and in particular, Section 2.4, Code Points and Characters:

https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564

All of the types of code points described there, including control 
characters such as newlines and tabs, can occur in Unicode plain text 
(with the exception of surrogate code points).
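
As a rough illustration (a sketch using Python's standard unicodedata 
module, not anything taken from the Standard itself), one can ask for 
the General_Category of a few code points. The helper names describe 
and encodes_as_utf8 are made up for this example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX <General_Category>' for a single code point."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)}"


def encodes_as_utf8(ch: str) -> bool:
    """Report whether the code point can be serialized as UTF-8."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A letter, a tab, a newline, ZERO WIDTH JOINER, a private-use code point.
for ch in ["A", "\t", "\n", "\u200d", "\ue000"]:
    print(describe(ch), encodes_as_utf8(ch))

# A lone surrogate code point is the exception: it cannot be encoded.
print(describe("\ud800"), encodes_as_utf8("\ud800"))
```

Tab and newline report category Cc (control), just as "A" reports Lu; 
they are invisible, but they are still characters. Only the surrogate 
(Cs) fails to encode.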

> I conclude:
>
> Binary files may contain arbitrary text.
>
Binary files can contain *whatever*, including text.
>
> Text files may contain binary, but only a restricted set of binary.
>
The distinction is definitional. A text file contains *only* characters, 
interpretable by a specific character encoding (usually Unicode, these 
days).
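
To make "interpretable by a specific character encoding" concrete, 
here is a minimal sketch in Python. The function name is_text_in is 
hypothetical, and the check only tests decodability under the named 
encoding; it is not a general-purpose file type detector.

```python
def is_text_in(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if every byte of data decodes under the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


plain = b"line one\n\tindented line\n"   # tabs and newlines are characters
png_signature = b"\x89PNG\r\n\x1a\n"     # first eight bytes of a PNG file

print(is_text_in(plain))                     # decodes as UTF-8
print(is_text_in(png_signature))             # 0x89 is not valid UTF-8 here
print(is_text_in(png_signature, "latin-1"))  # every byte maps in Latin-1
```

The last call is the reason the answer depends on the encoding: the 
same bytes can be a valid character sequence under one encoding and 
not under another.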

But a text file need not be "plain text". An HTML file is an example of 
a text file (it contains only a sequence of characters, whose identity 
and interpretation are all clearly specified by looking them up in the 
Unicode Standard), but it is not *plain* text. It is *rich* text, 
consisting of markup tags interspersed with runs of plain text.
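
A small sketch of that split, using Python's standard html.parser 
module. The class name MarkupSplitter and the function split_markup are 
invented for this illustration, and the sample input is made up.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Collect start-tag names and the plain-text runs between tags."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []
        self.text_runs: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_runs.append(data)


def split_markup(html_source: str):
    """Return (start tags, plain-text runs) found in an HTML string."""
    splitter = MarkupSplitter()
    splitter.feed(html_source)
    splitter.close()
    return splitter.tags, splitter.text_runs


tags, runs = split_markup("<p>Hello <b>world</b></p>")
print(tags)   # ['p', 'b']
print(runs)   # ['Hello ', 'world']
```

The input is nothing but characters, so it is a text file; the parser 
simply assigns some of those characters the role of markup.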

Another distinction that may be leading you astray is the distinction 
between binary file transfer and text file transfer. If you are using 
ftp, for example, you can specify use of binary file transfer, *even if* 
the file you are transferring is actually a text file. That simply means 
that the file transfer will agree to treat the entire file as a binary 
blob and transfer it byte-for-byte intact. A text file transfer, on the 
other hand, may look for "lines" in a text file and may adjust line 
endings to suit the receiving platform conventions.
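
The difference can be sketched in a few lines of Python. This is not 
how any ftp client is implemented; transfer_binary and transfer_text 
are hypothetical names, and the text mode below only models the 
line-ending adjustment.

```python
def transfer_binary(payload: bytes) -> bytes:
    """Binary mode: deliver the bytes exactly as they were sent."""
    return payload


def transfer_text(payload: bytes, receiver_newline: bytes = b"\n") -> bytes:
    """Text mode: rewrite line endings to the receiver's convention."""
    # Fold CRLF and lone CR to LF first, then map LF to the target ending.
    normalized = payload.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.replace(b"\n", receiver_newline)


text_file = b"first line\r\nsecond line\r\n"
print(transfer_binary(text_file) == text_file)   # True: byte-for-byte
print(transfer_text(text_file, b"\n"))           # b'first line\nsecond line\n'

# Text mode applied to non-text data changes bytes it should not touch.
png_signature = b"\x89PNG\r\n\x1a\n"
print(transfer_text(png_signature) == png_signature)   # False
```

Binary mode is always safe for a text file; text mode is only safe 
when the content really is text.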

> Do you agree?
>
No.

--Ken
