A file contains text data and binary data ... Is it a text file or a binary file?
Steffen Nurpmeso
steffen at sdaoden.eu
Mon Sep 14 16:38:05 CDT 2020
Roger L Costello wrote in
<SA9PR09MB51049422058DA65D4B54AB9EC8230 at SA9PR09MB5104.namprd09.prod.outl\
ook.com>:
|Thank you for your outstanding responses!
|
|I am a bit confused about something that Marcus said:
|
|Why do you need to give it a single attribute of "text" or "binary"? \
|It's a bit like asking about the single language of a text that contains \
|paragraphs in different languages.
|
|And Doug said:
|
|I think this is a false dichotomy.
|--------------------------------------------
|I realize there are binary files that contain text. For example, EXE \
|files contain binary data with islands of text scattered here and there. \
|But the key point, I think, is that EXE is categorized as a binary \
|file and not a text file. And thus EXE files should be displayed/edited \
|using an appropriate hex editor, not a text editor.
|
|Based on the many potential problems you described with using a text \
|editor to display a file that contains text data and binary data, I \
|draw this conclusion: If a file contains binary data, it is a binary \
|file; only if the file contains purely text data is it a text file. \
|Do you agree with this conclusion?
POSIX has these definitions (reordered):
3.403 Text File
A file that contains characters organized into zero or more lines. The lines do not contain NUL
characters and none can exceed {LINE_MAX} bytes in length, including the <newline>
character. Although POSIX.1-2017 does not distinguish between text files and binary files (see
the ISO C standard), many utilities only produce predictable or meaningful output when
operating on text files. The standard utilities that have such restrictions always specify "text
files" in their STDIN or INPUT FILES sections.

3.288 Printable File
A text file consisting only of the characters included in the print and space character
classifications of the LC_CTYPE category and the <backspace>, all in the current locale.
Note: The LC_CTYPE category is defined in detail in Section 7.3.1 (on page 139).
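
Read literally, the Text File definition quoted above can be checked
mechanically. What follows is only a minimal sketch, not code from any
POSIX utility: the name posix_text_check and passing the line limit as
a parameter are assumptions made for illustration. It also takes the
strict reading that trailing bytes without a final <newline> do not
form a line, so such a buffer is rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) matches the quoted "Text File" wording:
 * no NUL bytes, every line ends in '\n', and no line (newline
 * included) is longer than line_max bytes.  An empty buffer has
 * zero lines and passes.  (Hypothetical helper, for illustration.) */
int posix_text_check(const unsigned char *buf, size_t len, size_t line_max)
{
    size_t i, cur = 0; /* cur: bytes seen so far in the current line */

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; ++i) {
        if (buf[i] == '\0')
            return 0;          /* lines must not contain NUL */
        if (++cur > line_max)
            return 0;          /* line exceeds the limit */
        if (buf[i] == '\n')
            cur = 0;           /* line complete, start the next one */
    }
    /* cur != 0 means trailing bytes without a terminating newline */
    return cur == 0;
}
```

A caller would presumably pass the system's limit as line_max, for
example the value reported by sysconf(_SC_LINE_MAX), after checking
that the call did not return -1.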
For the mailer I maintain, the MIME classification checks for NUL
and other control characters, but does not treat these as binary:
/* If there is an escape sequence in reverse solidus notation defined
 * for this in ANSI X3.159-1989 (ANSI C89), do not treat it as
 * a control for real.  I.e., \a=\x07=BEL, \b=\x08=BS, \t=\x09=HT.
 * Do not follow libmagic(1) in respect to \v=\x0B=VT. \f=\x0C=NP; do
 * ignore \e=\x1B=ESC */
if((c >= '\x07' && c <= '\x0D') || c == '\x1B')
   continue;
Plus carriage return and newline.
The above applies only to 8-bit, ASCII-compatible character sets.
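
To show how the check quoted above could sit inside a scan over a
whole buffer, here is a minimal sketch. It is not the mailer's actual
code: the enum, the function name scan_bytes, and the three-way result
are assumptions made for illustration, as is the choice to count only
bytes below 0x20 as controls and to leave bytes from 0x80 upward
alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: what the scan found in the buffer. */
enum scan_result { SCAN_PLAIN, SCAN_HAS_CTRL, SCAN_HAS_NUL };

/* Scans buf[0..len) byte by byte.  Stops at the first NUL; otherwise
 * reports whether any control character outside the tolerated set
 * (0x07..0x0D and ESC) was seen. */
enum scan_result scan_bytes(const unsigned char *buf, size_t len)
{
    enum scan_result res = SCAN_PLAIN;
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; ++i) {
        unsigned char c = buf[i];

        if (c == 0x00)
            return SCAN_HAS_NUL;   /* NUL: no need to look further */
        if ((c >= 0x07 && c <= 0x0D) || c == 0x1B)
            continue;              /* tolerated: BEL..CR and ESC */
        if (c < 0x20)
            res = SCAN_HAS_CTRL;   /* any other control below SPACE */
    }
    return res;
}
```

With this sketch, "hi\tthere\r\n" and text containing ESC sequences
scan as plain, a stray \x01 is reported as a control, and any NUL
ends the scan at once.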
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)