A file contains text data and binary data ... Is it a text file or a binary file?

Steffen Nurpmeso steffen at sdaoden.eu
Mon Sep 14 16:38:05 CDT 2020


Roger L Costello wrote in
 <SA9PR09MB51049422058DA65D4B54AB9EC8230 at SA9PR09MB5104.namprd09.prod.outl\
 ook.com>:
 |Thank you for your outstanding responses!
 |
 |I am a bit confused about something that Marcus said:
 |
 |Why do you need to give it a single attribute of "text" or "binary"? \
 |It's a bit like asking about the single language of a text that contains \
 |paragraphs in different languages.
 |
 |And Doug said:
 |
 |I think this is a false dichotomy.
 |--------------------------------------------
 |I realize there are binary files that contain text. For example, EXE \
 |files contain binary data with islands of text scattered here and there. \
 |But the key point, I think, is that EXE is categorized as a binary \
 |file and not a text file. And thus EXE files should be displayed/edited \
 |using an appropriate hex editor, not a text editor.
 |
 |Based on the many potential problems you described with using a text \
 |editor to display a file that contains text data and binary data, I \
 |draw this conclusion: If a file contains binary data, it is a binary \
 |file; only if the file contains purely text data is it a text file." \
 |Do you agree with this conclusion?

POSIX has these definitions (reordered here):

3.403 Text File
       A file that contains characters organized into zero or more lines. The lines do not contain NUL
       characters and none can exceed {LINE_MAX} bytes in length, including the <newline>
       character. Although POSIX.1-2017 does not distinguish between text files and binary files (see
       the ISO C standard), many utilities only produce predictable or meaningful output when
       operating on text files. The standard utilities that have such restrictions always specify "text
       files" in their STDIN or INPUT FILES sections.

3.288 Printable File
       A text file consisting only of the characters included in the print and space character
       classifications of the LC_CTYPE category and the <backspace>, all in the current locale.
       Note:     The LC_CTYPE category is defined in detail in Section 7.3.1 (on page 139).
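The character-class part of that definition maps fairly directly onto isprint() and isspace() from <ctype.h>. A rough sketch under stated assumptions: the name has_only_printable_bytes is hypothetical, it looks at single bytes only (so it is not meaningful for multibyte encodings such as UTF-8), and it does not repeat the text-file checks (NUL, line length) that the definition also implies.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte is in the print or
 * space class of the current LC_CTYPE locale, or is <backspace>.
 * Assumption: single-byte character set; multibyte sequences would
 * need mbrtowc() plus iswprint()/iswspace() instead. */
int has_only_printable_bytes(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (isprint(c) || isspace(c) || c == '\b')
            continue;
        return 0;           /* neither print, space nor backspace */
    }
    return 1;
}
```

The result depends on the locale in effect; a program that wants the user's locale rather than the default "C" locale has to call setlocale(LC_CTYPE, "") first.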

For the mailer I maintain, the MIME classification checks for NUL
and other control characters, but does not treat the following as binary:

     /* If there is a escape sequence in reverse solidus notation defined
      * for this in ANSI X3.159-1989 (ANSI C89), do not treat it as
      * a control for real.  I.e., \a=\x07=BEL, \b=\x08=BS, \t=\x09=HT.
      * Do not follow libmagic(1) in respect to \v=\x0B=VT.  \f=\x0C=NP; do
      * ignore \e=\x1B=ESC */
     if((c >= '\x07' && c <= '\x0D') || c == '\x1B')
        continue;

Carriage return and newline are not treated as binary either.
The above applies only to 8-bit, ASCII-compatible data.
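To show how such a rule fits into a complete classification pass, here is a self-contained sketch. It is not the mailer's actual code: the names content_class and classify_bytes are invented for illustration, and treating only bytes below 0x20 as controls (leaving DEL and bytes with the high bit set alone) is an assumption.

```c
#include <stddef.h>

enum content_class { CONTENT_TEXT, CONTENT_BINARY };

/* Hypothetical classifier built around the exemption shown above.
 * Assumption: only bytes below 0x20 count as control characters;
 * DEL (0x7F) and 8-bit bytes are passed through untouched. */
enum content_class classify_bytes(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == '\0')
            return CONTENT_BINARY;      /* NUL always means binary */
        /* BEL, BS, HT, LF, VT, FF, CR and ESC are tolerated */
        if ((c >= 0x07 && c <= 0x0D) || c == 0x1B)
            continue;
        if (c < 0x20)
            return CONTENT_BINARY;      /* any other control character */
    }
    return CONTENT_TEXT;
}
```

Under this sketch a buffer with ANSI colour escapes or CRLF line ends still counts as text, while a single NUL or, say, a 0x01 byte anywhere flips the result to binary.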

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


