A file contains text data and binary data ... Is it a text file or a binary file?

Harriet Riddle harjitmoe at outlook.com
Mon Sep 14 15:13:00 CDT 2020


In answer to your second question:

Per the current C spec, loading a file as text and saving it again is only guaranteed to round-trip the contents if they consist solely of printing characters, native line breaks, horizontal tabs, and spaces which are not immediately followed by a line break.
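As a rough illustration of the conditions above, here is a minimal sketch of a checker. It assumes an ASCII execution character set (printing characters taken as 0x21 to 0x7E plus space) and LF as the native line break; both are assumptions for the example, since the actual set of printing characters is locale-dependent. The function name is hypothetical.

```python
def is_roundtrip_safe(data: bytes) -> bool:
    """Check the round-trip conditions listed above.

    Assumptions (not from the C spec itself): ASCII printing
    characters, and LF (0x0A) as the native line break.
    """
    for i, byte in enumerate(data):
        if byte in (0x0A, 0x09):          # line break or horizontal tab
            continue
        if byte == 0x20:                  # space: must not precede a line break
            if i + 1 < len(data) and data[i + 1] == 0x0A:
                return False
            continue
        if 0x21 <= byte <= 0x7E:          # ASCII printing characters
            continue
        return False                      # NUL, CR, high bytes, other controls
    return True


if __name__ == "__main__":
    print(is_roundtrip_safe(b"plain text\n"))
    print(is_roundtrip_safe(b"text then binary \xff\xaa\xbc T"))
```

Any file with raw binary data appended will almost certainly fail this check, so nothing in the spec promises that it survives a text-mode load and save.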

Some editors will round-trip the file; some might truncate it at a NUL (e.g. leafpad) and/or at ^Z when loading; some will corrupt it by newline conversion (e.g. reading both CRLF and LF as LF); and so forth.
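The two failure modes mentioned above can be reproduced in a few lines. This is only a sketch: the helper names are hypothetical, Latin-1 is chosen purely because it maps every byte to a character (so any damage comes from newline handling alone), and the NUL truncation is a simulation of what a C-string-based loader might do, not the behaviour of any particular editor.

```python
import io


def roundtrip_with_newline_conversion(data: bytes) -> bytes:
    """Read bytes as text with universal newlines, then write them back.

    newline=None makes the reader translate CRLF and lone CR to LF,
    which is what many text-mode readers do.
    """
    reader = io.TextIOWrapper(io.BytesIO(data), encoding="latin-1", newline=None)
    text = reader.read()
    return text.encode("latin-1")


def truncate_at_nul(data: bytes) -> bytes:
    """Simulate a loader that treats the first NUL as end of content."""
    return data.split(b"\x00", 1)[0]


if __name__ == "__main__":
    sample = b"header\r\n\xff\xaa\x00\xbc\r\nT"
    print(roundtrip_with_newline_conversion(sample) == sample)  # bytes changed?
    print(len(truncate_at_nul(sample)), "of", len(sample), "bytes kept")
```

In both cases the bytes that come back differ from the bytes that went in, even though the text portion looks untouched.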

(※ ^Z is normally SUB, but DOS encodings by their IBM/ICU mappings pivot FS–SUB–DEL for some reason, so it arguably counts as FS here…)

So if you intend to preserve the binary data, loading it as text is not advised. (It may also confound encoding detection, or throw an error in a strict-mode UTF-8 reader, et cetera, and as such potentially complicate the ability to read non-ASCII text, depending on the details of your case.)
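To make the strict-mode point concrete, here is a small sketch using Python's built-in UTF-8 codec. The sample bytes are made up for the example (text followed by a few high bytes), and the helper name is hypothetical.

```python
def try_strict_utf8(data: bytes):
    """Return (text, None) on success, or (None, offset) where decoding failed."""
    try:
        return data.decode("utf-8", errors="strict"), None
    except UnicodeDecodeError as exc:
        return None, exc.start


if __name__ == "__main__":
    mixed = b"plain text\n\xff\xaa\xbc T"   # hypothetical text + binary tail

    text, bad_offset = try_strict_utf8(mixed)
    print("strict decode failed at byte offset:", bad_offset)

    # A lenient reader avoids the error but silently replaces the bytes,
    # so saving the result no longer reproduces the original file.
    lenient = mixed.decode("utf-8", errors="replace")
    print("lossless after lenient decode:", lenient.encode("utf-8") == mixed)
```

Either the reader refuses the file outright, or it accepts it and the binary tail is lost on the next save.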

And yes, a binary file can contain text segments while still being a binary file, but not the other way around. A tar archive is always a binary file, and most ar archives are too. An ar (.a) archive containing only UTF-8 text files is itself arguably a UTF-8 text file, though it usually isn't treated as one: that only works in this special case, whereas treating ar as a binary format, like tar, always works.
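The tar versus ar contrast can be checked directly. The sketch below hand-builds a simplified ar archive (common format: an `!<arch>` magic line, 60-byte space-padded ASCII headers, and a newline as padding for odd-sized members) and compares it with a tar archive produced by Python's `tarfile` module. The `make_ar` helper is hypothetical and ignores long names, symbol tables, and real timestamps or ownership.

```python
import io
import tarfile


def make_ar(members):
    """Build a simplified ar archive from (name, data) pairs."""
    out = bytearray(b"!<arch>\n")
    for name, data in members:
        # name(16) mtime(12) uid(6) gid(6) mode(8) size(10) + "`\n" = 60 bytes
        header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}`\n".format(
            name + "/", 0, 0, 0, "100644", len(data)
        )
        out += header.encode("ascii") + data
        if len(data) % 2:                 # members are aligned to even offsets
            out += b"\n"
    return bytes(out)


def make_tar(members):
    """Build an uncompressed tar archive from (name, data) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    files = [("a.txt", "héllo\n".encode("utf-8")), ("b.txt", b"world\n")]
    ar_bytes, tar_bytes = make_ar(files), make_tar(files)
    print("ar decodes as UTF-8, no NULs:", b"\x00" not in ar_bytes)
    print("tar contains NUL padding:", b"\x00" in tar_bytes)
```

With only UTF-8 text members, the ar output is valid UTF-8 with no NUL bytes, while the tar output is padded with NULs regardless of what it contains.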

--Har.


________________________________
From: Unicode <unicode-bounces at unicode.org> on behalf of Roger L Costello via Unicode <unicode at unicode.org>
Sent: Monday, September 14, 2020 8:46:07 PM
To: Unicode Discussion <unicode at unicode.org>
Subject: A file contains text data and binary data ... Is it a text file or a binary file?

Hi Folks,

A file contains a long series of text data and at the end is binary data. The binary data is not encoded as base64 text or anything like that. It is raw, unfiltered, unencoded binary data.

Is it a text file or a binary file?

A colleague argues that it may be legitimately treated as a text file. After all, it can be opened in a text editor. The text editor might display odd-looking characters such as this:

ÿª¼ T

But that is harmless.

Is there a practical, real-world problem with treating it as a text file?

/Roger

