Is the binaryness/textness of a data format a property?
Richard Wordingham via Unicode
unicode at unicode.org
Fri Mar 20 09:49:24 CDT 2020
On Fri, 20 Mar 2020 13:46:25 +0100
Adam Borowski via Unicode <unicode at unicode.org> wrote:
> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> > JPEG is a binary data format.
> > CSV is a text data format.
> > Question #1: Is the binaryness/textness of a data format a
> > property?
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?
I'd suggest 'texthood' as the correct English term.
> I'm afraid this question is too fuzzy to have a proper answer.
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format. Microsoft employees and some members of
> this list will disagree.
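One concrete reason for the disagreement can be shown in a few lines. The sketch below is only an illustration (the helper name `has_nul_bytes` is made up here): the same string has no NUL bytes in UTF-8 but many in UTF-16LE, and NUL bytes are what many byte-oriented tools take as a sign of binary data.

```python
def has_nul_bytes(data: bytes) -> bool:
    """Return True if the byte string contains a NUL (0x00) byte."""
    return b"\x00" in data


sample = "plain text\n"

# Same characters, two encodings.
as_utf8 = sample.encode("utf-8")
as_utf16le = sample.encode("utf-16-le")

# Every ASCII character becomes two bytes in UTF-16LE, the second being 0x00.
print(has_nul_bytes(as_utf8))      # no NUL bytes in the UTF-8 form
print(has_nul_bytes(as_utf16le))   # NUL bytes present in the UTF-16LE form
```

Both byte strings decode back to the same text, so whether the file is "text" depends on what the reader expects, not on the content.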
Some files change type on changing operating system. Digital's old RMS
formats included basic text files in which each record (roughly a
line) started with a binary 2-byte length field. Text records on
magnetic tape typically started with an ASCII length count!
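A rough sketch of the two record layouts just described, in Python. The details are assumptions made for illustration, not taken from the message: the binary length is read as little-endian with no padding between records, and the ASCII count is taken to be four decimal digits that include themselves in the count. The function names are hypothetical.

```python
import struct


def read_binary_prefixed(data: bytes) -> list[bytes]:
    """Split records that each start with a 2-byte binary length.

    Assumption: little-endian length, counting only the payload,
    and no padding between records.
    """
    records = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        records.append(data[pos:pos + length])
        pos += length
    return records


def read_ascii_prefixed(data: bytes) -> list[bytes]:
    """Split records that each start with an ASCII decimal length.

    Assumption: a 4-digit count that includes the 4 digits themselves.
    """
    records = []
    pos = 0
    while pos < len(data):
        total = int(data[pos:pos + 4].decode("ascii"))
        if total < 4:
            raise ValueError("record length shorter than its own count field")
        records.append(data[pos + 4:pos + total])
        pos += total
    return records
```

In both layouts the payload may be perfectly ordinary text, yet only the second file consists entirely of printable characters.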
> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.
No worse than a hex dump - in fact, a lot more readable. Indeed, are
you not aware of the concept of a write-only programming language?
> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out)
> * correctly encoded for the expected charset (and nowadays, if that's
> not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters
Unassigned characters are perfectly reasonable in a text file. Surely
you aren't saying that a text file using the characters new to Unicode
13.0 should, at present, usually be regarded as a binary file?
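The three-point definition, read together with the objection about unassigned characters, could be sketched as below. This is one possible reading, not a definitive test: "newlines" is taken strictly as LF (so CR is rejected, and CRLF files fail), the expected charset is assumed to be UTF-8, and "invalid characters" is narrowed to Unicode noncharacters so that unassigned code points still pass. All names are hypothetical.

```python
ALLOWED_CONTROLS = {0x09, 0x0A}  # tab and line feed only (assumption: CR is out)


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_technically_text(data: bytes) -> bool:
    # Rule 1: no bytes 0..31 other than newline and tab.
    # In UTF-8, bytes below 0x80 only ever stand for themselves.
    if any(b < 0x20 and b not in ALLOWED_CONTROLS for b in data):
        return False
    # Rule 2: correctly encoded for the expected charset (UTF-8 here).
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Rule 3, narrowed: reject noncharacters, but accept unassigned code points.
    return not any(is_noncharacter(ord(ch)) for ch in text)
```

Because the check never consults a table of assigned characters, its verdict on a file does not change when a new version of Unicode is published.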
> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint? This all falls apart.
Well, a .docx file isn't text - it's a variety of ZIP file, which is
binary. Indeed, as Word files naturally include pictures, it very much
isn't a text file. A .doc file is more like an image dump of a file
system. A .rtf file on the other hand, probably is a text file -
though I've a feeling there are variants that aren't *A*SCII.
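The difference between the three formats shows up in the first few bytes of the file. A minimal sketch, using the well-known leading signatures of ZIP archives, OLE2 compound files (the container used by .doc) and RTF; the function name and return labels are made up for illustration, and a matching signature only identifies the container, not the application that wrote it.

```python
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE2 compound file header
RTF_MAGIC = b"{\\rtf"                              # RTF opens with a readable group


def sniff_container(head: bytes) -> str:
    """Guess the container type from the leading bytes of a file."""
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(RTF_MAGIC):
        return "rtf"
    return "unknown"
```

Only the RTF signature is itself printable ASCII, which fits the view that .rtf is the one text format of the three.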