Is the binaryness/textness of a data format a property?

Richard Wordingham via Unicode unicode at unicode.org
Fri Mar 20 09:49:24 CDT 2020


On Fri, 20 Mar 2020 13:46:25 +0100
Adam Borowski via Unicode <unicode at unicode.org> wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +0000, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> > 
> > JPEG is a binary data format.
> > CSV is a text data format.
> > 
> > Question #1: Is the binaryness/textness of a data format a
> > property? 
> > 
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?  

I'd suggest 'texthood' as the correct English term.

> I'm afraid this question is too fuzzy to have a proper answer.
> 
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format.  Microsoft employees and some members of
> this list will disagree.

Some files change type on changing operating system.  Digital's old RMS
formats included basic text files in which each record (roughly a
line) started with a binary 2-byte length field.  Text records on
magnetic tape typically started with an ASCII length count!
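
To illustrate why such a file stops looking like text once it leaves its
home system, here is a minimal sketch of a length-prefixed record layout.
The little-endian byte order, the absence of padding and both function
names are assumptions made for the example; this is not a description of
the actual RMS on-disk format.

```python
import struct


def write_length_prefixed_records(lines):
    """Join records, each preceded by a 2-byte binary length.

    Assumption: little-endian length, no padding between records.
    """
    return b"".join(struct.pack("<H", len(line)) + line for line in lines)


def read_length_prefixed_records(data):
    """Split a buffer back into its records."""
    records = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        if pos + length > len(data):
            raise ValueError("truncated record")
        records.append(data[pos:pos + length])
        pos += length
    return records
```

The records hold nothing but text, yet the file contains no newline
bytes at all, and the length fields introduce bytes in the 0..31 range
that a Unix tool would take as a sign of binary data.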

> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.

No worse than a hex dump - in fact, a lot more readable.  Indeed, are
you not aware of the concept of a write-only programming language? 

> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's
> not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters

Unassigned characters are perfectly reasonable in a text file.  Surely
you aren't saying that a text file using the characters new to Unicode
13.0 should, at present, usually be regarded as a binary file?
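
The quoted definition, with unassigned characters allowed, might be
sketched as below.  The function name is hypothetical, and two readings
are assumptions: carriage return is counted as part of a newline, and
"invalid characters" is taken to mean Unicode noncharacters only.

```python
def is_technically_text(data: bytes) -> bool:
    """Rough test for 'technically text' under the quoted definition."""
    try:
        # Strict decoding rejects malformed UTF-8 and encoded surrogates.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    for ch in text:
        cp = ord(ch)
        # No C0 controls other than tab and newline characters
        # (assumption: CR is accepted as part of CRLF line endings).
        if cp < 32 and ch not in "\t\n\r":
            return False
        # Assumption: 'invalid' means noncharacters, i.e. U+FDD0..U+FDEF
        # and the last two code points of every plane.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return False
        # Unassigned code points are deliberately not rejected.
    return True
```

By this test, ASCII text encoded as UTF-16LE fails on its NUL bytes,
while a file using code points that a given Unicode version has not yet
assigned still passes.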

> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint?  This all falls apart.

Well, a .docx file isn't text - it's a variety of ZIP file, which is
binary.  Indeed, as Word files naturally include pictures, it very much
isn't a text file.  A .doc file is more like an image dump of a file
system.  A .rtf file, on the other hand, probably is a text file -
though I've a feeling there are variants that aren't *A*SCII.
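
The ZIP point can be checked mechanically.  The sketch below uses
Python's standard zipfile module; the helper names and the one-member
archive are made up for the example, and the archive is a stand-in for a
real .docx rather than a valid Word document.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """True if the bytes form a ZIP archive, whatever the file is called."""
    return zipfile.is_zipfile(io.BytesIO(data))


def make_tiny_zip() -> bytes:
    """Build a one-member archive in memory (hypothetical stand-in)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
    return buf.getvalue()


# A short RTF-style sample: braces, backslashes and plain ASCII.
RTF_SAMPLE = b"{\\rtf1\\ansi Hello, world.}"
```

The archive begins with the bytes "PK" and is recognised as a ZIP, while
the RTF sample is not.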

Richard.

