Word processors and file formats

I’ve always been interested in file formats from the perspective of long-term access to information. These have been interesting times.

To much gnashing of teeth, Apple recently rolled out an update to its iWork suite: Pages, Numbers, and Keynote, its alternatives to the MS Office trinity of Word, Excel, and PowerPoint. The update on the Mac side seems to have been driven by the web and iPad versions, not only in the features (or lack thereof) but also in the new file format, which is completely unrelated to the old one. The new version can import files from the old one, but it’s definitely an importation process, and complex documents will break in the new apps.

The file format for all the new iWork apps, Pages included, is based on Google’s protocol buffers. The documentation for protocol buffers states:

However, protocol buffers are not always a better solution than XML – for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also – to some extent – self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).

Guess what we have here. Like I said, this has been driven by the iPad and web versions. Apple is assuming that you’re going to want to sync to iCloud, and they chose a file format optimized for that use case, rather than for, say, compatibility or human-readability. My use case is totally different. I’ve had clients demand that I not store their work in the cloud.
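To make the "only meaningful if you have the message definition" point concrete, here is a minimal sketch in Python of the generic protocol buffer wire encoding. It is not Apple's actual file layout, which I haven't taken apart; the two fields (a string in field 1, a number in field 2) are made up for illustration, and the decoder only handles the two wire types the example uses.

```python
def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_without_schema(data):
    """Walk the bytes with no .proto file: all we recover is
    (field number, wire type, raw value), never a field name."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:      # varint
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:    # length-delimited (string, bytes, nested message)
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        fields.append((field_number, wire_type, value))
    return fields


# Hypothetical message: field 1 holds the text "Hello", field 2 holds 150.
message = (
    encode_varint((1 << 3) | 2) + encode_varint(5) + b"Hello"
    + encode_varint((2 << 3) | 0) + encode_varint(150)
)

print(message.hex())                   # 0a0548656c6c6f109601
print(decode_without_schema(message))  # [(1, 2, b'Hello'), (2, 0, 150)]
```

Without the definition, there is no telling whether field 2 is a page count, a font size, or a color, and a length-delimited field could equally be text, raw bytes, or a nested message. The equivalent XML would at least carry element names.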

What’s interesting is that this bears some philosophical similarities to the Word file format, whose awfulness is the stuff of legend. Awful, but perhaps not awful for the sake of being awful. From Joel Spolsky:

The first thing to understand is that the binary file formats were designed with very different design goals than, say, HTML.

They were designed to be fast on very old computers.
…
They were designed to use libraries.
…
They were not designed with interoperability in mind.

New computers are not old, obviously, but running a full-featured word processor in a JavaScript interpreter inside your web browser is the next best thing; transferring your data over a wireless network is probably the modern equivalent of a slow hard drive in terms of speed.

There is a perfectly good public file format for documents out there, Rich Text Format or RTF. But curiously, Apple’s RTF parser doesn’t do as good a job with complex documents as its Word parser: if you create a complex document in Word and save it as both .rtf and .doc, Pages or Preview will show the .doc version with better fidelity. Which makes a bit of a joke out of having a “standard” file format. Since I care about file formats and future-proofing, I saved my work in RTF for a while, until I figured out that it wasn’t as well supported.

What about something more basic than RTF? Plain text is, well, too plain: I need to insert commentary, tables, that sort of thing. Writing HTML by hand is too much of a PITA, although it should have excellent future-proofing.

What about Markdown? I like Markdown a lot. I’m actually typing in it right now. It doesn’t take long before it becomes second nature. Having been messing around with HTML for a long time, I prefer the idea of putting the structure of my document into the text rather than the appearance.

But Markdown by itself isn’t good enough for paying work. It has been extended in various ways to allow for footnotes, commentary, tables, etc. I respect the effort to implement all the features that a well-rounded word processor might support through plain, human-readable text, but at some point it just gets to be too much trouble. Markdown has two main benefits: it is highly portable and fast to type—actually faster than messing around with formatting features in a word processor. These extensions are still highly portable, but they are slow to type—slower than invoking the equivalent functions in a typical WYSIWYG word processor. The extensions are also more limited: the table markup doesn’t accommodate some of the insane tables that I need to deal with, and doesn’t include any mechanism for specifying column widths. Footnotes don’t let me specify whether they’re footnotes or endnotes (indeed, Markdown is really oriented toward flowed onscreen documents, where the distinction between footnotes and endnotes is meaningless, rather than paged documents). CriticMarkup, the extension to Markdown that allows commentary, starts looking a little ungainly. There’s a bigger philosophical problem with it though. I could imagine using Markdown internally for my own work and exporting to Word format (that’s easy enough thanks to Pandoc), but in order to use CriticMarkup, I’d need to convince my clients to get on board, and I don’t think that’s going to happen.
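For a sense of what CriticMarkup looks like in practice, here is a rough sketch in Python that accepts every tracked change in a string. The five delimiter pairs are the ones CriticMarkup defines; the `accept_all` function and the sample sentence are my own illustration, not part of any CriticMarkup tool, and the regexes make no attempt to handle nested or malformed markup.

```python
import re

# The five CriticMarkup constructs.
SUBSTITUTION = re.compile(r"\{~~(.*?)~>(.*?)~~\}", re.DOTALL)  # {~~old~>new~~}
ADDITION = re.compile(r"\{\+\+(.*?)\+\+\}", re.DOTALL)         # {++added++}
DELETION = re.compile(r"\{--(.*?)--\}", re.DOTALL)             # {--removed--}
COMMENT = re.compile(r"\{>>(.*?)<<\}", re.DOTALL)              # {>>remark<<}
HIGHLIGHT = re.compile(r"\{==(.*?)==\}", re.DOTALL)            # {==marked==}


def accept_all(text):
    """Return text with every suggested change accepted and comments dropped."""
    text = SUBSTITUTION.sub(r"\2", text)  # keep the replacement
    text = ADDITION.sub(r"\1", text)      # keep the added text
    text = DELETION.sub("", text)         # drop the deleted text
    text = COMMENT.sub("", text)          # drop reviewer comments
    text = HIGHLIGHT.sub(r"\1", text)     # keep highlighted text, lose the marks
    return text


draft = (
    "The format is {~~awful~>complicated~~}, "
    "{--very --}{++but ++}usable.{>>Check tone.<<}"
)
print(accept_all(draft))  # The format is complicated, but usable.
```

The cleanup is easy. The hard part is the `draft` string: that is what a client would have to read and type in place of Word's tracked changes.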

I can imagine a word processor that used some kind of super-markdown as a file format, let the user type in Markdown when convenient, but added WYSIWYG tools for those parts of a document that are too much trouble to type by hand. But I’m not holding my breath. Maybe I should learn LaTeX.
