Portable Documents for the Open Web (Part 1)

Having been involved for over two decades with the intersection of technology and publishing, I’m looking forward to being an occasional writer for the TOC blog. At Joe Wikert’s invitation, I’m starting out with my personal vision for the future of portable documents and the Web, including the relationship between EPUB 3, HTML5 and PDF. This post is the first in a three-part series. Part two can be found here and part three here.

What’s up with HTML5 and EPUB 3? (and, is EPUB even important in an increasingly cloud-centric world?)

EPUB is the well-known open standard XML-based format for eBooks and other digital publications, based on HTML and CSS. EPUB is the primary distribution format for B&N Nook, Kobo, Apple iBooks, Sony Reader, and many other eBook platforms, and is supported by Amazon as an ingestion format for Kindle (whose distribution format is proprietary).

Until recently EPUB has been primarily used for text-centric publications, with pagination and formatting generally applied “on the fly” by reading systems. But the latest EPUB 3, whose specification was completed last Fall, now supports complex fixed layouts as well as audio and video, interactivity, MathML, and many other new features. It’s definitely not your father’s EPUB any more. But, since the vast majority of this added goodness comes from HTML5 and related Web Standards, rather than anything intrinsic to EPUB, some have questioned whether EPUB is a necessary or beneficial ingredient for the future of books in an increasingly cloud-oriented digital world. Jani Patokallio, Publishing Platform Architect at Lonely Planet, has even argued that eBooks are obsolete and we should “throw away the artificial shackles of ePub” and develop websites instead (the comment thread that I contributed to there was the original stimulus for me to write this article).

As the Executive Director of the International Digital Publishing Forum (IDPF), the group responsible for the EPUB standard, I’m certainly not a disinterested party. But, professional affiliation aside, my personal vision for EPUB, informed by a career largely spent developing publishing technology and standards, is drastically different from Jani’s. I fully agree with him and others that publishers need to think and act a lot more like web developers. I also agree that for some types of real-time content (including a lot of what Lonely Planet publishes) websites and web-based databases will over time replace books (whether p- or e-).

But I believe that EPUB has a critical role to play that isn’t filled by plain websites. In a nutshell, EPUB is the portable document format for the Open Web. But that raises a couple of fundamental questions – what is a portable document anyway, and why do we still need them (or files at all) in a world rapidly evolving towards a cloud-centric model of content distribution and access?

The Enduring Need for Portable Documents

Back in the early 1990s, Adobe successfully framed its Acrobat format as the Portable Document Format – hence its name, PDF. PDF is defined by two distinct attributes. First, a PDF file is a fixed sequence of static page images: an electronic equivalent of paper, or “What You See Is What You Get” (WYSIWYG). Second, and more fundamentally, a PDF file is self-contained and usable across devices and operating systems. This was the core meaning of “portable”, underscored by Adobe’s original tagline for Acrobat: “view and print anywhere”.

We’ll come back to the WYSIWYG bit. But the first question is whether, in a Web world, we will still need to package and download portable documents. As Jani’s colleague Gus Balbontin wrote in the comment thread on the aforementioned post: “If I can use a great website (read: UX, content, functionality) online and offline…on any of my devices… I don’t see a reason why we wouldn’t migrate to all our requirements being fulfilled by pointing our browser to a specific address”.

I think Jani and Gus’s arguments gloss over a critical distinction between websites and documents.

The Web’s fundamental architecture – REST – depends on two-way transfer of information between servers and clients. While caching is part of that architecture, and thus websites can indeed potentially be used offline, the core assumption of the Web includes the opportunity to dynamically determine the data sent from a server to a client. The fact that the server “knows” the clients it’s communicating with (even though at times mediated by a cache) is at the heart of the Web’s distributed architecture. So web pages are rarely designed to be long-lived entities, and increasingly never exist as static objects at all, instead being generated on the server from database queries or even on the client by JavaScript applications in the browser.

The REST architectural style is great for a number of things, but content portability is not one of them: the coupling of server and client is loose, but it still exists. Complex websites rarely work properly other than on browsers with which they are explicitly tested. That means current browsers: not older ones, and not necessarily newer ones. More fundamentally, it’s just not possible to deterministically take a modern, rich website – a collection of markup, stylesheets, assets, and server and client programmatic elements – and usably transfer it to either an end user or to another entity for subsequent redistribution. In some sense a website doesn’t really ever exist as a reified object. Instead, a web server is called upon – in the context of some specific web client – to serve it up, piece by piece. Or, really, multiple web servers, because part of the architecture of the web is its distributed nature. A given website, like Lonely Planet’s, may be served up by a host of affiliated servers for content, social data, advertising, analytics, etc. Notionally, what is triggered by (as Gus wrote) “pointing our browser to a specific address” is a stream of unique, tailored requests and responses, operating in parallel with execution of downloaded JavaScript code in the browser’s virtual machine.

Generally speaking, this is OK. Online web applications are not necessarily expected to be archived for years on end. You engage with Expedia.com today, on yesterday’s and today’s browsers. Wayback Machine notwithstanding, one doesn’t expect to stash away a copy of Expedia.com and visit it with future browsers in two years’ time. Again, while some web experiences can be cached for offline execution, this is a special case of a generally online-centric architecture. And it is one that is typically temporally limited to timely content and only works well in practice on systems where it’s carefully vetted and tested (it’s not coincidental that the signature example of an offline Web application, the Financial Times, only works in one specific environment, iOS, and is focused on “fresh” content).

A portable document, by contrast, is fundamentally a single entity that reliably contains its constituent content. There may be links to the broader corpus of Web content, but it’s clear what’s “inside” a portable document package. And the portable document is not generated ephemerally for a single client system, but as something that can be reliably archived, moved across different devices, and redistributed via multiple channels.
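To make “it’s clear what’s inside” concrete, here is a minimal sketch in Python. It is an illustration only, not EPUB’s actual packaging rules: the function names (`build_package`, `list_contents`), the `manifest.json` file, and the sample resources are all hypothetical. It borrows just one real idea, that a portable document can be bundled as a single ZIP-style archive, and shows that once content is packaged this way, anyone holding the file can enumerate its contents without asking a server.

```python
import io
import json
import zipfile


def build_package(resources: dict[str, bytes]) -> bytes:
    """Bundle every resource into a single ZIP archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical manifest: an explicit list of what the package contains.
        manifest = {"resources": sorted(resources)}
        archive.writestr("manifest.json", json.dumps(manifest))
        for name, data in resources.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def list_contents(package: bytes) -> list[str]:
    """Return the names stored in the package; no network access is involved."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return sorted(archive.namelist())


# Illustrative content only; a real publication would carry far more.
package = build_package({
    "chapter1.html": b"<html><body><p>Hello</p></body></html>",
    "style.css": b"p { margin: 0; }",
})
print(list_contents(package))
```

The resulting bytes can be copied, archived, or handed to a third party as one unit, which is exactly what a server-generated stream of responses cannot offer. A real EPUB file has its own required files and metadata that this sketch does not attempt to reproduce.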

But, ontology aside, Jani and Gus’s core question remains: will we still need such self-contained content entities in an increasingly cloud-based world?

Well, it sure looks like portable documents aren’t going away. In the market for consumer eBooks, reflowable formats are prevalent, primarily the open standard EPUB and its proprietary evil twin, Amazon Kindle’s .mobi format. Online browser-based viewing solutions are widely available, but have only a tiny proportion of eyeballs. And, for other digital publications and ad hoc document sharing by end users, PDF remains a hugely popular format. PDF support is built into major operating systems, including OS X and iOS, as well as Microsoft Office and thousands of other software programs. While there are certainly many times more web pages than PDF files published on the Web, there may well be more net total text in the PDFs. And many enterprise hard drives, and most of our personal ones, continue to hold more PDFs than web pages.

Motivations for this continued use of portable documents (and, more generally, content objects reified into interoperable files) come from two perspectives: the content producer (publisher) and the content consumer (end user).

From a publisher’s perspective, a universal goal is cost-effective publishing. A big part of this is the simple one-button “print to PDF”. But there’s also the reality that once the PDF is made, the publisher doesn’t have to worry about what system or device the recipient is using: it just works. A website that “just works” everywhere can certainly be made, but only at the expense of a visual presentation that is highly dumbed-down and likely to be a poor experience for many users. A reified content object in an interoperable format, unlike a website, can also be reliably delivered indirectly through distribution channels.

From the end user’s perspective, the advantage harkens back to the original “view anywhere” tagline of Acrobat. Certainly we aren’t yet at a point where the cloud can be depended on 24/7, particularly not for immersive reading. Immersive reading needs to be, well, immersive. Not being able to go to the next page because your Internet connection is down is the opposite of immersive. And, most consumers expect to be able to download and store content locally, particularly content that they’ve purchased. And if “owned” content is going to be stored in a cloud, consumers will want it to be the cloud of their choice, and the cloud to be an option not a required intermediary for consumption. Google Docs is the signature example of an online cloud-centric solution, but file upload and download are front-and-center features.

There’s no denying the trend towards the cloud. I’m sure that over time there will continue to be an increasing ability to conveniently publish directly to the cloud, as well as increasing acceptance by end users of cloud-based consumption. Perhaps someday the idea of a “file” will even become obsolete. But, at a minimum for many years to come – and possibly forever – it seems obvious that there will continue to be a significant role for reified content objects, particularly portable documents. This naturally leads to the question of whether PDF, the incumbent portable document format, can continue to fill that role indefinitely. That is the focus of part 2 of this series.
