Portable Documents for the Open Web (Part 2)

Why PDF is not the future of portable documents

Part 1 of this three-part series argued that there will be an enduring need for portable documents even in a world that’s evolving towards cloud-based content distribution and storage. OK fine, but we have PDF: aren’t we done? The blog post from Jani Patokallio that inspired this series suggested that “for your regular linear fiction novel, or even readable tomes of non-fiction, a no-frills PDF does the job just fine”. In this second part I take a hard look at PDF’s shortcomings as a generalized portable document format. These limitations inspired EPUB in the first place and are, in my opinion, fatal handicaps in the post-paper era. Is it crazy to imagine that a format as widely adopted as PDF could be relegated to legacy status? Read on and let me know what you think.

For over two decades, PDF has been the dominant file format for portable documents. PDF remains extremely popular, particularly for professional and technical books. O’Reilly Media earlier this year reported that it’s still their most popular download format. But the key point is that PDF’s share dropped from over 90% to 50% in a scant three years. And that’s for technical books with relatively complex layouts, often read on large-screen PCs. For publishers of fiction eBooks, PDF market share has declined to negligible levels. So the big question is, why did PDF market share fall off a cliff? For that matter, if “a no-frills PDF does the job just fine”, why is Jani’s employer, Lonely Planet, selling travel guides in EPUB and other formats?

Undoubtedly, the primary reason for the rapid fall-off in PDF market share for digital publications is lack of reflow. PDF documents are a sequence of final-form pages, typeset “at the factory”. While PDF stands for the “Portable Document Format”, it could more accurately be termed the “Portable Print Preview Format”. For many years, the Adobe group responsible for PDF was titled the “ePaper Business Unit”. By contrast, the formats that now represent the vast majority of eBook market share (EPUB and MOBI) are designed to be formatted dynamically, by default, at the point of consumption. This means that you can change font size on a tablet or PC screen to a comfortable level, change to “night mode” (white text on black screen), etc. None of which is doable with PDF. And on smartphones and other smaller-screen devices, PDFs aren’t even readable in the first place, other than via “pan and zoom hell”. As more reading moves to digital devices, and printing diminishes in importance, the need for reflow (for optimizing content for context) increasingly trumps being a faithful replica of paper.
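To make the reflow distinction concrete, here is a minimal sketch in Python. It is only an illustration, not how any real reading system works: the sample text, the function names `reflow` and `fixed_page`, and the column widths are all my own assumptions. The first function lays the text out for whatever width the device has; the second mimics a page typeset once “at the factory” and then viewed through a narrower screen.

```python
import textwrap

# Hypothetical sample text, used only for this illustration.
TEXT = "Reflowable formats lay out text at the point of consumption."


def reflow(text: str, columns: int) -> str:
    """Lay the text out for the width the reader's device actually has."""
    return textwrap.fill(text, width=columns)


def fixed_page(text: str, page_columns: int, screen_columns: int) -> str:
    """Typeset once at page_columns, then view through a narrower screen.

    Each line is simply cut off at screen_columns, a crude stand-in for
    having to pan sideways to see the rest of the line.
    """
    typeset = textwrap.wrap(text, width=page_columns)
    return "\n".join(line[:screen_columns] for line in typeset)


if __name__ == "__main__":
    print(reflow(TEXT, 20))          # every word visible, more lines
    print("---")
    print(fixed_page(TEXT, 80, 20))  # same page, clipped by the small screen
```

The reflowed version keeps every word on screen at any width, at the cost of the line breaks changing; the fixed page keeps its line breaks, at the cost of the words that fall outside the viewport.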

A related issue is that PDF documents are poorly accessible. In PDF, character glyphs are essentially spray-painted onto pages at (x,y) positions, without any clear relationship to reading order and not necessarily in any intelligible encoding. Accessibility of course relates to making content available to the blind or those with other reading disabilities, or who for whatever reason want a larger font size or to listen to an aural rendition (I hope you’re not reading while driving!). But accessible content is also reusable content: data, not just presentation. There is a means in PDF to add on accessibility information (“Tagged PDF”), but it’s complex, fragile, and ultimately somewhat of a hack (trying to tack structure onto final-form typeset pages is, fundamentally, a backwards approach). In practice very few PDF files contain enough structure to determine even basic reading order, and that includes most PDFs created via Adobe’s own software.
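The “spray-painted glyphs” problem can be sketched in a few lines of Python. This is a toy model under my own assumptions, not a PDF parser: the `Glyph` type, the `PAINTED` sample data, and both functions are hypothetical. It shows why software faced with untagged glyph positions has to guess at reading order, and why the guess is only a heuristic.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """One painted character (hypothetical model, not a real PDF structure)."""
    char: str
    x: float
    y: float  # y grows upward, as in PDF page coordinates


def naive_text(glyphs: List[Glyph]) -> str:
    """Read glyphs in the order they were painted onto the page."""
    return "".join(g.char for g in glyphs)


def guess_reading_order(glyphs: List[Glyph], line_tolerance: float = 2.0) -> str:
    """Heuristic: group glyphs into lines by y (top first), then sort by x.

    This assumes a single column of left-to-right text; it breaks down on
    multi-column layouts, sidebars, or rotated text.
    """
    lines: List[List[Glyph]] = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return "\n".join(
        "".join(g.char for g in sorted(line, key=lambda g: g.x))
        for line in lines
    )


# Two short lines, "Hi" above "ok", painted in scrambled order.
PAINTED = [
    Glyph("k", 20, 680),
    Glyph("H", 10, 700),
    Glyph("o", 10, 680),
    Glyph("i", 20, 700),
]

if __name__ == "__main__":
    print(naive_text(PAINTED))
    print(guess_reading_order(PAINTED))
```

A format that stores the text as structured markup never needs this guesswork, because the reading order is the document itself rather than something reconstructed from coordinates.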

The PDF format is also monolithic and complex, attributes that make it much less approachable. PDF’s specialized binary format entangles “packaging” with the content representation. Creating and manipulating PDF requires use of specialized, heavy-weight software libraries. It is, for all practical purposes, impossible to “hand code” PDF. PDF has hundreds of proprietary scripting APIs (with all the security exposure that entails), but not a true runtime DOM (Document Object Model). And after more than two decades of monotonic feature growth, PDF is not, in any sense, lean (the spec is 750 pages, not including supplements, scripting API documents, etc.).

Feature bloat is one problem (to which the Open Web is not entirely immune), but a more critical impediment is that PDF is a proprietary technology stack, not based on Open Web Standards. Some documents need rich media (audio, video, etc.), interactivity, forms, and other capabilities. To do this in PDF requires depending on specialized PDF-only technologies in these areas, utilizing a proprietary scripting language and APIs that are not well-supported other than in Adobe’s own Acrobat and Reader software. That means if you are using Preview on a Mac or iOS device, or other PDF reader software like Foxit on a PC, these features just won’t work. More fundamentally, it means you can’t leverage the skills, staff and technologies you already have for website development when working with PDF. For a number of years Adobe has pushed the concept of the “Interactive PDF”. To call the resulting adoption a “niche market” would be charitable: “failure” is more like it. The signature indictment of the PDF format has to be Adobe’s graft-on of a second proprietary forms format, “XFA” (additive to the original, also proprietary, “AcroForms”). AcroForms support is thin on the ground, but no PDF software other than Adobe’s own has ever bothered to support XFA, and not even Adobe’s own PDF readers support it across the board (it doesn’t work in Adobe’s Reader Mobile SDK or mobile Reader apps, for example).

None of this is very surprising: PDF is a single vendor’s solution, designed over 20 years ago, before the Internet and XML, based on the limitations of then-current computers, to enable its print-centric products. The opening sentence of the original PDF specification is telling: “PDF [is] the native file format of the Adobe Acrobat family of products” (that sentence dropped away in later versions of the spec). Adobe eventually opened up PDF as an ISO standard, but the perception is that it still belongs to Adobe, who still brand it as “Adobe PDF”. Even the PDF file icon is licensed for use only for files created by an Adobe product. And ISO 32000 is essentially a “rubber stamp” standard: no substantive changes from Adobe’s spec, and hundreds of pages that describe features that are only implemented by Adobe. A single-vendor solution is dependent on that vendor’s ongoing support and, more fundamentally, just can’t keep up with the rapid pace of evolution of the Open Web, which is fueled by the virtuous circle of competing browsers and an ecosystem of developers and solution providers many orders of magnitude larger. As evidenced by the failure of Adobe Flash and Microsoft Silverlight, the Web has become the universal experience delivery platform. And since it’s clear Flash and Silverlight can’t compete with the Open Web, it’s a stretch to imagine that a revived “Interactive PDF” could.

I don’t mean to suggest that PDF has no utility or is going the way of the dinosaur. PDF has ably filled the niche of “ePaper”, and remains one of the most widely adopted file formats ever. I’m proud to have contributed to its development. And for the specialized but very important use case of print production workflows, where faithfully replicating paper is a given and the other issues less pertinent, PDF’s dominant market share will likely continue for the foreseeable future. But when it comes to immersive digital reading experiences, with print fidelity increasingly secondary to N-screen support, it’s another story, as evidenced by the rapid rise of EPUB as the primary eBook format. And the increasing need for rich media, interactivity, and accessibility should only accelerate migration to EPUB, given the Web Standards foundation of the latest version, EPUB 3. That’s the topic of Part 3, the conclusion of this three-part series.
