Portable Documents for the Open Web (Part 3)

The first two parts of this three-part series covered the enduring need for portable documents and why PDF’s fundamental architecture is too dated and too limited to fill this need. In this final part, we’ll take a look at EPUB, the format that has rapidly emerged as the open standard for eBooks. It’s my contention that EPUB, not PDF, represents the future of portable documents in our increasingly Web-based world. Why? In short, EPUB addresses all the key limitations of PDF. EPUB is reflowable, accessible, modular (with packaging and content cleanly separated), and based on HTML5 and related Web Standards. It’s a truly open format, developed in a collaborative process to meet global requirements rather than by a single vendor to support its proprietary products. Let’s take a harder look at these points.

EPUB 3 brings the Open Web to portable documents

EPUB 3, the latest version of EPUB, fully embraces HTML5 and related modern Web Standards. Previous versions of EPUB were based on specific subset profiles of HTML and CSS, in effect “borrowing” from these standards, and as a result they were effectively frozen in time. EPUB 2.0.1, completed in 2010, adopts modules from XHTML 1.1, which was completed in 2001 and is based on the Web and browsers of the mid-1990s.

During the development of EPUB 3 we made a key decision to tightly align with HTML5, SVG, CSS 3, and related modern Web Standards. Rather than defining “frozen” profiles of these standards, the general approach of EPUB 3 is to normatively reference the relevant standards in their entirety. This means that if it’s legal HTML5, it’s legal EPUB 3, period. And as HTML5 evolves, EPUB 3 is committed to evolving with it. This decision makes EPUB 3 much less a distinct format and much more a portable-document packaging of Web content, in the sense of “portable document” defined earlier in this series.

As a portable document, an .epub file has within it all of the assets needed to render the publication – content, stylesheets, media, scripts, fonts – in a well-defined, structured, interoperable manner. You might say that an .epub file is a website “in a box”, one that’s been “domesticated” so that it can be distributed through channels and used online as well as offline. You can put any website and its dependent assets in a ZIP file, but what makes it EPUB is that the content is well-defined: it has a logical reading order, navigable structure, and metadata. You can do more than just toss it to a browser and execute it to see what happens: you can process it as data.
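To make “process it as data” concrete, here is a minimal sketch in Python, using only the standard library. It builds a tiny, deliberately incomplete sample package in memory (the file names, title, and identifier are hypothetical, and a real publication would also include the content documents themselves), then reads back the title and reading order without ever rendering anything. The container.xml entry point, the package document, and the manifest and spine elements follow the EPUB packaging specifications; the function names are my own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container and package documents.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample data: just enough of a package to illustrate the idea.
CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

PACKAGE = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">urn:uuid:hypothetical-sample</dc:identifier>
    <dc:title>Sample Publication</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch2" href="ch2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>"""


def build_sample_epub() -> bytes:
    """Write the sample package into an in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        # The mimetype entry comes first and is stored uncompressed.
        z.writestr("mimetype", "application/epub+zip",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER)
        z.writestr("OEBPS/package.opf", PACKAGE)
    return buf.getvalue()


def describe_epub(data: bytes) -> dict:
    """Return the package path, title, and reading order of an EPUB."""
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        # container.xml points at the package document.
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(z.read(opf_path))
    title = package.findtext("opf:metadata/dc:title", namespaces=NS)
    # The manifest maps ids to files; the spine lists ids in reading order.
    hrefs = {item.get("id"): item.get("href")
             for item in package.findall("opf:manifest/opf:item", NS)}
    reading_order = [hrefs[ref.get("idref")]
                     for ref in package.findall("opf:spine/opf:itemref", NS)]
    return {"package": opf_path, "title": title, "reading_order": reading_order}


if __name__ == "__main__":
    print(describe_epub(build_sample_epub()))
```

The point of the sketch is that nothing here depends on a browser or a rendering engine: the reading order and metadata are declared up front, so any tool that can read ZIP and XML can inspect them.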

EPUB also leverages the universality of Web Standards and open formats like ZIP and XML. With EPUB you can utilize Web development skills and technologies for things like interactivity, forms, and rich media. You get a real runtime DOM (Document Object Model) and associated standard APIs. Critically, you don’t add to a user’s security exposure beyond what they already get with a web browser: the browser stack is heavily scrutinized by both vendors and third parties, and exploits are rapidly patched.

And because EPUB 3 now supports fixed-layout as well as reflowable content, you can represent in EPUB a superset of the range of content that can be represented in PDF, including content that requires precision in display and printing. And with SVG being in effect an XML representation of the PostScript/PDF imaging model, moving from PDF to EPUB can be lossless. As hardware-accelerated SVG proliferates in modern browsers, you can expect to see much more seamless PDF-to-EPUB conversion tools, and ultimately print-driver support à la Acrobat for EPUB generation from any type of document.

Of course the specifics of EPUB 3 are not the only conceivable way one might package HTML5 and related content into a portable document format. Several proprietary HTML5-based solutions have sprouted up in parallel with EPUB 3, including Inkling’s S9ML and the format of Apple iBooks Author. But market pressures tend to create convergence to a single solution for key standards, particularly when interoperability is involved. There were originally multiple competitors for PostScript and PDF, but each rose to become the unchallenged technology standard of its era. In the HTML5 era, it seems clear that a single portable document packaging of Web Content will maximize interoperability, and that an open standard, if adequately supported, is likely to prevail over proprietary solutions.

And EPUB 3 has another key advantage over both PDF and proprietary alternatives: accessibility. EPUB 3 has been designed in close collaboration with the DAISY Consortium to ensure that requirements for accessibility to the blind and others with print disabilities become part of the mainstream digital publication format. PDF is notoriously difficult to make fully accessible, whereas the reflow-centric EPUB is in the process of being mandated as a standard format for educational institutions, governments, and others. Arbitrary HTML5 websites and proprietary alternatives are very unlikely to match EPUB 3’s array of accessibility capabilities, and they won’t have the critical mass of accessibility stakeholders who are converging on EPUB 3 as the means to make accessibility part of mainstream digital publications.

As a side note, stepping up to be the key format for accessible publications was a primary reason for the expanded scope of EPUB to cover all kinds of publications. As EPUB 3 was being developed there was some debate about whether EPUB should focus only on eBooks, or even more narrowly only on text-centric “trade” eBooks (where EPUB’s original use cases were centered). But the accessibility community needs all documents to be accessible, and it was clear that this implied EPUB stepping up to eventually become the global standard horizontally across the many market segments of digital publishing.

Last but not least, EPUB has from its origins been developed in a collaborative open process. The IDPF is a democratically governed, member-driven organization with over 350 members from 36 different countries. That means that EPUB is designed to meet a broad set of digital publication use cases and requirements, not to enable one vendor’s proprietary products. As EPUB’s use cases and applicability have expanded, so has IDPF’s membership, which now has a majority of members from outside North America and is gaining members focused on corporate publishing, magazine and comic publishing, and other parts of the publishing universe beyond book publishing. IDPF has also forged a close partnership with the organization responsible for broader Web Standards, the W3C, which enables us to influence the development of these broader standards and the browsers that implement them.

EPUB and the Semantic Web

I’ve outlined why there will be an ongoing need for portable documents, and presented an argument for why EPUB is becoming the next-generation portable document format for the Web, not just for eBooks but on track to take over many use cases presently fulfilled by PDF. Some might consider that a sufficiently big vision, but actually I see EPUB’s utility as transcending even the need for portable document packaging. And, let’s face it, the need for representing content and documents as concrete files may indeed ultimately fade away, even if it’s going to take long enough that it won’t be my own kids asking “Hey Dad, what’s a file?”

To me, what makes EPUB special is not that it is packaged into a single ZIP-based file, but that it enables structured, metadata-enhanced content that can be created and manipulated reliably with automated tools and distributed through multiple channels. It ensures that Web content is declarative data – that can be presented in different ways, sliced and diced, and reused – rather than programmatic spaghetti that can only be rendered to see what happens.

Even if we someday all use cloud-based services, and never need to download monolithic “.epub” files, I’m convinced that a declarative approach to complex document data will remain valuable – even if the only use cases are around syndicating that document data across cloud-based services.

Fundamentally this is the vision of the Semantic Web. One might argue that the W3C’s explicit focus in recent years on Web Applications, while necessary to stay relevant and support very real needs in the broader IT arena, has come at the expense of the needs of documents and content as data. A Web browser is not just a virtual machine for JavaScript. EPUB in effect takes the Wild, Wild Web and tames it. EPUB, for example, requires use of the XML serialization of HTML5 (XHTML5), rather than “Tag Soup” aka “Street” HTML. This means that EPUB content, unlike arbitrary web pages, can be reliably created and manipulated with XML tool chains. EPUB also defines Reading System conformance more tightly than HTML5 does for browser User Agents, pinning down things that are under-specified in the union of W3C standards. For example, conforming EPUB 3 Reading Systems are required to support both the OpenType and WOFF font formats, as well as MathML and SVG. The result is, in theory, a much higher degree of reliability of content and interoperability across Reading Systems, including temporally (i.e., content created today will continue to work into the future). This is something that PDF pretty much nailed; the Web, not so much.
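As a rough illustration of what requiring the XML serialization buys you, the sketch below uses Python’s standard XML parser as a stand-in for an “XML tool chain”. Both markup samples and both function names are hypothetical, made up for this example. The well-formed XHTML parses and can be queried for its headings; the tag-soup version of similar content is rejected by the same parser, so a generic XML tool could not touch it without a browser-style error-recovering HTML parser in front.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a well-formed XHTML content document.
WELL_FORMED = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body>
    <section>
      <h1>Portable Documents</h1>
      <p>First paragraph.<br/>Second line.</p>
      <h2>Why EPUB</h2>
    </section>
  </body>
</html>"""

# Hypothetical sample: similar content as "tag soup" with unclosed elements.
TAG_SOUP = ("<html><body><h1>Portable Documents"
            "<p>First paragraph.<br>Second line.</body></html>")


def list_headings(markup: str) -> list:
    """Return (level, text) pairs for h1-h3 elements, in document order."""
    root = ET.fromstring(markup)
    headings = []
    for el in root.iter():
        # ElementTree reports tags as "{namespace}localname".
        local = el.tag.split("}")[-1]
        if local in ("h1", "h2", "h3"):
            headings.append((local, "".join(el.itertext()).strip()))
    return headings


def is_well_formed(markup: str) -> bool:
    """True if a plain XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(list_headings(WELL_FORMED))
    print(is_well_formed(TAG_SOUP))
```

A browser would happily display both samples, which is exactly the distinction: rendering tolerates tag soup, while generic data processing benefits from the stricter contract.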

In effect EPUB is both a distribution and syndication format. If down the road we don’t have to distribute downloadable files any more, then it would still have a useful (if not necessarily central) role to play as a reliable, structured syndication format. Even within a self-contained web property, EPUB could play a useful role as a well-defined profile of HTML5-based content. The surrounding app might be Web-app spaghetti, designed to work on today’s browsers, but for representing rich content assets, EPUB provides a well-defined “contract”.

EPUB in the Real World

The core mission of the IDPF, the organization responsible for EPUB, is to establish a global, interoperable, accessible standard for eBooks and other publications to help foster a growing digital publishing industry.

In this article I’ve presented a personal vision for why we’ll still need portable documents for the foreseeable future, why EPUB is becoming the next-generation portable document format based on HTML5 and the Open Web, and why ultimately EPUB, as a way to think of Web content as declarative data, can be viewed as an important part of the broader Semantic Web.

But it’s clear we’re not yet sipping mai tais on a sandy beach of universal platform harmony. Circa August 2012, the industry is smack in the middle of a painful transition from EPUB 2, which was based on a 10-year-old version of HTML and limited to text-centric content, to the much more capable EPUB 3, which is based on the latest Web Standards. While several eBook reading system vendors (notably Apple, Kobo, and VitalSource) have substantially delivered on EPUB 3 support, and others (including Sony, B&N and Google) have publicly endorsed EPUB 3, we certainly aren’t yet all the way there. And being based on the latest Web Standards – many of which are not finalized, and some of which are even in danger of forking – has its downsides: they don’t call it “bleeding-edge” for nothing! And EPUB 3.0 is not the end of the road: a number of new features need to be developed to fully realize the broader goals. So in the very short term it may well look like EPUB support is getting less, rather than more, consistent, similar to when the first HTML5-based browsers came to market.

But, I’m personally convinced the migration to EPUB 3 will ultimately yield a higher level of conformance, just as all modern browsers now support HTML5 and exhibit a higher degree of overall conformance to standards than in the bad old days of browser-specific websites. And, while some publishers might be happy to stick with proprietary platforms, and others might find plain websites sufficient, the benefits of an open, interoperable, global, accessible platform should ultimately lead to the largest set of tools and services for publishers, and the most consumption choices for consumers. Ultimately these things are what matter: a format is just an enabler for a larger ecosystem.

To make this ecosystem thrive we need to successfully navigate the current transition, while continuing to evolve EPUB to meet global requirements across the publishing industry. The IDPF is an inclusive organization and welcomes the support of organizations large and small in advancing our mission. There are also a number of related open source initiatives, including an EPUB validator, EPUB 3 samples, and even an open source reading system. So if you agree with me that a universal, accessible digital publication format based on the Open Web makes sense, I hope you’ll consider supporting IDPF and its activities to help make sure it happens!
